Introduction to the Probabilistic Data Structure
Based on different properties such as speed, cost, and ease of use(as a developer), etc. the below information represents different ways of storing stuff in the computer machine.
It means memory is faster than SSD than HDD than Tape and the same goes with cost and ease of use as a developer.
Storage and its limitations
Now let’s discuss the scenario with the context of the developer. If we want to store some stuff in memory then we may use Set(of course one can use other in-memory data structure as well like Arrays, List, Map, etc) and if we want to store some data on SSD then we may use something like a relational database or elastic search. Similarly for a hard drive(HDD) we can use Hadoop(HDFS).
Now suppose we want to store data in memory using deterministic in-memory data structure but the problem is the amount of memory we have on servers in terms of GB or TB for memory is less than SSD and SSD might have memory lesser than a hard drive(HDD), and also one should remember than deterministic data structure is good and popular to use but these data structures are not efficient in term of consuming memory.
HDD<-------SSD<-------Memory //Storage per node
Now the question is how can we do more stuff at the memory side, with less amount of memory consumption?
HDD-------SSD-------Memory ^ | How can we do more stuff here?
Thus this is the place where probabilistic data structure comes into the picture which can do almost the same job as a deterministic data structure but with a lot less memory.
Deterministic Vs Probabilistic Data Structure
Being an IT professional, we might have come across many deterministic data structures such as Array, List, Set, HashTable, HashSet, etc. These in-memory data structures are the most typical data structures on which various operations such as insert, find and delete could be performed with specific key values. As a result of operation what we get is the deterministic(accurate) result. But this is not in the case of a probabilistic data structure, Here the result of operation could be probabilistic(may not give you a definite answer, always results in approximate), and hence named as a probabilistic data structure. We will see and prove this in the coming sections. But for now let’s dig into more detail of its definition, types, and uses.
How does it work?
Probabilistic data structure works with large data set, where we want to perform some operations such as finding some unique items in given data set or it could be finding the most frequent item or if some items exist or not. To do such an operation probabilistic data structure uses more and more hash functions to randomize and represent a set of data.
The more number of hash function the more accurate result.
Things to remember
A deterministic data structure can also perform all the operations that a probabilistic data structure does but only with low data sets. As stated earlier, if the data set is too big and couldn’t fit into the memory, then the deterministic data structure fails and is simply not feasible. Also in case of a streaming application where data is required to be processed in one go and perform incremental updates, it is very difficult to manage with the deterministic data structure.
- Analyze big data set
- Statistical analysis
- Mining tera-bytes of data sets, etc
Popular probabilistic data structures
- Bloom filter
- Count-Min Sketch