Open In App

Bloom Filter in Java with Examples

Bloom filters are for set membership which determines whether an element is present in a set or not. Bloom filter was invented by Burton H. Bloom in 1970 in a paper called Space/Time Trade-offs in Hash Coding with Allowable Errors (1970). Bloom filter is a probabilistic data structure that works on hash-coding methods (similar to HashTable).

When do we need a Bloom Filter?
Consider any of the following situations:



  1. Suppose we have a list of some elements and we want to check whether a given element is present or not?
  2. Consider you are working on email service and you are trying to implement sign up endpoint with a feature that a given username is already present or not?
  3. Suppose you have given a set of blacklisted IP’s and you want to filter out a given IP is a blacklisted one or not?

Can these problem be solved without the help of Bloom Filter?

Let us try to solve these problem using a HashSet




import java.util.HashSet;
import java.util.Set;
  
public class SetDemo {
    public static void main(String[] args)
    {
        Set<String> blackListedIPs
            = new HashSet<>();
        blackListedIPs.add("192.170.0.1");
        blackListedIPs.add("75.245.10.1");
        blackListedIPs.add("10.125.22.20");
  
        // true
        System.out.println(
            blackListedIPs
                .contains(
                    "75.245.10.1"));
  
        // false
        System.out.println(
            blackListedIPs
                .contains(
                    "101.125.20.22"));
    }
}

Output:

true
false

Why does data structure like HashSet or HashTable fail?
HashSet or HashTable works well when we have limited data set, but might not fit as we move with a large data set. With a large data set, it takes a lot of time with a lot of memory.

Size of Data set vs insertion time for HashSet like data structure

----------------------------------------------
|Number of UUIDs          Insertion Time(ms) |
----------------------------------------------
|10                         <1               |
|100                         3               |
|1, 000                      58              |
|10, 000                     122             |
|100, 000                    836             |
|1, 000, 000                 7395            |
----------------------------------------------

Size of Data set vs memory (JVM Heap) for HashSet like data structure

----------------------------------------------
|Number of UUIDs            JVM heap used(MB) |
----------------------------------------------
|10                         <2                |   
|100                        <2                |
|1, 000                      3                |
|10, 000                     9                |
|100, 000                    37               |
|1, 000, 000                 264              |
-----------------------------------------------

So it is clear that if we have a large set of data then a normal data structure like the Set or HashTable is not feasible, and here Bloom filters come into the picture. Refer this article for more details on comparison between the two: Difference between Bloom filters and Hashtable

How to solve these problems with the help of Bloom Filter?
Let’s take a bit array of size N (Here 24) and initialize each bit with binary zero, Now take some hash functions (You can take as many you want, we are taking two hash function here for our illustration).

Why is Bloom Filter a probabilistic data structure?
Let’s understand this with one more test, This time consider an IP 101.125.20.22 and check whether it is present in the set or not. Pass this to both hash function. Consider our hash function results as follows.

hashFunction_1(101.125.20.22) : 19 
hashFunction_2(101.125.20.22) : 2

Now, visit the index 19 and 2 which is set to 1 and it says that the given IP101.125.20.22 is present in the set.

But, this IP 101.125.20.22 has bot been processed above in the data set while adding the IP’s to bit array. This is known as False Positive:

Expected Output: No
Actual Output: Yes (False Positive)

In this case, index 2 and 19 were set to 1 by other input and not by this IP 101.125.20.22. This is called collision and that’s why it is probabilistic, where chances of happening are not 100%.

What to expect from a Bloom filter?

  1. When a Bloom filter says an element is not present it is for sure not present. It guarantees 100% that the given element is not available in the set, because either of the bit of index given by hash functions will be set to 0.
  2. But when Bloom filter says the given element is present it is not 100% sure, because there may be a chance due to collision all the bit of index given by hash functions has been set to 1 by other inputs.

How to get 100% accurate result from a Bloom filter?
Well, this could be achieved only by taking more number of hash functions. The more number of the hash function we take, the more accurate result we get, because of lesser chances of a collision.

Time and Space complexity of a Bloom filter
Suppose we have around 40 million data sets and we are using around H hash functions, then:

Time complexity: O(H), where H is the number of hash functions used
Space complexity: 159 Mb (For 40 million data sets)
Case of False positive: 1 mistake per 10 million (for H = 23)

Implementing Bloom filter in Java using Guava Library:
We can implement the Bloom filter using Java library provided by Guava.

  1. Include the below maven dependency:




    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>19.0</version>
    </dependency>
    
    
  2. Write the following code to implement the Bloom Filter:




    // Java program to implement
    // Bloom Filter using Guava Library
      
    import java.nio.charset.Charset;
    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
      
    public class BloomFilterDemo {
        public static void main(String[] args)
        {
      
            // Create a Bloom Filter instance
            BloomFilter<String> blackListedIps
                = BloomFilter.create(
                    Funnels.stringFunnel(
                        Charset.forName("UTF-8")),
                    10000);
      
            // Add the data sets
            blackListedIps.put("192.170.0.1");
            blackListedIps.put("75.245.10.1");
            blackListedIps.put("10.125.22.20");
      
            // Test the bloom filter
            System.out.println(
                blackListedIps
                    .mightContain(
                        "75.245.10.1"));
            System.out.println(
                blackListedIps
                    .mightContain(
                        "101.125.20.22"));
        }
    }
    
    

    Output:

    Bloom Filter Output

    Note: The above Java code may return a 3% false-positive probability by default.

  3. Reduce the false-positive probability
    Introduce another parameter in Bloom filter object creation as follows:

    BloomFilter blackListedIps = BloomFilter.create(Funnels.stringFunnel(Charset.forName("UTF-8")), 10000, 0.005);

    Now false-positive probability has been reduced from 0.03 to 0.005. But tweaking this parameter has an effect on the side of the bloom filter.

Effect of reducing the false positive probability:
Let’s analyze this effect with respect to the hash function, array bit, time complexity and space complexity.

Conclusion:

Therefore it can be said that Bloom filter is a good choice in a situation where we have to process large data set with low memory consumption. Also, the more accurate result we want, the number of hash functions has to be increased.


Article Tags :