Bloom Filter

Keywords: Java Google Redis

I. Overview

Simply put, a Bloom filter is a way to test whether an element is in a collection. In Java you would usually use a container such as a Map or Set for this. When the amount of data is very large, however, a Map or Set takes up too much memory, and that is when a Bloom filter is worth considering.

II. Detailed Explanation

To create a Bloom filter, you first declare a bit array in memory. Assume the array has length L and every bit is initially 0.

When putting a key into the Bloom filter, the key is hashed N times, and each hash value is taken modulo L to obtain N positions (indexes) in the bit array. The bits at those positions are then all set to 1. The choice of L and N depends on the estimated total number of keys and the acceptable error rate, because a Bloom filter does not guarantee 100% accuracy, as we will see later.

When checking whether a key exists, the key is hashed N times and reduced modulo L in the same way. If the bits at all N positions are 1, the key may exist. Note that this only means "may exist", not "does exist".
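
As a rough illustration of how L and N can be derived from the estimated key count and the error rate, the sketch below uses the standard textbook approximations L = -n * ln(p) / (ln 2)^2 and N = (L / n) * ln 2, where n is the expected number of keys and p is the target false-positive rate. The class and method names are hypothetical and chosen only for this example.

```java
class BloomSizing {

    /** Approximate bit-array length L for the expected key count and target error rate. */
    static long optimalLength(long expectedKeys, double errorRate) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedKeys * Math.log(errorRate) / (ln2 * ln2));
    }

    /** Approximate number of hash functions N for a bit array of the given length. */
    static int optimalHashCount(long expectedKeys, long length) {
        long rounded = Math.round((double) length / expectedKeys * Math.log(2));
        return (int) Math.max(1, rounded);
    }

    public static void main(String[] args) {
        long expectedKeys = 10_000_000L; // assumed workload, for illustration only
        double errorRate = 0.01;         // assumed target: 1% false positives

        long length = optimalLength(expectedKeys, errorRate);
        int hashes = optimalHashCount(expectedKeys, length);
        System.out.println("L = " + length + " bits, N = " + hashes);
    }
}
```

For 10 million keys and a 1% error rate this works out to roughly 9.6 bits per key and 7 hash functions.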

The process works as follows (assuming L is 10 and N is 3).

Assume:

"zhangsan": the three hash-modulo results are 0, 2, 4.

"lisi": the three hash-modulo results are 4, 6, 8.

"wangwu": the three hash-modulo results are 2, 4, 6.

If the two keys "zhangsan" and "lisi" have already been added, then even though "wangwu" was never added, the algorithm reports that it exists, because its three positions (2, 4, 6) have already been set by "zhangsan" and "lisi".
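
The walkthrough above can be reproduced with a 10-bit BitSet. This is only a sketch: the positions are the assumed hash results from the example, passed in directly instead of being computed by real hash functions, and the class and method names are hypothetical.

```java
import java.util.BitSet;

class FalsePositiveDemo {

    /** Set the bit at every position of a key. */
    static void put(BitSet bits, int... positions) {
        for (int position : positions) {
            bits.set(position);
        }
    }

    /** A key may exist only if every one of its positions is set. */
    static boolean mightContain(BitSet bits, int... positions) {
        for (int position : positions) {
            if (!bits.get(position)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(10); // L = 10

        put(bits, 0, 2, 4); // "zhangsan"
        put(bits, 4, 6, 8); // "lisi"

        // "wangwu" was never added, but bits 2, 4 and 6 are already set.
        System.out.println("wangwu may exist: " + mightContain(bits, 2, 4, 6)); // prints true
    }
}
```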

In summary, a Bloom filter has the following characteristics:

  1. If the algorithm reports that a key does not exist, the key definitely does not exist at that moment.
  2. If the algorithm reports that a key exists, it only means the key may exist.
  3. Keys cannot be deleted from a Bloom filter. Because bits are shared between keys, clearing the bits of one key can affect other keys.
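
To see why deletion is unsafe, the sketch below reuses the assumed positions from the example ("zhangsan" at 0, 2, 4 and "lisi" at 4, 6, 8) and naively clears the bits of one key. The naiveDelete method is hypothetical and exists only to show the failure; it is not something a Bloom filter should offer.

```java
import java.util.BitSet;

class DeletionDemo {

    static void put(BitSet bits, int... positions) {
        for (int position : positions) {
            bits.set(position);
        }
    }

    /** Hypothetical "delete": clears every bit of the key, including shared ones. */
    static void naiveDelete(BitSet bits, int... positions) {
        for (int position : positions) {
            bits.clear(position);
        }
    }

    static boolean mightContain(BitSet bits, int... positions) {
        for (int position : positions) {
            if (!bits.get(position)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(10);
        int[] zhangsan = {0, 2, 4};
        int[] lisi = {4, 6, 8};

        put(bits, zhangsan);
        put(bits, lisi);
        naiveDelete(bits, zhangsan); // also clears bit 4, which "lisi" shares

        // "lisi" is still supposed to be present, yet it is now reported as absent.
        System.out.println("lisi may exist: " + mightContain(bits, lisi)); // prints false
    }
}
```

Clearing the shared bit turns a key that was really added into a false negative, which breaks the first guarantee in the list above.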

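So how can the accuracy of the algorithm be improved?

  1. Increase the number of hash functions (a CPU versus accuracy tradeoff). This only helps up to a point: every extra hash also sets more bits, so too many hashes for a given array length push the error rate back up.
  2. Increase the length of the bit array (a memory versus accuracy tradeoff).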
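
Both tradeoffs can be explored numerically with the usual approximation for the false-positive rate, p = (1 - e^(-N * n / L))^N, where n is the number of keys already added. The sketch below is only an estimate under the assumption of well-distributed, independent hash functions; the class name and the sample sizes are hypothetical.

```java
class FalsePositiveRate {

    /** Estimated false-positive rate for a bit array of the given length, hash count and key count. */
    static double estimate(long length, int hashes, long keys) {
        double fractionOfBitsSet = 1 - Math.exp(-(double) hashes * keys / length);
        return Math.pow(fractionOfBitsSet, hashes);
    }

    public static void main(String[] args) {
        long keys = 10_000_000L; // assumed number of stored keys

        // More hash functions, same array length.
        for (int hashes : new int[] {1, 3, 7, 20}) {
            System.out.printf("L=100000000 N=%d -> %.5f%n", hashes, estimate(100_000_000L, hashes, keys));
        }

        // Longer array, same number of hash functions.
        for (long length : new long[] {50_000_000L, 100_000_000L, 200_000_000L}) {
            System.out.printf("L=%d N=3 -> %.5f%n", length, estimate(length, 3, keys));
        }
    }
}
```

With 10 bits per key the estimate improves when going from 3 to 7 hash functions and gets worse again at 20, while doubling the array length lowers it at any fixed hash count.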

III. Implementation

  1. Write your own Java implementation
    package com.ikuboo.bloomfilter;
    
    import java.util.BitSet;
    
    /**
     * Bloom filter
     */
    public class MyBloomFilter {
    
        private int length;
    
        /**
         * Bit array that backs the filter
         */
        private BitSet bitSet;
    
        public MyBloomFilter(int length) {
            this.length = length;
            this.bitSet = new BitSet(length);
        }
    
        /**
         * Add a key: set the bit at each of its three hash positions
         */
        public void put(String key) {
            int first = hashcode_1(key);
            int second = hashcode_2(key);
            int third = hashcode_3(key);
    
            bitSet.set(first % length);
            bitSet.set(second % length);
            bitSet.set(third % length);
        }
    
        /**
         * Check whether a key may exist
         *
         * @param key the key to look up
         * @return true if the key may exist, false if it definitely does not
         */
        public boolean exist(String key) {
            int first = hashcode_1(key);
            int second = hashcode_2(key);
            int third = hashcode_3(key);
    
            boolean firstIndex = bitSet.get(first % length);
            if (!firstIndex) {
                return false;
            }
            boolean secondIndex = bitSet.get(second % length);
            if (!secondIndex) {
                return false;
            }
            boolean thirdIndex = bitSet.get(third % length);
            if (!thirdIndex) {
                return false;
            }
            return true;
        }
    
        /**
         * hash Algorithm 1
         */
        private int hashcode_1(String key) {
            int hash = 0;
            int i;
            for (i = 0; i < key.length(); ++i) {
                hash = 33 * hash + key.charAt(i);
            }
            // Clear the sign bit; Math.abs(Integer.MIN_VALUE) is still negative.
            return hash & Integer.MAX_VALUE;
        }
    
        /**
         * hash Algorithm 2
         */
        private int hashcode_2(String data) {
            final int p = 16777619;
            int hash = (int) 2166136261L;
            for (int i = 0; i < data.length(); i++) {
                hash = (hash ^ data.charAt(i)) * p;
            }
            hash += hash << 13;
            hash ^= hash >> 7;
            hash += hash << 3;
            hash ^= hash >> 17;
            hash += hash << 5;
            // Clear the sign bit; Math.abs(Integer.MIN_VALUE) is still negative.
            return hash & Integer.MAX_VALUE;
        }
    
        /**
         * hash Algorithm 3
         */
        private int hashcode_3(String key) {
            int hash, i;
            for (hash = 0, i = 0; i < key.length(); ++i) {
                hash += key.charAt(i);
                hash += (hash << 10);
                hash ^= (hash >> 6);
            }
            hash += (hash << 3);
            hash ^= (hash >> 11);
            hash += (hash << 15);
            // Clear the sign bit; Math.abs(Integer.MIN_VALUE) is still negative.
            return hash & Integer.MAX_VALUE;
        }
    
    }
    

    Test code

    public class TestMyBloomFilter {
        public static void main(String[] args) {
            int capacity = 10000000;
            MyBloomFilter bloomFilters = new MyBloomFilter(capacity);
            bloomFilters.put("key1");
    
            System.out.println("Does key1 exist? " + bloomFilters.exist("key1"));
            System.out.println("Does key2 exist? " + bloomFilters.exist("key2"));
        }
    }

  2. Implementation with the Guava library
    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    
    import java.nio.charset.StandardCharsets;
    
    public class TestGuavaBloomFilter {
        public static void main(String[] args) {
            // Expected number of keys
            int capacity = 10000000;
    
            // Size the filter for the expected keys and a 1% false-positive rate
            BloomFilter<String> bloomFilters = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8), capacity, 0.01);
    
            bloomFilters.put("key1");
    
            System.out.println("Might key1 exist? " + bloomFilters.mightContain("key1"));
            System.out.println("Might key2 exist? " + bloomFilters.mightContain("key2"));
        }
    }

  3. Implementation with a Redis bitmap (TODO)

IV. Applicable Business Scenarios

  1. Preventing Cache Penetration
  2. Deduplication, idempotent processing, etc.
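
For the cache penetration case, the idea is to consult the Bloom filter before touching the cache or the database, so that keys that definitely do not exist are rejected early. The sketch below is a simplified, single-threaded illustration: the GuardedCache class, its fields, and the Predicate standing in for the filter's membership check (for example the exist method above or Guava's mightContain) are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

class GuardedCache {

    private final Map<String, String> cache = new HashMap<>();
    private final Predicate<String> mightContain;   // stand-in for a Bloom filter lookup
    private final Function<String, String> database; // stand-in for the slow backing store
    int databaseCalls = 0;

    GuardedCache(Predicate<String> mightContain, Function<String, String> database) {
        this.mightContain = mightContain;
        this.database = database;
    }

    String get(String key) {
        // The filter says "definitely absent": skip both the cache and the database.
        if (!mightContain.test(key)) {
            return null;
        }
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        // "May exist": only now pay for the database lookup.
        databaseCalls++;
        String value = database.apply(key);
        if (value != null) {
            cache.put(key, value);
        }
        return value;
    }

    public static void main(String[] args) {
        // A plain Set plays the role of the filter here to keep the sketch self-contained.
        Set<String> knownKeys = new HashSet<>();
        knownKeys.add("key1");

        GuardedCache guarded = new GuardedCache(knownKeys::contains, key -> "value-of-" + key);
        System.out.println(guarded.get("key1"));
        System.out.println(guarded.get("key2"));
        System.out.println("database calls: " + guarded.databaseCalls);
    }
}
```

Because a real Bloom filter can return false positives, the database lookup must still handle keys that turn out not to exist; the filter only reduces how often that happens.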


Posted by hmogan on Mon, 23 Sep 2019 04:17:13 -0700