1. Concept of the Bloom filter
A Bloom filter is a data structure built on a binary vector (bit array), proposed by Burton Howard Bloom in 1970. It is very efficient in both space and time. It is used to test whether an element is a member of a set, and its answer is either "possibly present" or "definitely not present". If the test result is yes, the element is not necessarily in the set (a false positive is possible); if the test result is no, the element is definitely not in the set. In other words, a Bloom filter never produces false negatives, so it has a 100% recall rate.
2. Application scenarios of the Bloom filter
- Spam filtering
- Preventing cache penetration
- Bitcoin transaction query
- URL filtering for Crawlers
- IP blacklist
- Query acceleration (for example, for data stored in a key-value structure)
- Detecting duplicate elements in a set
3. Working principle of the Bloom filter
The core of a Bloom filter is a very large bit array and several hash functions. Suppose the length of the bit array is m and the number of hash functions is k.
Take k = 3 as an example. Suppose a set contains three elements x, y and z. Each element is mapped by the three hash functions to three positions in the bit array, and those bits are set to 1. To test whether w is in the set, we map it with the same three hash functions; if the resulting bits are not all 1, then w is definitely not in the set.
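The x, y, z and w example can be sketched in a few lines of Go. This is only a toy illustration: the 32-bit array, the `hashWith` function and the seeds 7, 11 and 13 are assumptions chosen to keep the numbers small, not hash functions anyone should use in practice.

```go
package main

import "fmt"

const m = 32 // length of the bit array (toy size, for illustration only)

// seeds parameterize three toy hash functions (arbitrary illustrative values).
var seeds = [3]uint{7, 11, 13}

// hashWith is a toy multiplicative string hash reduced modulo m.
func hashWith(seed uint, s string) uint {
	h := seed
	for i := 0; i < len(s); i++ {
		h = h*seed + uint(s[i])
	}
	return h % m
}

// add sets the three bits selected by the three hash functions.
func add(bits *[m]bool, s string) {
	for _, seed := range seeds {
		bits[hashWith(seed, s)] = true
	}
}

// mightContain reports whether all three selected bits are 1.
// false means "definitely not in the set"; true means "possibly in the set".
func mightContain(bits *[m]bool, s string) bool {
	for _, seed := range seeds {
		if !bits[hashWith(seed, s)] {
			return false
		}
	}
	return true
}

func main() {
	var bits [m]bool
	for _, e := range []string{"x", "y", "z"} {
		add(&bits, e)
	}
	fmt.Println("x:", mightContain(&bits, "x"))
	fmt.Println("w:", mightContain(&bits, "w"))
}
```

With these particular toy hashes, none of the bits selected for w is set by x, y or z, so the check for w returns false. With other inputs all three bits could happen to be 1 already, which is exactly how a false positive arises.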
Workflow:
- Step 1: allocate space
Allocate a bit array (binary vector) of length m. How this is done differs from language to language; it can even be backed by a file.
- Step 2: choose the hash functions
Pick several hash functions. Many well-behaved hash functions already exist, such as BKDRHash, JSHash and RSHash, and they can be used directly.
- Step 3: write data
Run the element through each hash function to get several values. For example, with three hash functions the values might be 1000, 2000 and 3000; then set bits 1000, 2000 and 3000 of the m-bit array to 1.
- Step 4: query
To judge whether a new item is in the set, follow the same process as writing: compute the same hash values and check the corresponding bits. If any of them is 0, the item is definitely not in the set; if all are 1, the item may be in the set.
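The four steps above can be sketched using only the Go standard library. Note the substitutions: instead of BKDRHash, JSHash or RSHash, this sketch derives the k positions from a single 64-bit FNV-1a hash by double hashing (h1 + i*h2), a common shortcut. The names `Filter`, `New` and `positions` are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Filter is a minimal Bloom filter sketch: m bits and k hash positions per element.
type Filter struct {
	bits []uint64
	m    uint64
	k    uint64
}

// New performs step 1: allocate an m-bit array, rounded up to whole 64-bit words.
func New(m, k uint64) *Filter {
	return &Filter{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// positions performs step 2: derive k bit positions from one FNV-1a hash
// by double hashing. This is an assumption of the sketch, not the only option.
func (f *Filter) positions(s string) []uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	sum := h.Sum64()
	h1, h2 := sum&0xffffffff, sum>>32
	h2 |= 1 // make sure the step is never zero
	pos := make([]uint64, f.k)
	for i := uint64(0); i < f.k; i++ {
		pos[i] = (h1 + i*h2) % f.m
	}
	return pos
}

// Add performs step 3: set every selected bit to 1.
func (f *Filter) Add(s string) {
	for _, p := range f.positions(s) {
		f.bits[p/64] |= 1 << (p % 64)
	}
}

// Contains performs step 4: the element may be present only if every selected bit is 1.
func (f *Filter) Contains(s string) bool {
	for _, p := range f.positions(s) {
		if f.bits[p/64]&(1<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := New(1<<16, 3)
	f.Add("apple")
	f.Add("banana")
	fmt.Println(f.Contains("apple"))  // true: added elements are always found
	fmt.Println(f.Contains("cherry")) // false unless a false positive occurs
}
```

Packing the bits into `[]uint64` words keeps memory use at m/8 bytes, which is essentially what a bitset library does internally.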
4. Advantages and disadvantages of the Bloom filter
1. Advantages:
- Good space and time efficiency
- Storage space is fixed once the filter is created, and insert / query time is constant (O(k)), independent of the number of elements stored.
- The hash functions are independent of each other, so they can be computed in parallel in hardware.
- The elements themselves are not stored, which is an advantage where confidentiality requirements are very strict.
- A Bloom filter can represent the universal set (all bits set to 1), which few other data structures can do.
2. Disadvantages:
- The false-positive rate rises as more elements are added.
- Elements cannot be removed from a standard Bloom filter, because clearing a bit may also erase other elements that share it.
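The first disadvantage can be made concrete with the standard approximation for the false-positive rate, p ≈ (1 − e^(−kn/m))^k, where n is the number of inserted elements. The sketch below evaluates it for a fixed filter as n grows; the values m = 2^20 and k = 6 are arbitrary choices for illustration, and `falsePositiveRate` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"math"
)

// falsePositiveRate returns the usual approximation (1 - e^(-kn/m))^k
// for a filter with m bits, k hash functions and n inserted elements.
func falsePositiveRate(m, k, n float64) float64 {
	return math.Pow(1-math.Exp(-k*n/m), k)
}

func main() {
	const m, k = 1 << 20, 6 // illustrative filter size and hash count
	for _, n := range []float64{1e4, 1e5, 5e5, 1e6} {
		fmt.Printf("n=%8.0f  p≈%.6f\n", n, falsePositiveRate(m, k, n))
	}
}
```

The printed rate grows monotonically with n, which is why a filter has to be sized for the number of elements it is expected to hold.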
5. Precautions for the Bloom filter
The idea behind the Bloom filter is simple, but two parameters need careful thought: how many hash functions to compute for each element, and how long the bit vector should be.
If the vector is too short, the false-positive rate rises sharply.
If the vector is too long, a lot of memory is wasted.
If too many hash functions are used, computing resources are wasted and the filter fills up quickly.
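One way to balance these trade-offs is to use the standard sizing formulas: for n expected elements and a target false-positive rate p, take m = −n·ln(p) / (ln 2)² bits and k = (m/n)·ln 2 hash functions. The sketch below applies them; `optimalParams` is a hypothetical helper name, and rounding m up and k to the nearest integer is a choice made for this illustration.

```go
package main

import (
	"fmt"
	"math"
)

// optimalParams returns a bit-array length m and hash-function count k for
// n expected elements and a target false-positive rate p (0 < p < 1).
func optimalParams(n uint, p float64) (m, k uint) {
	mf := math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2))
	kf := math.Round(mf / float64(n) * math.Ln2)
	if kf < 1 {
		kf = 1 // always use at least one hash function
	}
	return uint(mf), uint(kf)
}

func main() {
	m, k := optimalParams(1000000, 0.01)
	fmt.Printf("bits: %d (about %.1f MiB), hash functions: %d\n",
		m, float64(m)/8/1024/1024, k)
}
```

For one million elements at a 1% target rate this comes to roughly 9.6 bits per element and 7 hash functions.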
6. Implementing a Bloom filter in Go
1. Simple demonstration with an open-source package
package main

import (
	"fmt"
	"math/rand"

	"github.com/willf/bitset"
)

func main() {
	Foo()
	bar()
}

func Foo() {
	var b bitset.BitSet    // Define a BitSet object
	b.Set(1).Set(2).Set(3) // Add 3 elements
	if b.Test(2) {
		fmt.Println("2 Already exist")
	}
	fmt.Println("Total:", b.Count())

	b.Clear(2)
	if !b.Test(2) {
		fmt.Println("2 Non-existent")
	}
	fmt.Println("Total:", b.Count())
}

func bar() {
	fmt.Printf("Hello from BitSet!\n")
	var b bitset.BitSet

	// play some Go Fish
	for i := 0; i < 100; i++ {
		card1 := uint(rand.Intn(52))
		card2 := uint(rand.Intn(52))
		b.Set(card1)
		if b.Test(card2) {
			fmt.Println("Go Fish!")
		}
		b.Clear(card1)
	}

	// Chaining
	b.Set(10).Set(11)
	for i, e := b.NextSet(0); e; i, e = b.NextSet(i + 1) {
		fmt.Println("The following bit is set:", i)
	}

	// intersection
	if b.Intersection(bitset.New(100).Set(10)).Count() == 1 {
		fmt.Println("Intersection works.")
	} else {
		fmt.Println("Intersection doesn't work???")
	}
}
2. A wrapped implementation:
//----------------------------------------------------------------------------
// @ Copyright (C) free license,without warranty of any kind .
// @ Author: hollson <hollson@live.com>
// @ Date: 2019-12-06
// @ Version: 1.0.0
//------------------------------------------------------------------------------
package bloomx

import "github.com/willf/bitset"

// DEFAULT_SIZE is the number of bits in the filter. It must be a power of
// two, because hash() reduces values with a bit mask (Cap-1).
const DEFAULT_SIZE = 2 << 24

// seeds parameterize the six hash functions.
var seeds = []uint{7, 11, 13, 31, 37, 61}

type BloomFilter struct {
	Set   *bitset.BitSet
	Funcs [6]SimpleHash
}

func NewBloomFilter() *BloomFilter {
	bf := new(BloomFilter)
	for i := 0; i < len(bf.Funcs); i++ {
		bf.Funcs[i] = SimpleHash{DEFAULT_SIZE, seeds[i]}
	}
	bf.Set = bitset.New(DEFAULT_SIZE)
	return bf
}

// Add sets the bit chosen by each hash function.
func (bf BloomFilter) Add(value string) {
	for _, f := range bf.Funcs {
		bf.Set.Set(f.hash(value))
	}
}

// Contains reports whether value may be in the set; false is definitive.
func (bf BloomFilter) Contains(value string) bool {
	if value == "" {
		return false
	}
	ret := true
	for _, f := range bf.Funcs {
		ret = ret && bf.Set.Test(f.hash(value))
	}
	return ret
}

type SimpleHash struct {
	Cap  uint
	Seed uint
}

// hash is a simple multiplicative string hash, masked to the range [0, Cap).
func (s SimpleHash) hash(value string) uint {
	var result uint = 0
	for i := 0; i < len(value); i++ {
		result = result*s.Seed + uint(value[i])
	}
	return (s.Cap - 1) & result
}
// Usage from a separate main package. It needs to import "fmt" and the
// bloomx package above (the import path depends on your module layout).
func main() {
	filter := bloomx.NewBloomFilter()
	fmt.Println(filter.Funcs[1].Seed)

	str1 := "hello,bloom filter!"
	filter.Add(str1)
	str2 := "A happy day"
	filter.Add(str2)
	str3 := "Greate wall"
	filter.Add(str3)

	fmt.Println(filter.Set.Count())
	fmt.Println(filter.Contains(str1))
	fmt.Println(filter.Contains(str2))
	fmt.Println(filter.Contains(str3))
	fmt.Println(filter.Contains("blockchain technology"))
}
A Bloom filter test at the scale of one million elements is also available; for the source code, see https://download.csdn.net/download/Gusand/12018239
References:
https://www.cnblogs.com/z941030/p/9218356.html (recommended)
https://www.jianshu.com/p/01309d298a0e
https://www.cnblogs.com/zengdan-develpoer/p/4425167.html
https://blog.csdn.net/liuzhijun301/article/details/83040178
https://github.com/willf/bloom