Bloom filter in Golang

Keywords: Go github Blockchain

1. Concept of the Bloom filter

A Bloom filter is a binary vector data structure proposed by Burton Howard Bloom in 1970. It is efficient in both space and time. It is used to test whether an element is a member of a set, and its answer is either "may already exist" or "definitely does not exist". If the test result is yes, the element is not necessarily in the set (a false positive is possible); if the test result is no, the element is definitely not in the set. A Bloom filter therefore never produces false negatives, which is why it has a 100% recall rate.


2. Application scenarios of the Bloom filter

  • Spam filtering
  • Preventing cache penetration (requests for nonexistent keys that bypass the cache and hit the database)
  • Bitcoin transaction queries
  • URL deduplication for crawlers
  • IP blacklists
  • Query acceleration (for example, in key-value stores)
  • Detecting duplicate elements in a set


3. Working principle of the Bloom filter

The core of a Bloom filter is a very large bit array and several hash functions. Suppose the length of the bit array is m and the number of hash functions is k.
Consider the case illustrated in the figure, with three hash functions (k = 3). A set contains three elements x, y and z; each element is mapped by the three hash functions to three positions in the bit array, and those bits are set to 1. To test whether an element w is in the set, map w with the same three hash functions. If the bits at the resulting positions are not all 1, then w is definitely not in the set; if they are all 1, w may be in the set.

Workflow:

  • Step 1: allocate space
    Allocate a bit array (binary vector) of length m. Different languages implement this in different ways; it can even be backed by a file.
  • Step 2: choose the hash functions
    Choose k hash functions. Many well-behaved hash functions already exist, such as BKDRHash, JSHash and RSHash, and can be used directly.
  • Step 3: write data
    Pass the element through each hash function to get one value per function. With three hash functions the values might be, say, 1000, 2000 and 3000. Then set bits 1000, 2000 and 3000 of the m-bit array to 1.
  • Step 4: query
    To judge whether new content is in the collection, follow the same process as writing: compute the hash values, then check whether all the corresponding bits are 1 instead of setting them.
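The four steps above can be sketched with the standard library alone. This is only an illustration under assumptions that are not from the original article: the filter size m, the three seeds, the element names and the helper names (position, add, mayContain) are all made up for the example, and the seeded multiplicative hash is a simple BKDR-style function rather than a recommendation.

```go
package main

import "fmt"

// m is the number of bits in the filter; a small value assumed for illustration.
const m = 1 << 16

// seeds parameterize three BKDR-style hash functions (k = 3); assumed values.
var seeds = []uint64{31, 131, 1313}

// filter is the bit array, packed 64 bits per word (step 1).
type filter [m / 64]uint64

// position hashes s with the given seed and reduces it to a bit index (step 2).
func position(s string, seed uint64) uint64 {
	var h uint64
	for i := 0; i < len(s); i++ {
		h = h*seed + uint64(s[i])
	}
	return h % m
}

// add sets the k bits selected by the hash functions (step 3).
func (f *filter) add(s string) {
	for _, seed := range seeds {
		p := position(s, seed)
		f[p/64] |= 1 << (p % 64)
	}
}

// mayContain returns false if any selected bit is 0 (definitely absent)
// and true if all of them are 1 (possibly present) (step 4).
func (f *filter) mayContain(s string) bool {
	for _, seed := range seeds {
		p := position(s, seed)
		if f[p/64]&(1<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	var f filter
	for _, s := range []string{"element-x", "element-y", "element-z"} {
		f.add(s)
	}
	// An added element is never reported absent.
	fmt.Println(f.mayContain("element-x"))
	// Very likely false here, although a false positive is always possible.
	fmt.Println(f.mayContain("element-w"))
}
```

Writing and querying share the position function, which is what makes the judging process consistent with the writing process.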


4. Advantages and disadvantages of the Bloom filter

1. Advantages:

  • Good space and time efficiency.
  • Storage space is fixed at m bits, and insert and query time are O(k), independent of the number of elements already stored.
  • The hash functions are independent of each other, so they can be computed in parallel in hardware.
  • The elements themselves are not stored, which is an advantage where confidentiality requirements are very strict.
  • A Bloom filter can represent the universal set (all bits set to 1), which other set data structures generally cannot.

2. Disadvantages:

  • The false-positive rate rises as more elements are added.
  • Elements cannot be removed from a standard Bloom filter, because clearing an element's bits may also clear bits shared with other elements.
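The removal problem can be seen in a toy example. The sketch below does not use real hash functions: the bit positions for "x" and "y" are hard-coded, hypothetical values chosen so that the two elements share one bit, and the type and method names (bits, add, naiveRemove, contains) are invented for the illustration.

```go
package main

import "fmt"

// positions holds hypothetical, hard-coded bit positions for two elements.
// A real filter would derive them from hash functions. "x" and "y" share bit 3.
var positions = map[string][]int{
	"x": {1, 3},
	"y": {3, 5},
}

// bits is a tiny 8-bit filter.
type bits [8]bool

// add sets every bit assigned to key.
func (b *bits) add(key string) {
	for _, p := range positions[key] {
		b[p] = true
	}
}

// naiveRemove clears every bit assigned to key, including shared ones.
func (b *bits) naiveRemove(key string) {
	for _, p := range positions[key] {
		b[p] = false
	}
}

// contains reports whether all bits assigned to key are set.
// It is only meaningful for the two keys defined in positions.
func (b *bits) contains(key string) bool {
	for _, p := range positions[key] {
		if !b[p] {
			return false
		}
	}
	return true
}

func main() {
	var b bits
	b.add("x")
	b.add("y")
	fmt.Println(b.contains("y")) // true

	b.naiveRemove("x")           // also clears bit 3, which "y" depends on
	fmt.Println(b.contains("y")) // false: a false negative for "y"
}
```

Clearing bits would break the guarantee that a "no" answer is always correct, which is why a plain Bloom filter offers no delete operation.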


5. Precautions for the Bloom filter

The idea of the Bloom filter is simple, but two design parameters need careful consideration: how many hash functions to use (that is, how many times each element is hashed) and how long the bit vector should be.
If the vector is too short, the false-positive rate rises sharply.
If the vector is too long, a lot of memory is wasted.
If there are too many hash functions, each operation costs more computation and the filter fills up quickly.
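These trade-offs can be estimated with the standard approximations for a Bloom filter holding n elements: the false-positive rate is about (1 - e^(-kn/m))^k, and for a target rate p the memory-optimal choices are m = -n ln(p) / (ln 2)^2 and k = (m/n) ln 2. The following sketch evaluates them; the function names optimalSize and falsePositiveRate, and the example figures of one million elements at a 1% target rate, are assumptions for illustration and are not from the original article.

```go
package main

import (
	"fmt"
	"math"
)

// optimalSize returns the bit-array length m and hash-function count k for
// n expected elements and a target false-positive rate p, using the standard
// approximations m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2.
func optimalSize(n uint, p float64) (m, k uint) {
	mf := math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2))
	kf := math.Round(mf / float64(n) * math.Ln2)
	if kf < 1 {
		kf = 1
	}
	return uint(mf), uint(kf)
}

// falsePositiveRate estimates the false-positive rate after n insertions
// into m bits with k hash functions: (1 - e^(-k*n/m))^k.
func falsePositiveRate(m, k, n uint) float64 {
	return math.Pow(1-math.Exp(-float64(k)*float64(n)/float64(m)), float64(k))
}

func main() {
	// Assumed workload: one million elements, 1% target false-positive rate.
	m, k := optimalSize(1000000, 0.01)
	fmt.Println("bits:", m, "hash functions:", k)

	// At the planned load the estimate is close to the target rate.
	fmt.Printf("rate at n:  %.4f\n", falsePositiveRate(m, k, 1000000))
	// Doubling the load without resizing pushes the rate up sharply.
	fmt.Printf("rate at 2n: %.4f\n", falsePositiveRate(m, k, 2000000))
}
```

For this workload the formulas give roughly 9.6 million bits (about 1.2 MB) and 7 hash functions. Overfilling the same filter to twice the planned number of elements raises the estimated rate by more than an order of magnitude.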


6. Implementing a Bloom filter in Go

1. Simple demonstration of the open-source bitset package

package main

import (
   "fmt"
   "math/rand"

   "github.com/willf/bitset"
)

func main() {
   Foo()
   bar()
}

func Foo() {
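   var b bitset.BitSet // The zero value is an empty, ready-to-use bit set

   b.Set(1).Set(2).Set(3) // Set bits 1, 2 and 3 (Set returns the bit set, so calls chain)
   if b.Test(2) {
      fmt.Println("2 already exists")
   }
   fmt.Println("Total:", b.Count()) // Number of bits set: 3

   b.Clear(2) // Unset bit 2
   if !b.Test(2) {
      fmt.Println("2 does not exist")
   }
   fmt.Println("Total:", b.Count()) // Number of bits set: 2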
}

func bar() {
   fmt.Printf("Hello from BitSet!\n")
   var b bitset.BitSet
   // play some Go Fish
   for i := 0; i < 100; i++ {
      card1 := uint(rand.Intn(52))
      card2 := uint(rand.Intn(52))
      b.Set(card1)
      if b.Test(card2) {
         fmt.Println("Go Fish!")
      }
      b.Clear(card1)
   }

   // Chaining
   b.Set(10).Set(11)

   for i, e := b.NextSet(0); e; i, e = b.NextSet(i + 1) {
      fmt.Println("The following bit is set:", i)
   }
   // intersection
   if b.Intersection(bitset.New(100).Set(10)).Count() == 1 {
      fmt.Println("Intersection works.")
   } else {
      fmt.Println("Intersection doesn't work???")
   }
}

2. Wrapping the bit set in a Bloom filter type:

//----------------------------------------------------------------------------
// @ Copyright (C) free license,without warranty of any kind .
// @ Author: hollson <hollson@live.com>
// @ Date: 2019-12-06
// @ Version: 1.0.0
//------------------------------------------------------------------------------
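package bloomx

import "github.com/willf/bitset"

// DEFAULT_SIZE is the number of bits in the filter (2<<24 = 2^25).
// It must be a power of two, because SimpleHash.hash reduces the hash
// value with the bit mask Cap-1.
const DEFAULT_SIZE = 2 << 24

// seeds parameterize the six hash functions, one seed per function.
var seeds = []uint{7, 11, 13, 31, 37, 61}

// BloomFilter pairs a bit set with the hash functions that index into it.
type BloomFilter struct {
   Set   *bitset.BitSet
   Funcs [6]SimpleHash
}

// NewBloomFilter returns an empty filter of DEFAULT_SIZE bits.
func NewBloomFilter() *BloomFilter {
   bf := new(BloomFilter)
   for i := 0; i < len(bf.Funcs); i++ {
      bf.Funcs[i] = SimpleHash{DEFAULT_SIZE, seeds[i]}
   }
   bf.Set = bitset.New(DEFAULT_SIZE)
   return bf
}

// Add sets the bit chosen by each hash function for value.
func (bf *BloomFilter) Add(value string) {
   for _, f := range bf.Funcs {
      bf.Set.Set(f.hash(value))
   }
}

// Contains reports whether value may have been added. false means value
// is definitely absent; true may be a false positive.
func (bf *BloomFilter) Contains(value string) bool {
   if value == "" {
      return false
   }
   for _, f := range bf.Funcs {
      if !bf.Set.Test(f.hash(value)) {
         return false // one unset bit is enough to rule the value out
      }
   }
   return true
}

// SimpleHash is a seeded multiplicative string hash.
type SimpleHash struct {
   Cap  uint
   Seed uint
}

// hash maps value to a bit index in [0, Cap).
func (s SimpleHash) hash(value string) uint {
   var result uint
   for i := 0; i < len(value); i++ {
      result = result*s.Seed + uint(value[i])
   }
   return (s.Cap - 1) & result
}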
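
// main.go: a separate program that uses the bloomx package.
package main

import (
   "fmt"

   "bloomx" // placeholder import path; replace with the actual path of the bloomx package
)

func main() {
   filter := bloomx.NewBloomFilter()
   fmt.Println(filter.Funcs[1].Seed) // seed of the second hash function
   str1 := "hello,bloom filter!"
   filter.Add(str1)
   str2 := "A happy day"
   filter.Add(str2)
   str3 := "Great wall"
   filter.Add(str3)

   fmt.Println(filter.Set.Count()) // number of bits set so far
   fmt.Println(filter.Contains(str1))
   fmt.Println(filter.Contains(str2))
   fmt.Println(filter.Contains(str3))
   fmt.Println(filter.Contains("blockchain technology")) // never added
}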

For a test of the Bloom filter at the scale of one million elements, the source code is available at https://download.csdn.net/download/Gusand/12018239


References:
https://www.cnblogs.com/z941030/p/9218356.html (recommended)
https://www.jianshu.com/p/01309d298a0e
https://www.cnblogs.com/zengdan-develpoer/p/4425167.html
https://blog.csdn.net/liuzhijun301/article/details/83040178
https://github.com/willf/bloom

Posted by jl5501 on Thu, 12 Dec 2019 05:54:50 -0800