bitcount optimization path

Question:

Use Go to implement the bitcount function to count the number of bits set to 1 in a uint64 value.

Option 1:

The most straightforward approach is to shift the value right one bit at a time and check whether the lowest bit is 1. Once all 64 bits have been examined, the running count is the result.

// bitCount1 tests each of the 64 bit positions in turn.
func bitCount1(n uint64) int8 {
   var count int8

   for i := uint(0); i < 64; i++ {
      // Shift bit i down to position 0 and test it.
      if (n>>i)&1 != 0 {
         count++
      }
   }

   return count
}

var BitCount = bitCount1

Implement a test function and a benchmark function to test correctness and performance:

Test environment:

Model name: MacBook Pro
Processor name: Intel Core i7
Processor speed: 2.5 GHz
Number of processors: 1
Total number of cores: 4
L2 cache (per core): 256 KB
L3 Cache: 6 MB
Hyperthreading: Enabled
Memory: 16 GB

// main_test.go

package main

import "testing"

var tests = []struct{
   input uint64
   want int8
}{
   { 7118255637391829670 , 34 },
   { 7064722311543391783 , 25 },
   { 4608963400064623015 , 34 },
   { 14640564048961355682 , 39 },
   { 8527726038136987990 , 27 },
   { 9253052485357177493 , 29 },
   { 8999835155724014433 , 28 },
   { 14841333124033177794 , 35 },
   { 1220369398144154468 , 33 },
   { 15451339541988045209 , 33 },
   { 2516280747705128559 , 28 },
   { 4938673901915240208 , 29 },
   { 410238832127885933 , 29 },
   { 1332323607442058439 , 33 },
   { 15877566392368361617 , 30 },
   { 3880651382986322995 , 35 },
   { 3639402890245875445 , 30 },
   { 16428413304724738456 , 39 },
   { 14754380477986223775 , 37 },
   { 2517156707207435586 , 29 },
   { 15317696849870933326 , 30 },
   { 6013290537376992905 , 35 },
   { 17378274584566732685 , 29 },
   { 5420397259425817882 , 31 },
   { 11286722219793612146 , 35 },
   { 8183954261149622513 , 30 },
   { 17190026713975474863 , 41 },
   { 379948598362354167 , 34 },
   { 3606292518508638567 , 31 },
   { 10997458781072603457 , 33 },
   { 7601699521132896572 , 31 },
   { 16795555978365209258 , 34 },
   { 9555709025715093094 , 35 },
   { 2957346674371128176 , 29 },
   { 6297615394333342337 , 36 },
   { 15800332447329707343 , 31 },
   { 10989482291558635871 , 36 },
   { 10116688196032604814 , 29 },
   { 13017684861263524258 , 29 },
   { 9721224553709591475 , 35 },
   { 7710983100732971068 , 28 },
   { 11089894095639460077 , 38 },
   { 938751439326355368 , 34 },
   { 8732591979705398236 , 33 },
   { 5679915963518233779 , 36 },
   { 16532909388555451248 , 33 },
   { 13248011246533683006 , 31 },
   { 1317996811516389703 , 30 },
   { 4318476060009242000 , 33 },
   { 3082899072464871007 , 34 },
}

func TestBitCount(t *testing.T) {
   for _, test := range tests {
      // %d prints the integers as numbers; %q would format the input as a quoted character.
      if got := BitCount(test.input); got != test.want {
         t.Errorf("BitCount(%d) = %d, want %d", test.input, got, test.want)
      }
   }
}

func BenchmarkBitCount(b *testing.B) {
   var input uint64 = 5679915963518233779

   for i := 0; i < b.N; i++ {
      BitCount(input)
   }
}
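The hand-written table above can be complemented by a cross-check against the standard library. The sketch below assumes Go 1.9 or later, where math/bits.OnesCount64 is available, and uses it as a reference for a few edge cases and some pseudo-random inputs. The names naiveBitCount and crossCheck are illustrative and do not appear in the original code.

```go
package main

import (
	"fmt"
	"math/bits"
	"math/rand"
)

// naiveBitCount mirrors the shift-and-test approach of Option 1
// (a stand-in for whichever implementation is under test).
func naiveBitCount(n uint64) int8 {
	var count int8
	for i := uint(0); i < 64; i++ {
		if (n>>i)&1 != 0 {
			count++
		}
	}
	return count
}

// crossCheck compares f against math/bits.OnesCount64 on a few edge
// cases plus `rounds` pseudo-random values. It returns the first
// mismatching input and false, or 0 and true if everything agrees.
func crossCheck(f func(uint64) int8, rounds int) (uint64, bool) {
	rng := rand.New(rand.NewSource(1)) // fixed seed keeps runs repeatable
	inputs := []uint64{0, 1, 1 << 63, ^uint64(0)}
	for i := 0; i < rounds; i++ {
		inputs = append(inputs, rng.Uint64())
	}
	for _, in := range inputs {
		if int(f(in)) != bits.OnesCount64(in) {
			return in, false
		}
	}
	return 0, true
}

func main() {
	if bad, ok := crossCheck(naiveBitCount, 1000); !ok {
		fmt.Printf("mismatch for input %d\n", bad)
		return
	}
	fmt.Println("all inputs agree with math/bits.OnesCount64")
}
```

Passing any of the later implementations to crossCheck in place of naiveBitCount gives a quick sanity check before benchmarking.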

Running go test -bench=. from the command line gives the following result:

Average time per call: 91.2 ns

The loop always runs 64 times. This can be optimized slightly: the top bit of the input is not necessarily 1, so the loop can stop as soon as the highest set bit has been processed, that is, once the remaining value is 0.

// bitCount11 stops as soon as no set bits remain.
func bitCount11(n uint64) int8 {
   var count int8

   for n != 0 {
      if n&1 != 0 {
         count++
      }

      n >>= 1
   }

   return count
}

var BitCount = bitCount11

Run the test again.

For the benchmark input the performance is unchanged, or even slightly worse. For small values, however, the execution time is much shorter, for example when the input is 1.

For comparison, the previous version takes 41.5 ns on the same input.
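Why the benchmark input barely benefits can be seen by counting loop iterations. The sketch below is illustrative only (shiftIterations is a hypothetical helper, and math/bits requires Go 1.9 or later). The benchmark value 5679915963518233779 lies between 2^62 and 2^63, so its highest set bit is bit 62 and the loop still runs 63 times, while an input of 1 needs a single iteration.

```go
package main

import (
	"fmt"
	"math/bits"
)

// shiftIterations counts how many times the body of a
// "for n != 0 { ...; n >>= 1 }" loop executes for a given input.
func shiftIterations(n uint64) int {
	iterations := 0
	for n != 0 {
		iterations++
		n >>= 1
	}
	return iterations
}

func main() {
	for _, n := range []uint64{0, 1, 0xff, 5679915963518233779, 1 << 63} {
		// bits.Len64 reports the minimum number of bits needed to represent n,
		// i.e. the 1-based position of the highest set bit (0 for n == 0).
		fmt.Printf("n=%#x iterations=%d bits.Len64=%d\n",
			n, shiftIterations(n), bits.Len64(n))
	}
}
```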

Option 2:

The running time of Option 1 is bounded by the position of the highest set bit: even if only one bit is set, as in 0x8000000000000000, 64 iterations are still needed. Is there a way to visit only the bits that are 1? Anyone familiar with bit manipulation will recall that n = n & (n-1) clears the lowest set bit of n, for example 14 & 13 = 0b1110 & 0b1101 = 0b1100. With this trick, a value n that has M bits set to 1 needs only M iterations.

// bitCount2 clears the lowest set bit on every iteration,
// so the loop runs once per set bit.
func bitCount2(n uint64) int8 {
   var count int8

   for n != 0 {
      n &= n - 1
      count++
   }

   return count
}
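To make the n & (n-1) step concrete, the following sketch records every intermediate value for the example input 14. The helper clearLowestSteps is hypothetical and exists only for this trace. For 14 the sequence is 0b1100, 0b1000, 0: three iterations for three set bits.

```go
package main

import "fmt"

// clearLowestSteps returns every intermediate value produced by
// repeatedly applying n = n & (n - 1) until n reaches 0.
func clearLowestSteps(n uint64) []uint64 {
	var steps []uint64
	for n != 0 {
		n &= n - 1 // clears the lowest set bit
		steps = append(steps, n)
	}
	return steps
}

func main() {
	for _, v := range clearLowestSteps(14) {
		fmt.Printf("%04b\n", v)
	}
}
```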

Run a test:

The time has dropped to about 1/3 of the original, a substantial improvement.

With n = 1 as the input, the test results are:

This matches the result of the optimized first version (bitCount11).

Option 3:

The results of the last two versions depend on M, the number of bits set to 1 in n. Is there an algorithm that runs in constant time? Trading space for time naturally suggests a lookup table: precompute, for every possible value of a byte (8 bits), the number of 1 bits it contains; then split n into eight bytes, look each one up, and add the results.

func bitCount31(n uint64) int8 {

   table := [256]int8{
      0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
      1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
      1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
      2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
      1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
      2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
      2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
      3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
      1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
      2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
      2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
      3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
      2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
      3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
      3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
      4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
   }

   var count int8

   for n != 0 {
      count += table[n&0xff]
      n >>= 8
   }

   return count
}
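Typing 256 entries by hand is error-prone. As an alternative sketch (not the author's code), the same table can be generated from the recurrence popcount(i) = popcount(i >> 1) + (i & 1). The name buildByteTable is an assumption made for this example.

```go
package main

import "fmt"

// buildByteTable returns, for every byte value i, the number of 1 bits in i.
// Entry i is derived from entry i>>1 (already filled in, since i>>1 < i)
// plus the lowest bit of i.
func buildByteTable() [256]int8 {
	var table [256]int8
	for i := 1; i < 256; i++ {
		table[i] = table[i>>1] + int8(i&1)
	}
	return table
}

func main() {
	table := buildByteTable()
	fmt.Println(table[:16]) // corresponds to the first row of the literal table
}
```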

The test results are as follows:

11.6 ns, only about one third of Option 2!

However, when n is 1 the performance is slightly worse, at 9.9 ns.

Shrinking the table to 16 entries (that is, indexing it with 4 bits at a time) yields another interesting result:

// bitCount32 looks up the bit count of one 4-bit nibble at a time.
func bitCount32(n uint64) int8 {
   table := [16]int8{0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4}

   var count int8

   for n != 0 {
      count += table[n&0xf]
      n >>= 4
   }

   return count
}

When n is 1, only 3.06 ns is needed, a much better result.

In fact, as the input grows from 0x0 to 0xffffffffffffffff (each step shifts the value left by 4 bits and sets the lowest 4 bits to 0xf), the 8-bit table goes linearly from 8.86 ns to 11.6 ns, while the 4-bit table goes from 2.41 ns to 11.5 ns, so the 4-bit table is the better choice.
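One possible reading of that sweep is sketched below. The helper sweepInputs, the stand-in nibbleCount, the sink variable and BenchmarkSweep are all assumptions made for illustration. To take measurements, BenchmarkSweep would need to live in a _test.go file and be run with go test -bench=Sweep; the main function here only lists the inputs.

```go
package main

import (
	"fmt"
	"testing"
)

// sweepInputs returns 0x0, 0xf, 0xff, ..., 0xffffffffffffffff (17 values).
func sweepInputs() []uint64 {
	inputs := []uint64{0}
	var n uint64
	for i := 0; i < 16; i++ {
		n = n<<4 | 0xf // shift left by 4 bits and fill the low nibble
		inputs = append(inputs, n)
	}
	return inputs
}

// nibbleCount is a stand-in for the implementation being measured.
func nibbleCount(n uint64) int8 {
	table := [16]int8{0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4}
	var count int8
	for n != 0 {
		count += table[n&0xf]
		n >>= 4
	}
	return count
}

// sink keeps the result live so the compiler cannot discard the call.
var sink int8

// BenchmarkSweep runs one sub-benchmark per input value.
func BenchmarkSweep(b *testing.B) {
	for _, in := range sweepInputs() {
		b.Run(fmt.Sprintf("%#x", in), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				sink = nibbleCount(in)
			}
		})
	}
}

func main() {
	for _, in := range sweepInputs() {
		fmt.Printf("%#018x -> %d\n", in, nibbleCount(in))
	}
}
```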

Option 4:

Divide and conquer also yields the result in constant time. The bits of n are split into groups of 2, 4, 8, 16, 32 and finally 64 bits; at each level the counts of two neighbouring groups are added, and the last sum is the number of 1 bits. Take 0b1111 as an example. Treating each bit as its own count gives 1, 1, 1, 1. Adding neighbours within each 2-bit group gives 0b10, 0b10. Adding those within the 4-bit group gives 0b10 + 0b10 = 0b100, which is 4 in decimal, the number of 1 bits in 0b1111. (This is known as the variable-precision SWAR algorithm; a more detailed description can be found in the reference links at the end.)

func bitCount41(n uint64) int8 {
   // Each step adds neighbouring groups, doubling the group width: 1 -> 2 -> 4 -> ... -> 64 bits.
   n = (n & 0x5555555555555555) + ((n >> 1) & 0x5555555555555555)
   n = (n & 0x3333333333333333) + ((n >> 2) & 0x3333333333333333)
   n = (n & 0x0f0f0f0f0f0f0f0f) + ((n >> 4) & 0x0f0f0f0f0f0f0f0f)
   n = (n & 0x00ff00ff00ff00ff) + ((n >> 8) & 0x00ff00ff00ff00ff)
   n = (n & 0x0000ffff0000ffff) + ((n >> 16) & 0x0000ffff0000ffff)
   n = (n & 0x00000000ffffffff) + ((n >> 32) & 0x00000000ffffffff)

   return int8(n & 0x7f)
}

At only 5.23 ns, this halves the time of the lookup-table implementation. Moreover, the result stays at about 5.23 ns whether n is 0 or 0xffffffffffffffff.
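The pairwise summing behind bitCount41 can be traced on a single byte, where only the first three steps are needed. The sketch below uses a hypothetical helper, swar8, that returns the value after each step. For the input 0x0f (0b1111) the stages are 0b00001010, 0b00000100 and 0b00000100, matching the hand calculation above.

```go
package main

import "fmt"

// swar8 applies the pairwise-sum steps to a single byte and returns the
// value after each step: 2-bit sums, 4-bit sums, and the final 8-bit sum.
func swar8(x uint8) [3]uint8 {
	var stages [3]uint8
	x = (x & 0x55) + ((x >> 1) & 0x55) // each 2-bit field now counts its own 2 bits
	stages[0] = x
	x = (x & 0x33) + ((x >> 2) & 0x33) // each 4-bit field now counts its own 4 bits
	stages[1] = x
	x = (x & 0x0f) + ((x >> 4) & 0x0f) // the byte now counts all 8 bits
	stages[2] = x
	return stages
}

func main() {
	for _, s := range swar8(0x0f) {
		fmt.Printf("%08b\n", s)
	}
}
```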

A uint64 contains at most 64 ones, and 64 fits comfortably in a single byte, so some of the masking above is unnecessary and can be removed:

func bitCount42(n uint64) int8 {
   n = n - ((n >> 1) & 0x5555555555555555)
   n = (n & 0x3333333333333333) + ((n >> 2) & 0x3333333333333333)
   n = (n & 0x0f0f0f0f0f0f0f0f) + ((n >> 4) & 0x0f0f0f0f0f0f0f0f)
   // Each byte now holds a count of at most 8, so the remaining sums
   // cannot overflow a byte and need no masking until the end.
   n = n + (n >> 8)
   n = n + (n >> 16)
   n = n + (n >> 32)
   return int8(n & 0x7f)
}

The time drops to 3.91 ns! That is about a third of the lookup-table time, and 3.91/91.2 = 4.3% of the first implementation.
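The first line of bitCount42 also swaps the masked addition for a subtraction. For a 2-bit field with high bit a and low bit b, the field's value is 2a + b; subtracting a leaves a + b, the number of 1 bits, and no borrow crosses into the neighbouring field. The sketch below compares both forms (pairSumsAdd and pairSumsSub are hypothetical names for this illustration).

```go
package main

import "fmt"

// pairSumsAdd computes the per-2-bit-field counts with two masks and an add.
func pairSumsAdd(n uint64) uint64 {
	return (n & 0x5555555555555555) + ((n >> 1) & 0x5555555555555555)
}

// pairSumsSub computes the same fields with one mask and a subtract.
func pairSumsSub(n uint64) uint64 {
	return n - ((n >> 1) & 0x5555555555555555)
}

func main() {
	// A single 2-bit field: 00 -> 0, 01 -> 1, 10 -> 1, 11 -> 2.
	for v := uint64(0); v < 4; v++ {
		fmt.Printf("%02b: add=%d sub=%d\n", v, pairSumsAdd(v), pairSumsSub(v))
	}
}
```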

Appendix:

For the table-lookup implementations, the table can be moved out of the function (as a global variable) to improve the benchmark numbers. When the table is declared inside the function, every call allocates stack space and fills in the array values, which takes time.

With a global table, the comparison between the 4-bit and 8-bit tables is reversed. When n is 0, both take about 2 ns. When n = 0xffffffffffffffff, the 4-bit table takes about 11.5 ns and the 8-bit table about 7 ns. When n = 0xffffff (all 24 low bits set), the 8-bit table takes about 4 ns, on par with Option 4. So when most inputs fall within 0xffffff, the 8-bit table lookup is the one to choose for better running time.
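A minimal sketch of the package-level variant follows. The identifiers nibbleTable, byteTable, bitCountGlobal4 and bitCountGlobal8 are assumptions for this example, and the 8-bit table is derived from the 4-bit one in init rather than written out by hand.

```go
package main

import "fmt"

// nibbleTable lives at package level, so it is set up once at program
// start instead of being rebuilt inside the function on every call.
var nibbleTable = [16]int8{0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4}

// byteTable is filled in once by init from the two nibbles of each byte.
var byteTable [256]int8

func init() {
	for i := range byteTable {
		byteTable[i] = nibbleTable[i>>4] + nibbleTable[i&0xf]
	}
}

// bitCountGlobal4 performs at most 16 lookups in the 4-bit table.
func bitCountGlobal4(n uint64) int8 {
	var count int8
	for n != 0 {
		count += nibbleTable[n&0xf]
		n >>= 4
	}
	return count
}

// bitCountGlobal8 performs at most 8 lookups in the 8-bit table.
func bitCountGlobal8(n uint64) int8 {
	var count int8
	for n != 0 {
		count += byteTable[n&0xff]
		n >>= 8
	}
	return count
}

func main() {
	for _, n := range []uint64{0, 0xffffff, ^uint64(0)} {
		fmt.Printf("%#x: 4-bit=%d 8-bit=%d\n", n, bitCountGlobal4(n), bitCountGlobal8(n))
	}
}
```

For a full 64-bit value the 8-bit version loops 8 times against 16 for the 4-bit version, which is consistent with the reversal reported above once the per-call table setup is gone.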

You can use pprof to see where the time goes at the source-line level.

From the command line, execute:

$ go test -c main_test.go main.go

$ ./main.test -test.bench=. -test.cpuprofile=cpu-profile.prof

$ go tool pprof main.test cpu-profile.prof

Then, at the pprof prompt, enter: weblist bitCount

table as a local variable of the function:

table as a global variable:

Time taken when n is 0:

Time taken when n is 0xffffffffffffffff:

Summary:

Across the iterations above, the main idea is to reduce how many times the loop body executes: from a fixed 64, to a count that depends on the position of the highest 1 bit, then to the number of 1 bits, and finally to a constant number of operations via a lookup table or divide and conquer (at most 8 lookups with an 8-bit table, a fixed 6 steps for divide and conquer). When the lookup table is declared inside the function, every call has to allocate the array on the stack and fill it in, which costs extra time. Divide and conquer, with its low and stable running time, is the approach most commonly used in production implementations.

References:

variable-precision SWAR algorithm:

https://ivanzz1001.github.io/...
https://segmentfault.com/a/11...

Using Go's pprof:

https://blog.golang.org/profi...

CPU Branch Prediction Model:

https://zhuanlan.zhihu.com/p/...

Posted by elite_prodigy on Sat, 29 Jun 2019 10:07:22 -0700
