Source code analysis - Golang Map

Keywords: Go source code

Related constants

bucketCntBits = 3                  // log2 of bucketCnt
bucketCnt     = 1 << bucketCntBits // A bucket (bmap) holds at most 8 keys

loadFactorNum = 13
loadFactorDen = 2 // Together these give the load factor 13/2 = 6.5, which decides when growth is triggered

maxKeySize  = 128
maxElemSize = 128 // Keys/values larger than 128 bytes are stored indirectly, as pointers

emptyRest      = 0 // This tophash slot is free, and so is every slot after it (including the overflow chain)
emptyOne       = 1 // Only this tophash slot is free
evacuatedX     = 2 // Set during rehash: the entry was moved to the X half (same bucket index)
evacuatedY     = 3 // Set during rehash: the entry was moved to the Y half (index + 2^oldB)
evacuatedEmpty = 4 // Set once every entry in this bucket has been migrated
minTopHash     = 5
When tophash[i] < minTopHash, the slot stores one of the states above; otherwise it stores the top 8 bits of the hash
  • Each tophash slot corresponds to one key/value pair
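  • As a cross-check, here is a minimal sketch of how the runtime derives tophash from a full hash, mirroring src/runtime/map.go (the 64-bit hash width is an assumption for amd64):

    • const minTopHash = 5 // as above
      
      // tophash takes the top 8 bits of the hash; values below minTopHash are
      // reserved for the state markers above, so real hashes are bumped past them
      func tophash(hash uintptr) uint8 {
      	top := uint8(hash >> (8*8 - 8)) // top byte of a 64-bit hash
      	if top < minTopHash {
      		top += minTopHash
      	}
      	return top
      }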

map structure

  • src/runtime/map.go

  • The internal object is hmap

    • type hmap struct {
      	// Note: the format of the hmap is also encoded in cmd/compile/internal/gc/reflect.go.
      	count     int    // Number of elements in the map
      	flags     uint8  // Status flags (writing, iterators running, grow mode)
      	B         uint8  // log2 of the bucket count, i.e. len(buckets) == 2^B
      	noverflow uint16 // Approximate number of overflow buckets
      	hash0     uint32 // Hash seed, fed into the hash function
      
      	buckets    unsafe.Pointer // Pointer to the bucket array
      	oldbuckets unsafe.Pointer // Previous bucket array, non-nil only while growing
      	nevacuate  uintptr        // Progress counter for progressive rehash (similar to Redis)
      
      	extra *mapextra // Optional fields; holds overflow-bucket pointers so bmap itself can stay pointer-free
      }
      
      
  • Like Java's HashMap, it also has the concept of a bucket; in Go this is bmap

    • type bmap struct {
      	tophash [bucketCnt]uint8 // Only the tophash array is declared here; a bucket holds at most 8 keys
      }
      The actual layout generated by the compiler is:
      type bmap struct {
        topbits  [8]uint8
        keys     [8]keytype
        values   [8]valuetype
        pad      uintptr
        overflow uintptr // When K and V contain no pointers, overflow is stored as a uintptr and the real pointers live in hmap.extra, so bmap holds no pointers and GC need not scan it
      }
      
  • The memory model of bmap is:

    • key/key/key/.../value/value/value..., not key/value/key/value: all keys are stored contiguously, followed by all values

    • (figure golang_map_bmap.png: bmap memory layout — the tophash bytes, then all 8 keys, then all 8 values; original image unavailable)
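  • A back-of-the-envelope illustration (my own example, assuming 8-byte alignment for int64 on a 64-bit platform) of why the grouped layout saves padding:

    • // Consider map[int8]int64.
      // Interleaved k/v pairs: each 1-byte key would need 7 bytes of padding
      // before its 8-byte value, so 8 pairs cost 8 * (1 + 7 + 8) = 128 bytes.
      // Grouped keys then values: the key block is 8*1 = 8 bytes and already
      // ends 8-byte aligned, so no padding at all: 8*1 + 8*8 = 72 bytes.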

Map initialization

  • m1 := make(map[int]int)  // The corresponding internal function is makemap_small
    m2 := make(map[int]int, 10) // The corresponding function is makemap; make accepts an optional size hint
    
  • func makemap_small() *hmap {
    	h := new(hmap)
    	h.hash0 = fastrand()
    	return h
    }
    It simply allocates a new hmap; the bucket array is not initialized yet
    
  • Core functions:

    • 
      func makemap(t *maptype, hint int, h *hmap) *hmap {
      	mem, overflow := math.MulUintptr(uintptr(hint), t.bucket.size)
      	if overflow || mem > maxAlloc {
      		hint = 0
      	}
      
      
      	if h == nil {
      		h = new(hmap)
      	}
      	// Get a random factor
      	h.hash0 = fastrand()
      
      	// hint is the size hint passed to make. As with Java's HashMap, the final capacity is a power of two: the smallest 2^B that keeps the load factor in bounds
      	B := uint8(0)
      	for overLoadFactor(hint, B) {
      		B++
      	}
      	h.B = B
      
      
      	// If B == 0, bucket allocation is deferred until the first write (mapassign)
      	if h.B != 0 {
      		var nextOverflow *bmap
      		// Request to create bucket array
      		h.buckets, nextOverflow = makeBucketArray(t, h.B, nil)
      		if nextOverflow != nil {
      			h.extra = new(mapextra)
      			h.extra.nextOverflow = nextOverflow
      		}
      	}
      
      	return h
      }
      
  • summary

    • Map initialization takes one of two paths:
      • makemap_small: only the hmap is created; the buckets are not initialized
      • makemap: rounds the capacity up to a power of two, then allocates the bucket array
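  • A runnable sketch (my own condensation of the runtime logic, using the constants above) of how makemap picks B for a given hint:

    • package main
      
      import "fmt"
      
      func bucketShift(b uint8) uintptr { return uintptr(1) << b }
      
      // overLoadFactor reports whether count elements overload 2^B buckets,
      // i.e. count > 8 && count > 6.5 * 2^B (written as 13 * (2^B / 2))
      func overLoadFactor(count int, B uint8) bool {
      	return count > 8 && uintptr(count) > 13*(bucketShift(B)/2)
      }
      
      func pickB(hint int) uint8 {
      	B := uint8(0)
      	for overLoadFactor(hint, B) {
      		B++
      	}
      	return B
      }
      
      func main() {
      	fmt.Println(pickB(8))   // 0: fits in a single bucket
      	fmt.Println(pickB(10))  // 1: 2 buckets
      	fmt.Println(pickB(100)) // 4: 16 buckets hold up to 104 entries
      }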

Map put

  • Function: src/runtime/map.go#mapassign

    • Phase I: initialization phase

      • 	// ... routine debug checks and validation omitted
        	
        	// If another goroutine is already writing, crash
        	if h.flags&hashWriting != 0 {
        		throw("concurrent map writes")
        	}
        	// The hash function is chosen at compile time for the key type
        	hash := t.hasher(key, uintptr(h.hash0))
        	
        	// Mark the map as being written (used by the concurrent-access checks)
        	h.flags ^= hashWriting
        	
        	// For a map from makemap_small, buckets were never allocated; do it lazily here
        	if h.buckets == nil {
        		h.buckets = newobject(t.bucket) // newarray(t.bucket, 1)
        	}
        
        
        
    • Stage 2: locate bucket

      •  // Locate the target bucket via the low B bits of the hash
        b := (*bmap)(add(h.buckets, bucket*uintptr(t.bucketsize)))
        // Take the top 8 bits of the hash as the tophash
        top := tophash(hash)
        
        	var inserti *uint8
        	var insertk unsafe.Pointer
        	var elem unsafe.Pointer
        bucketloop:
        	for {
        	// Scan each of the 8 slots in the bucket
        		for i := uintptr(0); i < bucketCnt; i++ {
        		// This slot's tophash does not match
        			if b.tophash[i] != top {
        				// If the slot is empty and no insertion slot has been chosen yet, remember it
        				if isEmpty(b.tophash[i]) && inserti == nil {
        					inserti = &b.tophash[i]
        					insertk = add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))
        					elem = add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize))
        				}
        				// emptyRest: this slot and everything after it (including the
        				// overflow chain) is empty, so no existing key can match
        				if b.tophash[i] == emptyRest {
        					break bucketloop
        				}
        				continue
        			}
        			// tophash matches: compare the actual key
        			k := add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))
        			// If the key is stored indirectly (as a pointer), dereference it
        			if t.indirectkey() {
        				k = *((*unsafe.Pointer)(k))
        			}
        			// Only an equal key is an update; otherwise keep scanning
        			if !t.key.equal(key, k) {
        				continue
        			}
        			// Update the key in place (typed memory copy) if the type requires it
        			if t.needkeyupdate() {
        				typedmemmove(t.key, k, key)
        			}
        			// Point elem at the existing value slot; the caller stores the new value through it
        			elem = add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize))
        			goto done
        		}
        		// No match in this bucket: follow the overflow pointer
        		ovf := b.overflow(t)
        		if ovf == nil {
        		// End of the overflow chain: the key does not exist yet
        			break
        		}
        		// Continue the loop with the next bucket in the chain
        		b = ovf
        	}
        	// The key was not found. Before inserting, check whether growth should start:
        	// either the average number of entries per bucket would exceed the load
        	// factor, or there are too many overflow buckets
        	if !h.growing() && (overLoadFactor(h.count+1, h.B) || tooManyOverflowBuckets(h.noverflow, h.B)) {
        		// Start growing
        		hashGrow(t, h)
        		// Growing relocates entries, so redo the whole lookup
        		goto again 
        	}
        
    • Stage 3: applying for a new bucket

      • Reaching here means no existing key matched. If no free slot was found either, the bucket and its whole overflow chain are full, so allocate a new overflow bucket:
        if inserti == nil {
        		// The current bucket and all the overflow buckets connected to it are full, allocate a new one.
        		newb := h.newoverflow(t, b)
        		inserti = &newb.tophash[0]
        		insertk = add(unsafe.Pointer(newb), dataOffset)
        		elem = add(insertk, bucketCnt*uintptr(t.keysize))
        	}
        	
        	// If the key/value are stored indirectly, allocate heap objects and store their pointers
        	if t.indirectkey() {
        		kmem := newobject(t.key)
        		*(*unsafe.Pointer)(insertk) = kmem
        		insertk = kmem
        	}
        	if t.indirectelem() {
        		vmem := newobject(t.elem)
        		*(*unsafe.Pointer)(elem) = vmem
        	}
        	// Memory copy key
        	typedmemmove(t.key, insertk, key)
        	*inserti = top
        	h.count++
        	
        	// Finally, clear the writing flag
        done:
        	if h.flags&hashWriting == 0 {
        		throw("concurrent map writes")
        	}
        	h.flags &^= hashWriting
        	if t.indirectelem() {
        		elem = *((*unsafe.Pointer)(elem))
        	}
        	return elem
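
  • The hashWriting checks above are what produce Go's famous crash on unsynchronized map access. A deliberately broken demonstration (note the detection is best-effort, and it is an unrecoverable runtime throw, not a panic you can recover from):

    • package main
      
      func main() {
      	m := map[int]int{}
      	go func() {
      		for i := 0; ; i++ {
      			m[i] = i
      		}
      	}()
      	for i := 0; ; i++ {
      		m[i] = i // typically dies with "fatal error: concurrent map writes"
      	}
      }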
        

Map expansion (rehash)

  • golang's rehash is progressive. hashGrow only allocates a new bucket array (in the second trigger case, one of the same size) and flips the flags. After that, every write or delete on a growing map checks h.growing() and assists by evacuating a bucket or two (growWork); plain reads do not evacuate, they just consult oldbuckets

  • The most critical function is evacuate

  • There are two reasons for capacity expansion

    • One is the capacity expansion caused by reaching the loadFactor

    • The other is having too many overflow buckets

      • // Load factor exceeded: count / 2^B > 6.5 (i.e. 13/2)
        func overLoadFactor(count int, B uint8) bool {
          return count > bucketCnt && uintptr(count) > loadFactorNum*(bucketShift(B)/loadFactorDen)
        }
        
        // Too many overflow buckets
        func tooManyOverflowBuckets(noverflow uint16, B uint8) bool {
          if B > 15 {
            B = 15
          }
          return noverflow >= uint16(1)<<(B&15)
        }
        
    • If the load factor triggered it, the bucket count is doubled (B increases by one); if too many overflow buckets triggered it, the new table has the same size and the rebuild just compacts entries out of overflow buckets

  • func evacuate(t *maptype, h *hmap, oldbucket uintptr) {
    	b := (*bmap)(add(h.oldbuckets, oldbucket*uintptr(t.bucketsize)))
    	newbit := h.noldbuckets()
    	if !evacuated(b) { // Skip buckets that have already been evacuated (detected via the tophash state markers)
    		var xy [2]evacDst // Two possible destinations: xy[0] is X (same index), xy[1] is Y (index+newbit); a same-size grow only uses X
    		x := &xy[0]
    		x.b = (*bmap)(add(h.buckets, oldbucket*uintptr(t.bucketsize)))
    		x.k = add(unsafe.Pointer(x.b), dataOffset)
    		x.e = add(x.k, bucketCnt*uintptr(t.keysize))
    
    		if !h.sameSizeGrow() {
    			// Doubling grow: entries may also move to the Y destination, at index oldbucket+newbit
    			y := &xy[1]
    			y.b = (*bmap)(add(h.buckets, (oldbucket+newbit)*uintptr(t.bucketsize)))
    			y.k = add(unsafe.Pointer(y.b), dataOffset)
    			y.e = add(y.k, bucketCnt*uintptr(t.keysize))
    		}
    		// Walk the old bucket and its whole overflow chain
    		for ; b != nil; b = b.overflow(t) {
    			k := add(unsafe.Pointer(b), dataOffset)
    			e := add(k, bucketCnt*uintptr(t.keysize))
    			// Traverse each element inside the bucket
    			for i := 0; i < bucketCnt; i, k, e = i+1, add(k, uintptr(t.keysize)), add(e, uintptr(t.elemsize)) {
    				top := b.tophash[i]
    				// Empty slot: just mark it as evacuated (evacuatedEmpty)
    				if isEmpty(top) {
    					b.tophash[i] = evacuatedEmpty
    					continue
    				}
    				// A non-empty slot must hold a real tophash (>= minTopHash); anything else is corrupted state
    				if top < minTopHash {
    					throw("bad map state")
    				}
    				// If the key is stored indirectly, dereference it
    				k2 := k
    				if t.indirectkey() {
    					k2 = *((*unsafe.Pointer)(k2))
    				}
    				var useY uint8
    				if !h.sameSizeGrow() {
    					// Doubling grow: recompute the hash to decide between X and Y
    					hash := t.hasher(k2, uintptr(h.hash0))
    					if h.flags&iterator != 0 && !t.reflexivekey() && !t.key.equal(k2, k2) {
    						// An iterator is running and the key is not equal to itself
    						// (e.g. a NaN float), so its hash is not reproducible. Use the
    						// low bit of the old tophash to choose X or Y: after rehash the
    						// entry lands either in place or at bucketIndex+2^oldB. (Java's
    						// HashMap uses a similar "stay, or move by oldCap" split.)
    						useY = top & 1
    						top = tophash(hash)
    					} else {
    						if hash&newbit != 0 {
    							useY = 1
    						}
    					}
    				}
    
    				if evacuatedX+1 != evacuatedY || evacuatedX^1 != evacuatedY {
    					throw("bad evacuatedN")
    				}
    
    				b.tophash[i] = evacuatedX + useY // evacuatedX + 1 == evacuatedY
    				dst := &xy[useY]                 // evacuation destination
    				// The destination bucket is full: chain on a new overflow bucket
    				if dst.i == bucketCnt {
    					dst.b = h.newoverflow(t, dst.b)
    					dst.i = 0
    					dst.k = add(unsafe.Pointer(dst.b), dataOffset)
    					dst.e = add(dst.k, bucketCnt*uintptr(t.keysize))
    				}
    				dst.b.tophash[dst.i&(bucketCnt-1)] = top // mask dst.i as an optimization, to avoid a bounds check
    				// Copy assignment / direct assignment
    				if t.indirectkey() {
    					*(*unsafe.Pointer)(dst.k) = k2 // copy pointer
    				} else {
    					typedmemmove(t.key, dst.k, k) // copy elem
    				}
    				if t.indirectelem() {
    					*(*unsafe.Pointer)(dst.e) = *(*unsafe.Pointer)(e)
    				} else {
    					typedmemmove(t.elem, dst.e, e)
    				}
    				dst.i++
    				dst.k = add(dst.k, uintptr(t.keysize))
    				dst.e = add(dst.e, uintptr(t.elemsize))
    			}
    		}
    		// If no iterator is using the old buckets, clear the old bucket's keys/values so GC can reclaim them (tophash is preserved: it records the evacuation state)
    		if h.flags&oldIterator == 0 && t.bucket.ptrdata != 0 {
    			b := add(h.oldbuckets, oldbucket*uintptr(t.bucketsize))
    			ptr := add(b, dataOffset)
    			n := uintptr(t.bucketsize) - dataOffset
    			memclrHasPointers(ptr, n)
    		}
    	}
    
    	if oldbucket == h.nevacuate {
    		// Advance the evacuation progress counter; once every bucket has been
    		// evacuated, oldbuckets is released and the grow flags are cleared
    		advanceEvacuationMark(h, t, newbit)
    	}
    }
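
  • A worked example (my own numbers) of the X/Y split during a doubling grow:

    • // oldB = 2: there were 4 old buckets, so newbit = 4.
      // An entry in old bucket i stays at i (X) when hash&newbit == 0,
      // and moves to i+4 (Y) when hash&newbit != 0.
      hash := uintptr(0b10110)
      oldIdx := hash & (4 - 1) // = 2: the low 2 bits index the old table
      newIdx := hash & (8 - 1) // = 6: the low 3 bits index the new table
      // newIdx == oldIdx + 4 because hash&newbit != 0: this entry goes to Y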
    

Map deletion

  • func mapdelete(t *maptype, h *hmap, key unsafe.Pointer) {
    	// .... Omit the debug information
    	if h == nil || h.count == 0 {
    		if t.hashMightPanic() {
    			t.hasher(key, 0) // see issue 23734
    		}
    		return
    	}
    	// Detect a concurrent write in progress
    	if h.flags&hashWriting != 0 {
    		throw("concurrent map writes")
    	}
    	// Get the hash corresponding to this key
    	hash := t.hasher(key, uintptr(h.hash0))
    
     // Mark the map as being written
    	h.flags ^= hashWriting
    	// The bucket index is the low B bits of the hash (hash & bucketMask)
    	bucket := hash & bucketMask(h.B)
    	// If a grow is in progress, assist by evacuating this bucket (plus one more)
    	if h.growing() {
    		growWork(t, h, bucket)
    	}
    	// Compute the bucket's address: the head of its overflow chain
    	b := (*bmap)(add(h.buckets, bucket*uintptr(t.bucketsize)))
    	bOrig := b
    	// Get the upper 8 bits of hash
    	top := tophash(hash)
    search:
    	for ; b != nil; b = b.overflow(t) {
    		for i := uintptr(0); i < bucketCnt; i++ {
    			if b.tophash[i] != top {
    			// emptyRest: nothing exists at or after this slot, so the key cannot be present; stop early
    				if b.tophash[i] == emptyRest {
    					break search
    				}
    				continue
    			}
    			k := add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))
    			k2 := k
    			// Dereference
    			if t.indirectkey() {
    				k2 = *((*unsafe.Pointer)(k2))
    			}
    			if !t.key.equal(key, k2) {
    				continue
    			}
    			
    			if t.indirectkey() {
    				*(*unsafe.Pointer)(k) = nil
    			} else if t.key.ptrdata != 0 {
    				memclrHasPointers(k, t.key.size)
    			}
    			// Get the corresponding value
    			e := add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize))
    			// Clear value
    			if t.indirectelem() {
    				*(*unsafe.Pointer)(e) = nil
    			} else if t.elem.ptrdata != 0 {
    				memclrHasPointers(e, t.elem.size)
    			} else {
    				memclrNoHeapPointers(e, t.elem.size)
    			}
    			// Mark just this slot as free (emptyOne)
    			b.tophash[i] = emptyOne
    			if i == bucketCnt-1 {
    				if b.overflow(t) != nil && b.overflow(t).tophash[0] != emptyRest {
    					goto notLast
    				}
    				// No overflow bucket (or its first slot is already emptyRest):
    				// everything after the deleted slot is empty
    			} else {
    				if b.tophash[i+1] != emptyRest {
    					goto notLast
    				}
    				// The next slot is emptyRest, so everything after the deleted
    				// slot is empty
    			}
    			// Walk backwards, upgrading trailing emptyOne slots to emptyRest
    			for {
    				b.tophash[i] = emptyRest
    				if i == 0 {
    					if b == bOrig {
    						break // beginning of initial bucket, we're done.
    					}
    					// Find previous bucket, continue at its last entry.
    					c := b
    					for b = bOrig; b.overflow(t) != c; b = b.overflow(t) {
    					}
    					i = bucketCnt - 1
    				} else {
    					i--
    				}
    				if b.tophash[i] != emptyOne {
    					break
    				}
    			}
    		notLast:
    			h.count--
    			// Reset the hash seed to make it more difficult for attackers to
    			// repeatedly trigger hash collisions. See issue 25237.
    			if h.count == 0 {
    				h.hash0 = fastrand()
    			}
    			break search
    		}
    	}
    	// Clear the writing flag
    	if h.flags&hashWriting == 0 {
    		throw("concurrent map writes")
    	}
    	h.flags &^= hashWriting
    }
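
  • A small before/after sketch (my own notation) of that emptyRest back-propagation, which is what lets later lookups stop at the first emptyRest:

    • // Deleting the entry in slot 5 when slots 6..7 are already emptyOne:
      //   tophash before: [h h h h h X e e]   X = slot being deleted, e = emptyOne
      //   tophash after:  [h h h h h R R R]   R = emptyRest
      // Lookups in this bucket now bail out at slot 5 instead of scanning
      // all 8 slots plus the overflow chain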
    

Map lookup

  • func mapaccessK(t *maptype, h *hmap, key unsafe.Pointer) (unsafe.Pointer, unsafe.Pointer) {
    	if h == nil || h.count == 0 {
    		return nil, nil
    	}
    	hash := t.hasher(key, uintptr(h.hash0))
    	m := bucketMask(h.B)
    	// The bucket index is the low B bits of the hash (hash & m)
    	b := (*bmap)(unsafe.Pointer(uintptr(h.buckets) + (hash&m)*uintptr(t.bucketsize)))
    	// oldbuckets is non-nil: a grow is in progress
    	if c := h.oldbuckets; c != nil {
    		// If this is not a same-size grow, the old table had half as many buckets
    		if !h.sameSizeGrow() {
    			// Get the address of the previous bucket
    			// There used to be half as many buckets; mask down one more power of two.
    			m >>= 1
    		}
    		oldb := (*bmap)(unsafe.Pointer(uintptr(c) + (hash&m)*uintptr(t.bucketsize)))
    		if !evacuated(oldb) {
    		// The old bucket has not been evacuated yet, so the data still lives there; read from it
    			b = oldb
    		}
    	}
    	top := tophash(hash)
    bucketloop:
    // Scan the bucket chain: match tophash first, then compare the full key
    	for ; b != nil; b = b.overflow(t) {
    		for i := uintptr(0); i < bucketCnt; i++ {
    			if b.tophash[i] != top {
    				if b.tophash[i] == emptyRest {
    					break bucketloop
    				}
    				continue
    			}
    			k := add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))
    			if t.indirectkey() {
    				k = *((*unsafe.Pointer)(k))
    			}
    			if t.key.equal(key, k) {
    				e := add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize))
    				if t.indirectelem() {
    					e = *((*unsafe.Pointer)(e))
    				}
    				return k, e
    			}
    		}
    	}
    	return nil, nil
    }
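
  • At the language level these runtime lookups surface as the one- and two-value index forms (mapaccess1/mapaccess2 in the runtime; mapaccessK above is the variant the iterator uses because it needs both key and value):

    • m := map[string]int{"a": 1}
      v := m["missing"]     // zero value 0: an absent key and a stored 0 look the same
      v, ok := m["missing"] // comma-ok form: ok reports whether the key exists
      _, _ = v, ok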
    

Summary

  • The golang map defines many state markers, such as emptyOne, emptyRest, and the evacuated* values, whose main purpose is to make scans fail fast

  • map is implemented on top of hmap + bmap. Hash conflicts are resolved by chaining, as in Java. By default a bmap stores at most 8 keys and 8 values, laid out as all keys followed by all values; the reason is to reduce alignment padding

  • When keys and values contain no pointers, bmap's overflow pointer is kept in hmap.extra, so buckets stay pointer-free and the GC does not need to scan them

  • Like Java, there is a key load factor; golang's default is 6.5. It is computed as count / number of buckets, i.e. the average number of occupied slots per bucket beyond which growth is triggered

  • golang's map growth is similar to Redis: progressive rehash, with each assisting write evacuating at most two old buckets. There are two triggers: reaching the load factor, or having too many overflow buckets (that threshold is capped at 2^15). The former doubles the bucket array; the latter rebuilds it at the same size

  • golang's rehash is similar to Java's: an entry either stays in place or moves to currentIndex + 2^oldB. The decision is hash & newbit (newbit being the old bucket count); for keys whose hash is not reproducible (e.g. NaN), tophash & 1 is used instead

  • The basic flow is the same for all operations

    • The bucket is located via the low B bits of the hash (hash & bucketMask); within the bucket, slots are matched against the top 8 bits (tophash)
    • If the bucket has no match, the search continues through its overflow buckets
    • Each slot whose tophash matches is then checked by full key comparison, and the corresponding operation proceeds
    • Lookup
      • If a grow is in progress (oldbuckets non-nil), check whether it is a doubling grow; if so, mask down to the old table size, and if the old bucket has not been evacuated yet, read from it
      • If a slot's tophash matches and the keys are equal, the value is returned directly
    • Insert
      • If a grow is in progress, assist it first (growWork)
      • Then scan the bucket's slots: an existing equal key is updated in place, otherwise the new entry goes into the first free slot (or a new overflow bucket)
      • Finally, check the two growth conditions; if either holds, hashGrow only allocates the new array and flips the flags, and the actual data movement happens incrementally on later writes
    • Delete
      • It likewise assists an in-progress grow first
      • Then it scans the bucket's slots, clears the matching key/value, and back-propagates emptyRest over trailing empty slots (to speed up future scans)
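
  • A quick demonstration of the iteration-order point picked up in the questions below:

    • package main
      
      import "fmt"
      
      func main() {
      	m := map[string]int{"a": 1, "b": 2, "c": 3}
      	// The start bucket and start slot are randomized per range loop
      	// (mapiterinit uses fastrand), so the order varies between runs
      	for k := range m {
      		fmt.Print(k, " ")
      	}
      }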

Questions

  • Why is map iteration order random?

    • Iteration starts at a random bucket and a random slot within it (mapiterinit seeds both with fastrand), and growth can move entries between buckets, so no stable order is ever observable
  • The role of overflow in bmap

    • When a bucket's 8 slots are full and a 9th key still hashes to it, a new bucket is allocated and linked via the overflow pointer, forming a chain
    • Why is the number of slots per bucket fixed at 8?
      • bucketCnt is hard-coded as 1 << bucketCntBits = 8. The 8-bit tophash is a separate design point; the source gives no explicit rationale for 8 slots (presumably a trade-off between scan cost and overflow-chain length)
    • When does the number of overflow buckets grow?
      • During put, when the target bucket and its existing overflow chain are full, newoverflow allocates another bucket and links it in (noverflow is incremented)
  • What is tophash and what is it for?

    • tophash is the top 8 bits of the key's hash
    • Its purpose:
      • Each occupied slot caches one byte of its key's hash, so a lookup can reject a slot with a single byte comparison instead of a full key comparison, moving on quickly when it does not match
  • When does growth trigger?

      1. When the average number of entries per bucket would exceed the load factor, i.e. count+1 > 6.5 * 2^B (e.g. with B = 3 and 8 buckets, inserting the 53rd entry triggers it)
      2. When the number of overflow buckets reaches 2^B, with B capped at 15 (so the threshold never exceeds 2^15)
  • Why does bmap store key/key/.../value/value/... instead of key/value pairs?

    • Mostly to avoid alignment padding: with interleaved pairs, every value may need padding after its key, while with grouped storage padding is needed at most once, between the key block and the value block (see the worked example in the memory-model section above). There is also a cache angle: CPU caches move fixed-size cache lines, and densely packed keys mean a key scan does not drag unneeded values into the cache
  • The roles of flags and B in hmap

    • flags records the map's state: writes set hashWriting (and both reads and writes check it to detect races), and separate bits track running iterators and the grow mode
    • B is the log2 of the bucket count: there are 2^B buckets, and a key's bucket index is hash & (2^B - 1)
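    • A sketch of the flag pattern used throughout map.go (the constant values below match the runtime source):

      const (
      	iterator     = 1 // an iterator may be using buckets
      	oldIterator  = 2 // an iterator may be using oldbuckets
      	hashWriting  = 4 // a goroutine is writing to the map
      	sameSizeGrow = 8 // the current grow is to a table of the same size
      )
      
      // Every mutation is bracketed like this:
      // if h.flags&hashWriting != 0 { throw("concurrent map writes") }
      // h.flags ^= hashWriting  // set the bit
      // ... mutate the map ...
      // h.flags &^= hashWriting // clear the bit (Go's AND NOT operator)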
