Relevant constants
```go
const (
	bucketCntBits = 3
	bucketCnt     = 1 << bucketCntBits // a bucket (bmap) stores at most 8 keys

	// the load factor is loadFactorNum/loadFactorDen = 6.5
	// (it determines when growth is triggered)
	loadFactorNum = 13
	loadFactorDen = 2

	maxKeySize  = 128
	maxElemSize = 128

	emptyRest      = 0 // this slot is free, and so is every slot after it (including the rest of the overflow chain)
	emptyOne       = 1 // only this slot is free
	evacuatedX     = 2 // during rehash, the entry was migrated to the X position (the same index in the new table)
	evacuatedY     = 3 // during rehash, the entry was migrated to the Y position (old index + 2^B)
	evacuatedEmpty = 4 // the slot is empty and the whole bucket has been evacuated
	minTopHash     = 5 // a tophash below minTopHash is a state marker; otherwise it stores the top 8 bits of the hash
)
```
- Each index in tophash corresponds to one key/value slot
map structure
-
src/runtime/map.go
-
The internal object is hmap
-
```go
// Note: the format of the hmap is also encoded in cmd/compile/internal/gc/reflect.go.
type hmap struct {
	count      int            // number of elements in the map
	flags      uint8          // status flags
	B          uint8          // log2 of the bucket count: len(buckets) == 2^B
	noverflow  uint16         // number of overflow buckets
	hash0      uint32         // hash seed, fed into the hash function
	buckets    unsafe.Pointer // pointer to the bucket array
	oldbuckets unsafe.Pointer // the previous bucket array, kept while growing
	nevacuate  uintptr        // progress counter of the incremental rehash (similar to redis)
	extra      *mapextra
}
```
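As a quick (and deliberately fragile) illustration of this header: a map value is a pointer to an hmap under the hood, so the first fields can be peeked at with unsafe. The hmapHeader mirror type below is hypothetical, for illustration only; real layouts vary across Go versions:

```go
package main

import (
	"fmt"
	"unsafe"
)

// hmapHeader is a hypothetical mirror of hmap's first fields,
// for illustration only; do not rely on this layout.
type hmapHeader struct {
	count int
	flags uint8
	B     uint8
}

func main() {
	m := make(map[int]int, 100)
	// a map value is represented as a *hmap, so reinterpret &m as **hmapHeader
	h := *(**hmapHeader)(unsafe.Pointer(&m))
	fmt.Println("count:", h.count, "B:", h.B) // expect count=0, B=4 (16 buckets: 100/16 <= 6.5)
}
```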
-
Like Java's HashMap, it also has the concept of a bucket; in Go this is bmap
-
```go
type bmap struct {
	tophash [bucketCnt]uint8 // note that a bucket can only store 8 keys
}

// The actual type generated after compilation is:
type bmap struct {
	topbits  [8]uint8
	keys     [8]keytype
	values   [8]valuetype
	pad      uintptr
	overflow uintptr // when K and V contain no pointers, overflow is moved to hmap.extra so that bmap stays pointer-free and gc need not scan it
}
```
-
The memory model of bmap is:
-
key/key/key/... => value/value/value/..., not key/value/key/value
-
(figure: bmap memory layout, golang_map_bmap.png)
-
Map initialization
-
```go
m1 := make(map[int]int)     // implemented by makemap_small
m2 := make(map[int]int, 10) // implemented by makemap; make accepts an optional initial-size hint
```
-
```go
func makemap_small() *hmap {
	h := new(hmap)
	h.hash0 = fastrand()
	return h
}
```

It simply news an hmap; the bucket array is not initialized
-
The core function:
-
```go
func makemap(t *maptype, hint int, h *hmap) *hmap {
	mem, overflow := math.MulUintptr(uintptr(hint), t.bucket.size)
	if overflow || mem > maxAlloc {
		hint = 0
	}
	if h == nil {
		h = new(hmap)
	}
	// get a random hash seed
	h.hash0 = fastrand()

	// hint is the expected size passed to make. Similar to Java's HashMap,
	// the initial bucket count ends up as a power of 2.
	B := uint8(0)
	for overLoadFactor(hint, B) {
		B++
	}
	h.B = B

	// when B == 0, the buckets array is initialized lazily on the first put
	if h.B != 0 {
		var nextOverflow *bmap
		// allocate the bucket array
		h.buckets, nextOverflow = makeBucketArray(t, h.B, nil)
		if nextOverflow != nil {
			h.extra = new(mapextra)
			h.extra.nextOverflow = nextOverflow
		}
	}
	return h
}
```
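To see what the B-selection loop does with a size hint, here is a runnable sketch that copies the constants and the loop from above (the runtime's bucketShift(B) is just 1<<B; the main function is mine):

```go
package main

import "fmt"

const (
	bucketCnt     = 8
	loadFactorNum = 13
	loadFactorDen = 2
)

// overLoadFactor mirrors the runtime check: true when count elements
// would exceed an average of 6.5 per bucket across 2^B buckets.
func overLoadFactor(count int, B uint8) bool {
	return count > bucketCnt && uintptr(count) > loadFactorNum*((uintptr(1)<<B)/loadFactorDen)
}

func main() {
	for _, hint := range []int{0, 8, 9, 52, 53, 100} {
		B := uint8(0)
		for overLoadFactor(hint, B) {
			B++
		}
		// hint=52 still fits in 8 buckets (52/8 is exactly 6.5); 53 pushes B to 4
		fmt.Printf("hint=%3d -> B=%d (%d buckets)\n", hint, B, 1<<B)
	}
}
```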
-
Summary
- During map initialization there are generally two cases:
- makemap_small: only the hmap is created; buckets are not initialized
- makemap: the capacity is rounded up to a power of 2, then the buckets array is initialized
Map put
-
Function: src/runtime/map.go#mapassign
-
Phase 1: initialization
-
```go
// ... routine debug checks omitted
// detect concurrent read/write
if h.flags&hashWriting != 0 {
	throw("concurrent map writes")
}
// the hash function is selected at compile time
hash := t.hasher(key, uintptr(h.hash0))
// mark the map as being written (used for the concurrent-access check)
h.flags ^= hashWriting
// if the map came from makemap_small, buckets have not been initialized yet
if h.buckets == nil {
	h.buckets = newobject(t.bucket) // newarray(t.bucket, 1)
}
```
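The hashWriting check is what produces the familiar crash on unsynchronized maps. A minimal repro (on most runs this dies with the unrecoverable runtime throw, not a panic):

```go
package main

// Two unsynchronized writers: the runtime's hashWriting check
// typically fires with "fatal error: concurrent map writes".
func main() {
	m := make(map[int]int)
	go func() {
		for i := 0; ; i++ {
			m[i%100] = i
		}
	}()
	for i := 0; ; i++ {
		m[i%100] = i
	}
}
```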
-
Phase 2: locating the bucket
-
```go
	// (earlier, at the `again:` label: bucket := hash & bucketMask(h.B))
	// get the memory address of the bucket
	b := (*bmap)(add(h.buckets, bucket*uintptr(t.bucketsize)))
	// the top 8 bits of the hash are the slot signature
	top := tophash(hash)

	var inserti *uint8
	var insertk unsafe.Pointer
	var elem unsafe.Pointer
bucketloop:
	for {
		// scan every cell
		for i := uintptr(0); i < bucketCnt; i++ {
			// the stored tophash does not match the high 8 bits of the hash
			if b.tophash[i] != top {
				// remember the first free slot seen, in case the key is never found
				if isEmpty(b.tophash[i]) && inserti == nil {
					inserti = &b.tophash[i]
					insertk = add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))
					elem = add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize))
				}
				// emptyRest: every slot after this one is free too, stop scanning
				if b.tophash[i] == emptyRest {
					break bucketloop
				}
				continue
			}
			// tophash matched: compare the actual key
			k := add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))
			// if the key is stored indirectly (as a pointer), dereference it
			if t.indirectkey() {
				k = *((*unsafe.Pointer)(k))
			}
			// only an equal key is updated
			if !t.key.equal(key, k) {
				continue
			}
			// update the key in place via memory copy, if needed
			if t.needkeyupdate() {
				typedmemmove(t.key, k, key)
			}
			// point elem at the existing value slot
			elem = add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize))
			goto done
		}
		// no match in this bucket: follow the overflow chain
		ovf := b.overflow(t)
		if ovf == nil {
			// the whole chain has been scanned
			break
		}
		b = ovf
	}

	// the key was not found. Grow if needed (average entries per bucket
	// >= load factor, or too many overflow buckets), then retry
	if !h.growing() && (overLoadFactor(h.count+1, h.B) || tooManyOverflowBuckets(h.noverflow, h.B)) {
		hashGrow(t, h)
		// growing invalidates everything computed so far, repeat from `again:`
		goto again
	}
```
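The tophash helper used above computes the 8-bit signature. A standalone mirror of it (a 64-bit uintptr is assumed here; minTopHash copied from the constants section, main is mine):

```go
package main

import "fmt"

const minTopHash = 5 // tophash values below this are reserved state markers

// tophash mirrors the runtime helper: take the top 8 bits of the hash,
// then bump past the reserved states so a stored hash never looks like
// emptyRest/emptyOne/evacuated*.
func tophash(hash uintptr) uint8 {
	top := uint8(hash >> (64 - 8))
	if top < minTopHash {
		top += minTopHash
	}
	return top
}

func main() {
	fmt.Println(tophash(0xAB00_0000_0000_0000)) // 171: stored as-is
	fmt.Println(tophash(0x0300_0000_0000_0000)) // 3 < 5, stored as 8
}
```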
-
Phase 3: allocating a new slot
-
Getting here means no existing key matched and no free slot was found in the scanned buckets, so a new slot (possibly in a new overflow bucket) is needed, and everything is assigned fresh:

```go
	if inserti == nil {
		// the current bucket and all of its overflow buckets are full; allocate a new one
		newb := h.newoverflow(t, b)
		inserti = &newb.tophash[0]
		insertk = add(unsafe.Pointer(newb), dataOffset)
		elem = add(insertk, bucketCnt*uintptr(t.keysize))
	}

	// store K and V; if they are stored indirectly, allocate them and store pointers
	if t.indirectkey() {
		kmem := newobject(t.key)
		*(*unsafe.Pointer)(insertk) = kmem
		insertk = kmem
	}
	if t.indirectelem() {
		vmem := newobject(t.elem)
		*(*unsafe.Pointer)(elem) = vmem
	}
	// copy the key into its slot
	typedmemmove(t.key, insertk, key)
	*inserti = top
	h.count++

	// finally, clear the writing flag
done:
	if h.flags&hashWriting == 0 {
		throw("concurrent map writes")
	}
	h.flags &^= hashWriting
	if t.indirectelem() {
		elem = *((*unsafe.Pointer)(elem))
	}
	return elem
```
-
Map growth (rehash)
-
golang's rehash is a progressive (incremental) process, like redis. hashGrow only allocates the new bucket array and parks the current one in oldbuckets (for the overflow-bucket trigger, the new array is the same size); no entries are moved yet. After that, every write or delete checks whether the map is mid-rehash and, if so, assists by evacuating a bucket or two, as growWork below shows
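For reference, growWork from the same map.go (its exact shape may vary slightly across Go versions): each assisting operation evacuates the bucket it is about to touch plus at most one more, so the rehash cost is bounded per operation:

```go
func growWork(t *maptype, h *hmap, bucket uintptr) {
	// make sure we evacuate the oldbucket corresponding
	// to the bucket we're about to use
	evacuate(t, h, bucket&h.oldbucketmask())

	// evacuate one more oldbucket to make progress on growing
	if h.growing() {
		evacuate(t, h, h.nevacuate)
	}
}
```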
-
The most critical function is evacuate
-
There are two triggers for map growth
-
One is reaching the load factor
-
The other is having too many overflow buckets
-
```go
// the load factor exceeds 6.5
func overLoadFactor(count int, B uint8) bool {
	return count > bucketCnt && uintptr(count) > loadFactorNum*(bucketShift(B)/loadFactorDen)
}

// too many overflow buckets relative to 2^B (B capped at 15)
func tooManyOverflowBuckets(noverflow uint16, B uint8) bool {
	if B > 15 {
		B = 15
	}
	return noverflow >= uint16(1)<<(B&15)
}
```
-
In the former case, the bucket count is doubled (B is incremented, a one-bit left shift); in the latter, the table is rebuilt at the same size. A toy illustration of where entries land after a doubling follows
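A runnable sketch (values and variable names are mine) of the X/Y split when doubling from 2^B to 2^(B+1) buckets: the bit that newly enters the index mask (newbit) decides whether an entry stays at its old index (X) or moves up by the old bucket count (Y):

```go
package main

import "fmt"

func main() {
	oldB := uint8(2)             // 4 old buckets
	newbit := uintptr(1) << oldB // 4: the bit added by doubling
	for _, hash := range []uintptr{0b0110, 0b0010, 0b1101, 0b1001} {
		oldIdx := hash & (newbit - 1)   // index among the 4 old buckets
		newIdx := hash & (2*newbit - 1) // index among the 8 new buckets
		movedToY := hash&newbit != 0    // moved iff the new bit is set
		fmt.Printf("hash=%04b old=%d new=%d movedToY=%v\n", hash, oldIdx, newIdx, movedToY)
	}
}
```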
-
```go
func evacuate(t *maptype, h *hmap, oldbucket uintptr) {
	b := (*bmap)(add(h.oldbuckets, oldbucket*uintptr(t.bucketsize)))
	newbit := h.noldbuckets()
	// first check whether this bucket was already evacuated
	// (recorded via the state values in tophash)
	if !evacuated(b) {
		// growth may double the table, so two destinations are prepared:
		// x for the original index, y for original index + 2^B
		var xy [2]evacDst
		x := &xy[0]
		x.b = (*bmap)(add(h.buckets, oldbucket*uintptr(t.bucketsize)))
		x.k = add(unsafe.Pointer(x.b), dataOffset)
		x.e = add(x.k, bucketCnt*uintptr(t.keysize))

		if !h.sameSizeGrow() {
			// doubling grow: entries may also land in the y destination
			y := &xy[1]
			y.b = (*bmap)(add(h.buckets, (oldbucket+newbit)*uintptr(t.bucketsize)))
			y.k = add(unsafe.Pointer(y.b), dataOffset)
			y.e = add(y.k, bucketCnt*uintptr(t.keysize))
		}

		// walk this bucket and its whole overflow chain
		for ; b != nil; b = b.overflow(t) {
			k := add(unsafe.Pointer(b), dataOffset)
			e := add(k, bucketCnt*uintptr(t.keysize))
			// walk every cell in the bucket
			for i := 0; i < bucketCnt; i, k, e = i+1, add(k, uintptr(t.keysize)), add(e, uintptr(t.elemsize)) {
				top := b.tophash[i]
				// an empty cell is simply marked as evacuated
				if isEmpty(top) {
					b.tophash[i] = evacuatedEmpty
					continue
				}
				// occupied, but below minTopHash: corrupted state
				if top < minTopHash {
					throw("bad map state")
				}
				k2 := k
				// dereference an indirect key
				if t.indirectkey() {
					k2 = *((*unsafe.Pointer)(k2))
				}
				var useY uint8
				if !h.sameSizeGrow() {
					// doubling grow: recompute the hash to decide x or y
					hash := t.hasher(k2, uintptr(h.hash0))
					if h.flags&iterator != 0 && !t.reflexivekey() && !t.key.equal(k2, k2) {
						// an iterator is running and this key does not hash
						// consistently (k != k, e.g. NaN), so use the low bit
						// of the old tophash to pick a destination: after
						// rehash the entry is either at its old index or at
						// old index + 2^B (similar in spirit to how Java's
						// HashMap splits a bin on resize)
						useY = top & 1
						top = tophash(hash)
					} else {
						if hash&newbit != 0 {
							useY = 1
						}
					}
				}

				if evacuatedX+1 != evacuatedY || evacuatedX^1 != evacuatedY {
					throw("bad evacuatedN")
				}

				b.tophash[i] = evacuatedX + useY // evacuatedX + 1 == evacuatedY
				dst := &xy[useY]                 // evacuation destination

				// the destination bucket is full: chain a new overflow bucket
				if dst.i == bucketCnt {
					dst.b = h.newoverflow(t, dst.b)
					dst.i = 0
					dst.k = add(unsafe.Pointer(dst.b), dataOffset)
					dst.e = add(dst.k, bucketCnt*uintptr(t.keysize))
				}
				dst.b.tophash[dst.i&(bucketCnt-1)] = top // mask dst.i as an optimization, to avoid a bounds check
				// copy the key and value (pointer copy or full copy)
				if t.indirectkey() {
					*(*unsafe.Pointer)(dst.k) = k2 // copy pointer
				} else {
					typedmemmove(t.key, dst.k, k) // copy elem
				}
				if t.indirectelem() {
					*(*unsafe.Pointer)(dst.e) = *(*unsafe.Pointer)(e)
				} else {
					typedmemmove(t.elem, dst.e, e)
				}
				dst.i++
				dst.k = add(dst.k, uintptr(t.keysize))
				dst.e = add(dst.e, uintptr(t.elemsize))
			}
		}
		// finally, clear the old bucket's keys/values so gc can reclaim them
		if h.flags&oldIterator == 0 && t.bucket.ptrdata != 0 {
			b := add(h.oldbuckets, oldbucket*uintptr(t.bucketsize))
			ptr := add(b, dataOffset)
			n := uintptr(t.bucketsize) - dataOffset
			memclrHasPointers(ptr, n)
		}
	}

	if oldbucket == h.nevacuate {
		// check whether the whole rehash is done; if so, the flags
		// and oldbuckets are cleared
		advanceEvacuationMark(h, t, newbit)
	}
}
```
Map deletion
-
```go
func mapdelete(t *maptype, h *hmap, key unsafe.Pointer) {
	// ... debug checks omitted
	if h == nil || h.count == 0 {
		if t.hashMightPanic() {
			t.hasher(key, 0) // see issue 23734
		}
		return
	}
	// concurrent read/write check
	if h.flags&hashWriting != 0 {
		throw("concurrent map writes")
	}

	// compute the hash for this key
	hash := t.hasher(key, uintptr(h.hash0))

	// set the writing flag
	h.flags ^= hashWriting

	// the low B bits of the hash select the bucket index
	bucket := hash & bucketMask(h.B)
	// if a grow is in progress, assist it first
	if h.growing() {
		growWork(t, h, bucket)
	}
	// offset to the bmap at the head of the chain
	b := (*bmap)(add(h.buckets, bucket*uintptr(t.bucketsize)))
	bOrig := b
	// the high 8 bits of the hash
	top := tophash(hash)
search:
	for ; b != nil; b = b.overflow(t) {
		for i := uintptr(0); i < bucketCnt; i++ {
			if b.tophash[i] != top {
				// emptyRest: nothing follows this cell, stop searching early
				if b.tophash[i] == emptyRest {
					break search
				}
				continue
			}
			k := add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))
			k2 := k
			// dereference an indirect key
			if t.indirectkey() {
				k2 = *((*unsafe.Pointer)(k2))
			}
			if !t.key.equal(key, k2) {
				continue
			}
			// clear the key
			if t.indirectkey() {
				*(*unsafe.Pointer)(k) = nil
			} else if t.key.ptrdata != 0 {
				memclrHasPointers(k, t.key.size)
			}
			// locate and clear the value
			e := add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize))
			if t.indirectelem() {
				*(*unsafe.Pointer)(e) = nil
			} else if t.elem.ptrdata != 0 {
				memclrHasPointers(e, t.elem.size)
			} else {
				memclrNoHeapPointers(e, t.elem.size)
			}
			// mark the cell as free
			b.tophash[i] = emptyOne
			// if the bucket now ends in a run of emptyOne cells,
			// upgrade them to emptyRest so scans can stop early
			if i == bucketCnt-1 {
				if b.overflow(t) != nil && b.overflow(t).tophash[0] != emptyRest {
					goto notLast
				}
			} else {
				if b.tophash[i+1] != emptyRest {
					goto notLast
				}
			}
			for {
				b.tophash[i] = emptyRest
				if i == 0 {
					if b == bOrig {
						break // beginning of initial bucket, we're done.
					}
					// find the previous bucket, continue at its last entry
					c := b
					for b = bOrig; b.overflow(t) != c; b = b.overflow(t) {
					}
					i = bucketCnt - 1
				} else {
					i--
				}
				if b.tophash[i] != emptyOne {
					break
				}
			}
		notLast:
			h.count--
			// Reset the hash seed to make it more difficult for attackers to
			// repeatedly trigger hash collisions. See issue 25237.
			if h.count == 0 {
				h.hash0 = fastrand()
			}
			break search
		}
	}

	// clear the writing flag
	if h.flags&hashWriting == 0 {
		throw("concurrent map writes")
	}
	h.flags &^= hashWriting
}
```
Map lookup
-
```go
func mapaccessK(t *maptype, h *hmap, key unsafe.Pointer) (unsafe.Pointer, unsafe.Pointer) {
	if h == nil || h.count == 0 {
		return nil, nil
	}
	hash := t.hasher(key, uintptr(h.hash0))
	m := bucketMask(h.B)
	// the low B bits of the hash select the bucket
	b := (*bmap)(unsafe.Pointer(uintptr(h.buckets) + (hash&m)*uintptr(t.bucketsize)))
	// oldbuckets != nil means a grow is in progress
	if c := h.oldbuckets; c != nil {
		if !h.sameSizeGrow() {
			// there used to be half as many buckets; mask down one more power of two
			m >>= 1
		}
		oldb := (*bmap)(unsafe.Pointer(uintptr(c) + (hash&m)*uintptr(t.bucketsize)))
		if !evacuated(oldb) {
			// the old bucket has not been rehashed yet, so the data
			// is still in its old place: read from the old bucket
			b = oldb
		}
	}
	top := tophash(hash)
bucketloop:
	// walk the bucket chain, matching cells to find the result
	for ; b != nil; b = b.overflow(t) {
		for i := uintptr(0); i < bucketCnt; i++ {
			if b.tophash[i] != top {
				if b.tophash[i] == emptyRest {
					break bucketloop
				}
				continue
			}
			k := add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))
			if t.indirectkey() {
				k = *((*unsafe.Pointer)(k))
			}
			if t.key.equal(key, k) {
				e := add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.elemsize))
				if t.indirectelem() {
					e = *((*unsafe.Pointer)(e))
				}
				return k, e
			}
		}
	}
	return nil, nil
}
```
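For orientation: the two user-visible access forms map onto different runtime entry points (mapaccess1 backs `v := m[k]`, mapaccess2 backs `v, ok := m[k]`); mapaccessK above is the variant used by iteration, returning both key and value. A quick demo of the comma-ok form:

```go
package main

import "fmt"

func main() {
	m := map[string]int{"a": 1}
	v, ok := m["a"] // compiles to a runtime mapaccess2 call
	fmt.Println(v, ok) // 1 true
	v, ok = m["b"]
	fmt.Println(v, ok) // 0 false: a missing key yields the zero value
}
```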
Summary
-
The Go map defines a number of state markers, such as emptyOne and emptyRest, which let scans fail fast
-
At the bottom layer, map is implemented with hmap + bmap. Hash conflicts are resolved by chaining, similar to Java. By default a bmap stores at most 8 keys and values, and the in-bucket layout is all keys followed by all values, in order to reduce padding
-
When keys and values contain no pointers, bmap's overflow pointer is tracked via hmap.extra, so the bucket itself stays pointer-free and gc can skip scanning it
-
Similar to Java, there is a key load factor; Go's default is 6.5. It is computed as count / number of buckets, i.e. the average number of entries per bucket at which growth is recommended
-
Map growth is similar to redis: progressive rehash, evacuating at most two buckets per assisting operation. There are two triggers: reaching the load factor, which doubles the bucket count, and having too many overflow buckets (the threshold is capped at 2^15), which grows at the same size
-
Go's rehash placement is similar to Java: an entry either stays at its old index or moves to old index + the old bucket count, decided by the newly significant hash bit (hash & newbit); the tophash & 1 trick is used only for keys that do not hash consistently (e.g. NaN) while an iterator is active
-
- The basic process is the same
- Go locates the bucket with the low B bits of the hash; within the bucket, the high 8 bits of the hash are matched against the tophash entries
- If the bucket has no match, the search continues in its overflow buckets
- Cells are scanned in turn, and when a tophash matches, the full key is compared and the corresponding handling begins
- lookup
- If a grow is in progress (oldbuckets is non-nil) and it is a doubling grow rather than a same-size grow, the old bucket index is computed with the smaller mask; if that old bucket has not been evacuated yet, the data is still there and it is read instead
- If a matching tophash and key is found, the value is returned directly
- insert
- First checks whether a grow is in progress; if so, it assists the grow first
- Then scans the cells in the bucket: an existing key is updated, otherwise new data is inserted
- Finally checks the two growth conditions; if either is met, growth is only prepared (hashGrow marks the map as growing and allocates the new array) and the insert retries, rather than rehashing everything at once
- delete
- Likewise checks whether a grow is in progress and assists if so
- Then scans the cells in the bucket and clears the matching entry; finally, as an optimization, trailing emptyOne marks are upgraded to emptyRest so future scans can stop early
Questions
-
Why is map iteration order random
- The hash seed is random for each map and rehash moves entries between buckets; in addition, each range starts at a random bucket and offset (see the demo below)
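A quick demonstration (the printed order typically differs between runs and even between the three loops, since each range starts at a random bucket/offset):

```go
package main

import "fmt"

func main() {
	m := map[int]string{1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}
	for round := 0; round < 3; round++ {
		for k := range m { // iteration starts at a random bucket and offset
			fmt.Print(k, " ")
		}
		fmt.Println()
	}
}
```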
-
The role of overflow in bmap
- A bucket holds a fixed number of entries (8), so when a 9th key lands in the same bucket a new bucket is allocated and linked through the overflow pointer, forming a linked list
- Why is the number of entries per bucket fixed
- Presumably because tophash stores the upper 8 bits of the hash, hence 8 (though there seems to be no hard basis for this)
- When does the number of overflow buckets increase
- During put, when the bucket and all of its chained overflow buckets are full, a new bucket is allocated and linked via overflow
-
What is tophash and what is its function
- tophash is the upper 8 bits of the hash
- Its function:
- Fast filtering: each slot's stored tophash is compared first, and only on a match is the full key compared; on a mismatch the scan moves to the next slot immediately
-
Timing of capacity expansion
-
- When the average number of entries per bucket >= the load factor (6.5)
- When there are too many overflow buckets (the threshold is 2^B, capped at 2^15)
-
Why does bmap use the key/key/key/value/value/value layout instead of key/value
- It relates to alignment and the CPU cache: cache lines are fixed-size, and data straddling two lines is hit less efficiently; grouping all keys together keeps them naturally aligned, so padding is needed only once, after the values, instead of after every key/value pair (see the sketch below)
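A runnable sketch of the padding argument, using hypothetical slot structs (not the runtime's actual types): for a map[int64]int8-like shape, interleaving wastes 7 bytes per pair, while grouping pays alignment padding only once:

```go
package main

import (
	"fmt"
	"unsafe"
)

// Hypothetical layouts for 8 slots of (int64 key, int8 value).
type interleaved struct {
	k int64
	v int8
	// 7 bytes of padding follow v so the next pair's k stays 8-byte aligned
}

type grouped struct {
	keys   [8]int64 // all keys first: naturally aligned, no padding
	values [8]int8  // then all values: padding appears once, at the end
}

func main() {
	fmt.Println(8 * unsafe.Sizeof(interleaved{})) // 128 bytes for 8 interleaved pairs
	fmt.Println(unsafe.Sizeof(grouped{}))         // 72 bytes for the grouped layout
}
```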
-
The role of flags and B in hmap
- flags tracks concurrency state: a write sets the hashWriting bit, and both reads and writes check it to detect concurrent read/write; other bits record active iterators and same-size growth. B is the log2 of the bucket count, i.e. len(buckets) == 2^B (see the flag bits below)
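For reference, the flag bits defined in map.go (same vintage as the code above; the comments are the runtime's):

```go
// flags
const (
	iterator     = 1 // there may be an iterator using buckets
	oldIterator  = 2 // there may be an iterator using oldbuckets
	hashWriting  = 4 // a goroutine is writing to the map
	sameSizeGrow = 8 // the current map growth is to a new map of the same size
)
```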