# Underlying implementation principle of HashMap put method

Keywords: Java, data structures

The HashMap data structure is useful in most development scenarios, so it is worth understanding its underlying principles: knowing how it is designed lets us optimize our programs around that design. Below I walk through several questions and try to keep the explanation easy to follow.

1. How is the hash-to-index calculation implemented, and which design points are worth learning from?

2. How is HashMap expansion (resize) implemented under the hood?

3. We know that HashMap's capacity must be a power of two (16 by default). Why, and what is the impact if it is not?

4. What problem does tail insertion solve?

If you are also interested in these four questions, read on and I will share my thoughts.

## 1. How is the hash-to-index calculation implemented, and which design points are worth learning from?

Prerequisites: hash collisions and the binary bit operators.

Hash functions inevitably produce collisions; this follows from their mathematical basis. Therefore collisions will occur when HashMap stores data.

Bit operators: this knowledge is needed as a basis for what follows, mainly the rules of XOR (`^`), AND (`&`), OR (`|`) and unsigned right shift (`>>>`).

Operator rules:

• `^` (XOR): 0^0 = 0, 0^1 = 1, 1^0 = 1, 1^1 = 0. If the two bits are the same the result is 0, otherwise 1.
• `&` (AND): 0&0 = 0, 0&1 = 0, 1&0 = 0, 1&1 = 1. The result is 1 only when both bits are 1. Example: 3 & 5 = 1 (0011 & 0101 = 0001).
• `|` (OR): 0|0 = 0, 1|0 = 1, 0|1 = 1, 1|1 = 1. The result is 1 if either bit is 1. Example: 3 | 5 = 7 (0011 | 0101 = 0111).
• `>>>` (unsigned right shift): each shift by one divides by 2. Example: 16 >>> 3 = 2 (10000 >>> 3 = 00010).
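As a quick sanity check, the rules above can be verified directly in Java (the operands are the examples from the text):

```java
public class BitOps {
    public static void main(String[] args) {
        System.out.println(3 & 5);    // 0011 & 0101 = 0001 -> 1
        System.out.println(3 | 5);    // 0011 | 0101 = 0111 -> 7
        System.out.println(3 ^ 5);    // 0011 ^ 0101 = 0110 -> 6
        System.out.println(16 >>> 3); // 10000 >>> 3 = 00010 -> 2
    }
}
```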

Looking at these rules, a question: since `&` yields 1 only when both bits are 1, can it be used to compute a modulus? For example, AND the integer 12 (1100) with a mask of 3 (0011). No matter what the other operand is, the result of `&` with 0011 can never exceed 3, and in fact for non-negative x, `x & 3` equals `x % 4`. Keep this principle in mind; we will use it later.
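To make the question concrete, here is a small sketch: when the mask has the form 2^n − 1 (all low bits set), `x & mask` equals `x % (mask + 1)` for non-negative x:

```java
public class MaskModulo {
    public static void main(String[] args) {
        int mask = 15; // binary 1111, i.e. 2^4 - 1
        for (int x = 0; x < 100; x++) {
            // For an all-ones mask, & is equivalent to modulo (mask + 1)
            if ((x & mask) != (x % 16)) throw new AssertionError();
        }
        System.out.println(12 & 3); // 1100 & 0011 = 0000 -> 0, same as 12 % 4
    }
}
```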

Source code analysis:

Next, let's look at how the HashMap source code calculates the index, focusing on the underlying implementation of `put`.

```java
/**
 * Associates the specified value with the specified key in this map.
 * If the map previously contained a mapping for the key, the old
 * value is replaced.
 *
 * @param key key with which the specified value is to be associated
 * @param value value to be associated with the specified key
 * @return the previous value associated with <tt>key</tt>, or
 *         <tt>null</tt> if there was no mapping for <tt>key</tt>.
 *         (A <tt>null</tt> return can also indicate that the map
 *         previously associated <tt>null</tt> with <tt>key</tt>.)
 */
public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}
```

Then look at the source of `hash(key)`:

```java
/**
 * Computes key.hashCode() and spreads (XORs) higher bits of hash
 * to lower.  Because the table uses power-of-two masking, sets of
 * hashes that vary only in bits above the current mask will
 * always collide. (Among known examples are sets of Float keys
 * holding consecutive whole numbers in small tables.)  So we
 * apply a transform that spreads the impact of higher bits
 * downward. There is a tradeoff between speed, utility, and
 * quality of bit-spreading. Because many common sets of hashes
 * are already reasonably distributed (so don't benefit from
 * spreading), and because we use trees to handle large sets of
 * collisions in bins, we just XOR some shifted bits in the
 * cheapest possible way to reduce systematic lossage, as well as
 * to incorporate impact of the highest bits that would otherwise
 * never be used in index calculations because of table bounds.
 */
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
```

Seeing this raises some questions. Where does hashCode come from? Why XOR it with itself shifted right by 16 bits? What is the advantage? Could we skip that step, and if so, what problems would arise? With these questions in mind, let's look at the cleverness of the design.

Prerequisite: an int is 4 bytes, i.e. 32 bits, with a maximum value of 2^31 − 1.

If key == null, the key-value pair is stored in the bucket at index 0. Otherwise, the key's hashCode is computed and then XORed with itself shifted right 16 bits to obtain the hash value.

Why shift right by 16 bits?

To reduce collisions, further lowering the probability of hash conflicts. An int is 4 bytes (32 bits); XORing with the value shifted right by 16 bits folds the features of the high 16 bits into the low 16 bits, so both halves are represented.

Why XOR?

The high 16 bits are shifted right (unsigned) by 16 and XORed with the low 16 bits. If the two halves were instead combined directly with `&` or `|`, some of the information carried by the high 16 bits would be lost. XOR preserves the characteristics of both parts best: with `&` the result is biased toward 0 (a bit is 1 only when both inputs are 1), and with `|` it is biased toward 1 (a bit is 0 only when both inputs are 0).
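As a small sketch of this point (capacity 16 and the two sample hash codes are assumptions made up for illustration), two hashes that differ only in their high 16 bits collide without the spread but separate with it:

```java
public class SpreadDemo {
    // The same transform HashMap.hash applies to key.hashCode()
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int n = 16;      // default table capacity
        int a = 0x10000; // differs from b only in the high 16 bits
        int b = 0x20000;

        // Without spreading, (n - 1) & hash discards the high bits entirely:
        System.out.println((n - 1) & a); // 0
        System.out.println((n - 1) & b); // 0 -> same bucket, a collision

        // After spreading, the high-bit difference reaches the index:
        System.out.println((n - 1) & spread(a)); // 1
        System.out.println((n - 1) & spread(b)); // 2 -> different buckets
    }
}
```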

Here we continue with the source code: the hash-to-index calculation.

```java
/**
 * Implements Map.put and related methods.
 *
 * @param hash hash for key
 * @param key the key
 * @param value the value to put
 * @param onlyIfAbsent if true, don't change existing value
 * @param evict if false, the table is in creation mode.
 * @return previous value, or null if none
 */
final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    // Declare locals
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    // If the backing table is empty, resize (allocate) it first
    if ((tab = table) == null || (n = tab.length) == 0)
        // Capacity expansion
        n = (tab = resize()).length;
    // Compute the index and assign the node at that index to p;
    // if it is null there is no collision and the value can be stored directly
    if ((p = tab[i = (n - 1) & hash]) == null)
        // Direct assignment
        tab[i] = newNode(hash, key, value, null);
    else {
        // Not null: there is a collision, so the linked list must be
        // handled (or the bin upgraded)
        Node<K,V> e; K k;
        // If the existing node has the same hash and an equal key,
        // this put overwrites an existing mapping
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            // Remember the node in e; its value is overwritten below
            e = p;
        else if (p instanceof TreeNode)
            // If the bin has evolved into a red-black tree, insert a tree node
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        else {
            for (int binCount = 0; ; ++binCount) {
                // Linked-list handling: neither a tree bin nor a simple
                // value overwrite. If the successor of p is null...
                if ((e = p.next) == null) {
                    // ...append the new node after p
                    p.next = newNode(hash, key, value, null);
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        // If adding this node brings the list to the
                        // threshold length (8), upgrade to a red-black tree
                        treeifyBin(tab, hash);
                    break;
                }
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        // Value overwrite
        if (e != null) { // existing mapping for key
            V oldValue = e.value;
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            afterNodeAccess(e);
            return oldValue;
        }
    }
    ++modCount;
    if (++size > threshold)
        // Capacity expansion
        resize();
    afterNodeInsertion(evict);
    return null;
}
```

It's easy to spot `tab[i = (n - 1) & hash]` in the source. Clearly this is where the index is calculated. The default initial value of n is 16:

```java
/**
 * The default initial capacity - MUST be a power of two.
 */
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
```

That is where n comes from: shifting 1 left by 4 bits gives, in binary, 0001 << 4 = 10000 = 16.

We asked above whether `&` can achieve the effect of a modulo. Here the source masks with n − 1 = 15 (1111): any int ANDed with 1111 cannot exceed 1111, and every index from 0 to 15 can be produced. If the capacity were not a power of two, say 12, the mask n − 1 would be 11 (1011); because bit 2 of the mask is 0, the `&` operation could never produce indexes 4–7 (0100–0111), so only 8 of the 12 buckets would ever be used. This is why the capacity must be a power of two.
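A quick count makes this concrete. Note the source masks with n − 1, so for a capacity of 12 the mask is 11 (1011). The sketch below (capacities chosen for illustration) counts how many distinct bucket indexes `(cap - 1) & hash` can ever produce:

```java
import java.util.HashSet;
import java.util.Set;

public class CapacityDemo {
    // Count how many distinct bucket indexes (cap - 1) & hash can produce
    static int reachable(int cap) {
        Set<Integer> indexes = new HashSet<>();
        for (int hash = 0; hash < 1024; hash++) {
            indexes.add((cap - 1) & hash);
        }
        return indexes.size();
    }

    public static void main(String[] args) {
        System.out.println(reachable(16)); // mask 1111 -> all 16 buckets reachable
        System.out.println(reachable(12)); // mask 1011 -> only 8 of 12 buckets reachable
    }
}
```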

Summary of question 1:

• The hash is XORed with itself shifted right 16 bits in order to reduce collisions.
• The modulo is computed with the `&` operation, and the capacity must be a power of two; see above for the reason.

## 2. How is HashMap expansion implemented under the hood?

Let's guess first, and then check whether HashMap is actually implemented this way. A plausible expansion mechanism: HashMap recreates the Node[] array, recomputes the index of every entry, and reassigns each one into the new array until the last entry is done, replacing the old table with the new one.
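Our guess can be made concrete with a small sketch. This is only the naive approach we guessed, not the JDK's actual code, and the helper `naiveRehash` is a made-up name for illustration:

```java
public class NaiveRehash {
    // Recompute every hash's bucket index against the new capacity,
    // exactly as the guess describes (index = hash & (cap - 1))
    static int[] naiveRehash(int[] hashes, int newCap) {
        int[] indexes = new int[hashes.length];
        for (int i = 0; i < hashes.length; i++) {
            indexes[i] = hashes[i] & (newCap - 1);
        }
        return indexes;
    }

    public static void main(String[] args) {
        // With capacity 16, hashes 1, 17 and 33 all collide in bucket 1.
        // After doubling to 32, 17 moves to bucket 17 (1 + 16) while
        // 1 and 33 stay in bucket 1.
        int[] hashes = {1, 17, 33};
        System.out.println(java.util.Arrays.toString(naiveRehash(hashes, 32)));
    }
}
```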

Source code:

```java
/**
 * Initializes or doubles table size.  If null, allocates in
 * accord with initial capacity target held in field threshold.
 * Otherwise, because we are using power-of-two expansion, the
 * elements from each bin must either stay at same index, or move
 * with a power of two offset in the new table.
 *
 * @return the table
 */
final Node<K,V>[] resize() {
    // Declare locals
    Node<K,V>[] oldTab = table;
    // Record the length of the current node array
    int oldCap = (oldTab == null) ? 0 : oldTab.length;
    int oldThr = threshold;
    int newCap, newThr = 0;
    // If the current map already has a table
    if (oldCap > 0) {
        // If the capacity is already at the maximum, give up on
        // expanding and just raise the threshold to the int limit
        if (oldCap >= MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return oldTab;
        }
        // Otherwise double the capacity
        else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                 oldCap >= DEFAULT_INITIAL_CAPACITY)
            newThr = oldThr << 1; // double threshold
    }
    // Initialization with a caller-supplied capacity
    else if (oldThr > 0) // initial capacity was placed in threshold
        newCap = oldThr;
    // Initialization with defaults
    else {               // zero initial threshold signifies using defaults
        newCap = DEFAULT_INITIAL_CAPACITY;
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    if (newThr == 0) {
        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                  (int)ft : Integer.MAX_VALUE);
    }
    threshold = newThr;
    @SuppressWarnings({"rawtypes","unchecked"})
    Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    table = newTab;
    if (oldTab != null) {
        // The actual work of moving entries into the new table
        for (int j = 0; j < oldCap; ++j) {
            Node<K,V> e;
            if ((e = oldTab[j]) != null) {
                oldTab[j] = null;
                if (e.next == null)
                    newTab[e.hash & (newCap - 1)] = e;
                else if (e instanceof TreeNode)
                    ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                else { // preserve order
                    Node<K,V> loHead = null, loTail = null;
                    Node<K,V> hiHead = null, hiTail = null;
                    Node<K,V> next;
                    do {
                        next = e.next;
                        // If this bit of the hash is 0, the node keeps its index
                        if ((e.hash & oldCap) == 0) {
                            if (loTail == null)
                                loHead = e;
                            else
                                loTail.next = e;
                            loTail = e;
                        }
                        else {
                            // Otherwise the node moves to index j + oldCap
                            if (hiTail == null)
                                hiHead = e;
                            else
                                hiTail.next = e;
                            hiTail = e;
                        }
                    } while ((e = next) != null);
                    if (loTail != null) {
                        loTail.next = null;
                        newTab[j] = loHead;
                    }
                    if (hiTail != null) {
                        hiTail.next = null;
                        newTab[j + oldCap] = hiHead;
                    }
                }
            }
        }
    }
    return newTab;
}
```

Although the source here is a little long, the logic is clear, and the key code is actually only a small part. Look at the following loop, which is the core of the rehash:

```java
for (int j = 0; j < oldCap; ++j) {
    // Declare a working node
    Node<K,V> e;
    // If the current index holds data, assign it to e
    if ((e = oldTab[j]) != null) {
        // Clear the current index of the old Node[]
        oldTab[j] = null;
        if (e.next == null)
            // A single node with no collisions: recompute the index and place it
            newTab[e.hash & (newCap - 1)] = e;
        else if (e instanceof TreeNode)
            // A tree node [I didn't dig into this source; it shouldn't be
            // hard, see for yourself]: split the tree between the two new bins
            ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
        else { // preserve order
            Node<K,V> loHead = null, loTail = null;
            Node<K,V> hiHead = null, hiTail = null;
            Node<K,V> next;
            do {
                next = e.next;
                if ((e.hash & oldCap) == 0) {
                    // 1.1 The & with oldCap (e.g. 16, binary 10000) gives 0:
                    // the node belongs to the "low" list
                    if (loTail == null)
                        loHead = e;
                    else
                        loTail.next = e;
                    loTail = e;
                }
                else {
                    // 2.1 The & with oldCap gives a nonzero result:
                    // the node belongs to the "high" list
                    if (hiTail == null)
                        hiHead = e;
                    else
                        hiTail.next = e;
                    hiTail = e;
                }
            } while ((e = next) != null);
            if (loTail != null) {
                // 1.2 The low list keeps its original index j
                loTail.next = null;
                newTab[j] = loHead;
            }
            if (hiTail != null) {
                // 2.2 The high list is placed at index j + oldCap
                hiTail.next = null;
                newTab[j + oldCap] = hiHead;
            }
        }
    }
}
```

Look at my comments. This differs from our guess. Instead of recomputing every index, it ANDs each entry's hash with oldCap: if the result is 0, the entry keeps its position; if it is nonzero, the position becomes the original index + oldCap. This design is rather clever; it spreads the rehashed data as evenly as possible across the new table while avoiding a full index recomputation, improving efficiency.
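The split rule follows from power-of-two masking: doubling the capacity adds exactly one bit (worth oldCap) to the mask, so the new index is either the old index or the old index plus oldCap. A sketch checking this over a range of hashes:

```java
public class ResizeSplitDemo {
    public static void main(String[] args) {
        int oldCap = 16, newCap = 32;
        for (int hash = 0; hash < 256; hash++) {
            int oldIndex = hash & (oldCap - 1);
            int newIndex = hash & (newCap - 1);
            // The extra mask bit is exactly oldCap, so each node either
            // stays put or moves forward by oldCap:
            if ((hash & oldCap) == 0) {
                if (newIndex != oldIndex) throw new AssertionError();
            } else {
                if (newIndex != oldIndex + oldCap) throw new AssertionError();
            }
        }
        System.out.println("split rule holds");
    }
}
```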

## 3. We know that HashMap's capacity must be a power of two. Why, and what is the impact if it is not?

We already explained this in question 1. To recap: with the default capacity 16, the source masks with n − 1 = 15 (1111), so any int ANDed with the mask stays within 0–15 and every index can be produced. If the capacity were not a power of two, say 12, the mask n − 1 would be 11 (1011), and the `&` operation could never produce indexes 4–7, leaving a third of the buckets permanently empty.

Keeping the length of HashMap a power of two means data can land in every index, which reduces collisions and improves query efficiency.

## 4. What problem does tail insertion solve?