Java Containers: HashMap in Detail - the most frequently used container, but do you really understand it?

Keywords: Java, JDK

Preface

Learning record

  • Time: week 3
  • SMART sub-goal: Java containers

This post records the knowledge points I learned about Java containers, focusing on the points that matter most for HashMap.

Overview of knowledge points:

  • I. hashCode()
  • II. HashMap's underlying implementation

    0. Introduction
    1. Storage structure
    2. Important attributes
    3. Adding elements

      Q: Why is the default initial length of HashMap 16, and why must the length stay a power of 2 after every resize()?

    4. HashMap expansion

      Q: The HashMap dead-loop problem

    5. Java 8 versus Java 7
    6. Why use red-black trees?
  • III. Conclusion

I. hashCode()

In the Object class, hashCode() is a method declared with the native modifier, and its JavaDoc describes it as returning the hash value of the object.

So what's the effect of the hash value as a return value?

It mainly serves hash-based collections such as HashSet, HashMap and Hashtable: it ensures that elements are not duplicated when inserted and improves the efficiency of inserting and deleting elements. In short, it exists to make lookups convenient.

Take Set for example.

As we all know, the elements of a Set cannot be duplicated. If every new element were compared one by one with the elements already in the set, inserting 100,000 entries would be very inefficient.

This is where the hash table comes in. When a value is added, its hash value is calculated first, and the data is inserted at the position determined by that hash value. This avoids the inefficiency of always comparing with equals().

Specifically manifested in:

  • If the specified position is empty, the element is added directly.
  • If the specified position is not empty, equals() is called to determine whether the two elements are the same; if they are the same, the new element is not stored.

In the second case, if the two elements are different, but hashCode() is the same, it is what we call a hash collision.

The probability of a hash collision depends on how hashCode() is computed and on the capacity of the table.

In this case, a linked list is created at that position, and the elements whose hashes map to the same position are stored in the linked list.

HashMap uses separate chaining (the "zipper" method) to resolve hashCode conflicts.
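To make collisions concrete, here is a minimal sketch. The Point class and its deliberately coarse hashCode() are hypothetical, made up purely for illustration: two distinct points share a hash value, so the hash-based set falls back to equals() and keeps both.

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical class: equal coordinates mean equal objects, and hashCode() is
    // deliberately coarse so that different points can share a hash value.
    class Point {
        final int x, y;

        Point(int x, int y) { this.x = x; this.y = y; }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }

        @Override
        public int hashCode() {
            return x + y; // (1, 2) and (2, 1) collide on purpose
        }

        public static void main(String[] args) {
            Set<Point> set = new HashSet<>();
            set.add(new Point(1, 2));
            set.add(new Point(2, 1)); // same hashCode, not equals: kept, chained in the same bucket
            set.add(new Point(1, 2)); // same hashCode and equals: rejected as a duplicate
            System.out.println(set.size()); // 2
        }
    }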

Summary

The hashCode is an object's identifier; in Java an object's hashCode is an int value. Using the hashCode to pick an array index lets us quickly locate the object's position in the array, and we then traverse the linked list there to find the corresponding value. Ideally the time complexity is O(1), and different objects can have the same hashCode.

II. HashMap's underlying implementation

0. Introduction

  1. HashMap implements the Map interface based on a hash table and stores data as key-value pairs.
  2. It is not thread-safe.
  3. Keys and values may be null.
  4. The mappings in a HashMap are not ordered.
  5. In JDK 1.8, HashMap is composed of array + linked list + red-black tree; the red-black tree was added as an underlying data structure.
  6. When the length of the linked list stored in a hash bucket is greater than 8, the list is converted into a red-black tree; when it drops below 6, the red-black tree is converted back into a linked list.
  7. The source code differs considerably before and after 1.8.

1. Storage structure

In JDK 1.8, HashMap is composed of arrays + linked lists + red-black trees, adding red-black trees as the underlying data structure.

The hash is used to determine the position in the array. If a hash collision occurs, the entries at that position are stored as a linked list. But if the linked list grows too long, HashMap converts it into a red-black tree; the threshold is 8.

The overall structure of HashMap is an array of buckets, where each bucket holds either a linked list or a red-black tree of nodes.
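As a rough sketch of what sits in each bucket (not the real JDK class, which lives as java.util.HashMap.Node and also implements Map.Entry<K,V>), every entry carries the cached hash, the key, the value, and a next pointer for chaining:

    // Simplified sketch of a bucket entry; the real JDK class also implements Map.Entry<K,V>
    class Node<K, V> {
        final int hash;     // cached hash of the key, used to choose the bucket
        final K key;
        V value;
        Node<K, V> next;    // next node in the same bucket (the chain)

        Node(int hash, K key, V value, Node<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }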

2. Important attributes

2.1 table

      /**
     * The table, initialized on first use, and resized as
     * necessary. When allocated, length is always a power of two.
     * (We also tolerate length zero in some operations to allow
     * bootstrapping mechanics that are currently not needed.)
     */
    transient Node<K,V>[] table;

In JDK 1.8, we learned that HashMap is a structure consisting of arrays, linked lists and red-black trees, in which table is the array in HashMap.

2.2 size

    /**
     * The number of key-value mappings contained in this map.
     */
    transient int size;

Number of key-value pairs stored in HashMap.

2.3 loadFactor

    /**
     * The load factor for the hash table.
     *
     * @serial
     */
    final float loadFactor;

Load factor. The load factor is a coefficient that balances space utilization against the collision rate. When the total number of elements > array length * load factor, the expansion operation is performed.

2.4 threshold

    /**
     * The next size value at which to resize (capacity * load factor).
     *
     * @serial
     */
    // (The javadoc description is true upon serialization.
    // Additionally, if the table array has not been allocated, this
    // field holds the initial array capacity, or zero signifying
    // DEFAULT_INITIAL_CAPACITY.)
    int threshold;

Expansion threshold. Threshold = array length * load factor. Expansion operation is performed after exceeding.

2.5 TREEIFY_THRESHOLD/UNTREEIFY_THRESHOLD

    /**
     * The bin count threshold for using a tree rather than list for a
     * bin.  Bins are converted to trees when adding an element to a
     * bin with at least this many nodes. The value must be greater
     * than 2 and should be at least 8 to mesh with assumptions in
     * tree removal about conversion back to plain bins upon
     * shrinkage.
     */
    static final int TREEIFY_THRESHOLD = 8;

    /**
     * The bin count threshold for untreeifying a (split) bin during a
     * resize operation. Should be less than TREEIFY_THRESHOLD, and at
     * most 6 to mesh with shrinkage detection under removal.
     */
    static final int UNTREEIFY_THRESHOLD = 6;

Tree threshold. When the length of a linked list stored in a hash bucket is greater than 8, the linked list will be converted into a red-black tree, and when it is less than 6, it will be converted from a red-black tree to a linked list.

3. Adding Elements

    public V put(K key, V value) {
        return putVal(hash(key), key, value, false, true);
    }

3.1 hash()

You can see that putVal() is the actual operation to add elements. Before putVal() is executed, the hash() method is executed on the key. Let's see what's done inside.

    static final int hash(Object key) {
        int h;
        // key.hashCode(): returns the key's hash value (its hashCode)
        // ^   : bitwise XOR
        // >>> : unsigned right shift; the sign bit is ignored and the vacated high bits are filled with 0
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

The key == null branch shows that HashMap supports a null key.

The corresponding method in Hashtable uses the key directly to obtain its hashCode and has no key == null check, so Hashtable does not support null keys.

Back to this hash() method. In technical terms it is called the perturbation function.

The purpose of hash() is to guard against poorly implemented hashCode() methods; in other words, to reduce hash collisions.

The JDK 1.8 hash method is simpler than the JDK 1.7 one, but the principle is unchanged. Let's look at what JDK 1.7 does.

        // code in JDK1.7
        static int hash(int h) {
            // This function ensures that hashCodes that differ only by
            // constant multiples at each bit position have a bounded
            // number of collisions (approximately 8 at default load factor).
            h ^= (h >>> 20) ^ (h >>> 12);
            return h ^ (h >>> 7) ^ (h >>> 4);
        }

Compared with the JDK 1.8 hash method, the JDK 1.7 version performs slightly worse, because it perturbs the hash four times (four rounds of shifts and XORs).
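To see what the perturbation buys us, here is a small, self-contained sketch (the class name and the sample hash values are made up for illustration): two hashCodes that differ only in their high 16 bits would land in the same slot of a 16-slot table, but separate once the high bits are folded in.

    public class PerturbationDemo {
        // Same spreading step as JDK 1.8's hash(): XOR the high 16 bits into the low 16 bits
        static int hash(Object key) {
            int h;
            return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
        }

        public static void main(String[] args) {
            int n = 16; // table length: only the low 4 bits of the hash pick the bucket
            // Two hashCodes that differ only in their high bits (Integer.hashCode() is the value itself)
            System.out.println((n - 1) & 0x10000);       // 0 - without perturbation both fall into bucket 0
            System.out.println((n - 1) & 0x20000);       // 0
            System.out.println((n - 1) & hash(0x10000)); // 1 - after perturbation they separate
            System.out.println((n - 1) & hash(0x20000)); // 2
        }
    }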

3.2 putVal()

Let's look at the putVal() method that actually performs the operation of adding elements.

    final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        // When the array is empty or the length is 0, the resize() method is initialized or expanded.
        if ((tab = table) == null || (n = tab.length) == 0)
            n = (tab = resize()).length;
        // Compute array subscript i = n-1 & hash
        // If there is no element at this position, create a new Node to store the value directly
        if ((p = tab[i = (n - 1) & hash]) == null)
            tab[i] = newNode(hash, key, value, null);
        else {
            // This location has elements
            Node<K,V> e; K k;
            if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
                // hash value and key value are equal, and e variable is used to get the reference of the element at the current position, which is later used to replace the existing value.
                e = p;
            else if (p instanceof TreeNode)
                // The bucket currently stores a red-black tree, so the tree's own insertion method putTreeVal() is executed
                e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
            else {
                // Currently, it is stored in a linked list and begins to traverse the linked list.
                for (int binCount = 0; ; ++binCount) {
                    if ((e = p.next) == null) {
                        // Here is the insertion to the end of the list!
                        p.next = newNode(hash, key, value, null);
                        if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                            // Over the threshold, the storage mode is transformed into red-black tree
                            treeifyBin(tab, hash);
                        break;
                    }
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        break;
                    p = e;
                }
            }
            if (e != null) { // existing mapping for key
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    // onlyIfAbsent if true - does not overwrite existing values
                    // Assign new values in
                    e.value = value;
                afterNodeAccess(e);
                return oldValue;
            }
        }
        // Record the number of modifications
        ++modCount;
        // If the number of elements exceeds the threshold value, the capacity will be enlarged.
        if (++size > threshold)
            resize();
        afterNodeInsertion(evict);
        return null;
    }

3.3 Why is the default initial length of HashMap 16, and why must the length be a power of 2 after each resize()?

This is a common interview question. The design described by this question actually exists to serve the hash algorithm that maps a key to an array index.

As mentioned earlier, in order to make HashMap storage efficient, we should minimize hash collisions, that is to say, elements should be distributed as evenly as possible.

The hash value ranges from -2147483648 to 2147483647, which adds up to about 4 billion possible values. As long as the hash function maps keys relatively uniformly and loosely, collisions are hard to produce in ordinary applications. But the problem is that an array with 4 billion slots cannot fit in memory, so the hash value cannot be used directly as an index.

So we need a mapping algorithm, and that is the (n - 1) & hash calculation that appears in 3.2.

Let's further demonstrate this algorithm:

  1. Suppose you have a key "book".
  2. The hashCode of "book" is computed; the result is 3029737 in decimal, or 1011100011101011101001 in binary.
  3. Assuming the length of the HashMap is the default 16, length - 1 is 15 in decimal, or 1111 in binary.
  4. AND the two results together: 1011100011101011101001 & 1111 = 1001, which is 9 in decimal, so index = 9.

In this way, (n - 1) & hash achieves the same effect as the modulo operation hashCode % length, which is 3029737 % 16 = 9 in the example above.

And the bitwise operation performs much better than the modulo.
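A small sketch to verify this equivalence (the class name is arbitrary); for a non-negative hash and a power-of-two length, (length - 1) & hash and hash % length produce the same index:

    public class IndexDemo {
        public static void main(String[] args) {
            int hash = "book".hashCode(); // 3029737
            int length = 16;              // a power of two

            System.out.println(Integer.toBinaryString(hash)); // 1011100011101011101001
            System.out.println((length - 1) & hash);          // 9
            System.out.println(hash % length);                // 9 - same index, but modulo is slower
        }
    }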

At this point you may still wonder why the length must be a power of 2: the reason lies precisely in this bitwise calculation.

When the length is 16 or any other power of 2, length - 1 has every binary bit set to 1. In that case the resulting index simply equals the low bits of the hashCode, so as long as the input hashCodes are evenly distributed, the results of this hash algorithm are uniform. If the length of the HashMap were not a power of 2, some indexes could never occur, which obviously violates the expectation of a uniform distribution. That is why the source code keeps emphasizing power-of-two expansion and that the size must be a power of two.

In addition, the HashMap constructor allows the user to pass in a capacity that is not a power of 2, because it automatically rounds the given capacity up to the next power of 2.

    /**
     * Returns a power of two size for the given target capacity.
     */
    static final int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }
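Here is a standalone copy of that logic you can run on its own; the class name is arbitrary, and MAXIMUM_CAPACITY is redeclared locally (with HashMap's value of 1 << 30) so the sketch compiles by itself:

    public class TableSizeDemo {
        static final int MAXIMUM_CAPACITY = 1 << 30; // same constant as in HashMap

        // Copy of the tableSizeFor() logic: the smallest power of two >= cap
        static int tableSizeFor(int cap) {
            int n = cap - 1;
            n |= n >>> 1;
            n |= n >>> 2;
            n |= n >>> 4;
            n |= n >>> 8;
            n |= n >>> 16;
            return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
        }

        public static void main(String[] args) {
            System.out.println(tableSizeFor(16)); // 16 - already a power of two
            System.out.println(tableSizeFor(17)); // 32 - rounded up to the next power of two
            System.out.println(tableSizeFor(1));  // 1
        }
    }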

4. HashMap expansion

Next, let's talk about HashMap expansion.

4.1 Expansion

The initial length of HashMap is 16. Suppose the number of key-value pairs keeps growing while the capacity of the table array stays unchanged: hash collisions will occur more and more often, and lookups will become slower and slower. So when the number of key-value pairs exceeds a certain threshold, HashMap performs the expansion (resize) operation.

So how to calculate the threshold of expansion?

Threshold = array length * load factor

threshold = capacity * loadFactor

After each expansion, the threshold is doubled.

The above calculation appears in the resize() method. This method is explained in detail below. Let's move on.

The load factor, as we mentioned earlier, is a coefficient that balances space utilization against collisions. Why 0.75? That is simply a trade-off the authors considered reasonable. Of course, you can also set the load factor yourself through the constructor public HashMap(int initialCapacity, float loadFactor).
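For example (the capacities and load factor below are arbitrary illustration values), a short sketch of constructing a HashMap with custom settings:

    import java.util.HashMap;
    import java.util.Map;

    public class LoadFactorDemo {
        public static void main(String[] args) {
            // Defaults: capacity 16, load factor 0.75 -> resize once the size exceeds 16 * 0.75 = 12
            Map<String, Integer> defaults = new HashMap<>();

            // Hypothetical custom values: a requested capacity of 20 is rounded up to 32,
            // so the expansion threshold becomes 32 * 0.5 = 16
            Map<String, Integer> custom = new HashMap<>(20, 0.5f);

            defaults.put("a", 1);
            custom.put("b", 2);
        }
    }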

Now back to the protagonist, the resize() method:

    final Node<K,V>[] resize() {
        // Old Array Reference
        Node<K,V>[] oldTab = table;
        // Old array length
        int oldCap = (oldTab == null) ? 0 : oldTab.length;
        // Old threshold
        int oldThr = threshold;
        // New Array Length, New Threshold
        int newCap, newThr = 0;
        if (oldCap > 0) {
            if (oldCap >= MAXIMUM_CAPACITY) {
                // Old arrays have exceeded their maximum capacity
                // Set the threshold to the maximum, return the old array directly and do nothing else.
                threshold = Integer.MAX_VALUE;
                return oldTab;
            }
            // newCap has doubled
            else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                     oldCap >= DEFAULT_INITIAL_CAPACITY)
                // Perform expansion operation, new threshold = old threshold*2
                newThr = oldThr << 1; // double threshold
        }
        else if (oldThr > 0) // initial capacity was placed in threshold
            // The initial threshold is set manually
            // Array capacity = initial threshold
            newCap = oldThr;
        else {               // zero initial threshold signifies using defaults
            // Initialization operation
            // Array capacity = default initial capacity
            newCap = DEFAULT_INITIAL_CAPACITY;
            // Initial threshold = capacity * default load factor
            newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
        }
        if (newThr == 0) {
            // If none of the previous thresholds have been set
            float ft = (float)newCap * loadFactor;
            newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                      (int)ft : Integer.MAX_VALUE);
        }
        // Update threshold
        threshold = newThr;
        @SuppressWarnings({"rawtypes","unchecked"})
            // Create arrays
            Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
        // Update the array referenced by the table
        table = newTab;
        if (oldTab != null) {
            // Capacity expansion
            for (int j = 0; j < oldCap; ++j) {
                // Traversing through old arrays
                Node<K,V> e;
                if ((e = oldTab[j]) != null) {
                    // Remove the head node from this position
                    // Cancel old references to facilitate garbage collection
                    oldTab[j] = null;
                    if (e.next == null)
                        newTab[e.hash & (newCap - 1)] = e;
                    else if (e instanceof TreeNode)
                        // Treatment of Red-Black Trees
                        ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                    else { // preserve order
                        // Linked-list processing; this part is actually very clever.
                        // Two lists are defined: a low (lo) list and a high (hi) list
                        Node<K,V> loHead = null, loTail = null;
                        Node<K,V> hiHead = null, hiTail = null;
                        Node<K,V> next;
                        do {
                            next = e.next;
                            if ((e.hash & oldCap) == 0) {
                                if (loTail == null)
                                    loHead = e;
                                else
                                    loTail.next = e;
                                loTail = e;
                            }
                            else {
                                if (hiTail == null)
                                    hiHead = e;
                                else
                                    hiTail.next = e;
                                hiTail = e;
                            }
                        } while ((e = next) != null);
                        if (loTail != null) {
                            loTail.next = null;
                            newTab[j] = loHead;
                        }
                        if (hiTail != null) {
                            hiTail.next = null;
                            newTab[j + oldCap] = hiHead;
                        }
                    }
                }
            }
        }
        return newTab;
    }

I don't know whether you followed the code above; I was a little dizzy the first time I saw it, but once understood, it feels very clever.

Let's take the linked-list processing as an example. While traversing the old table array, it splits the list at each position into a high list and a low list. What exactly does that mean? Look at the example below.

  1. There is a HashMap with capacity 16. It holds six elements A/B/C/D/E/F, where A/B/C have a hash value of 5 and D/E/F have a hash value of 21. We know the array index is computed with (n - 1) & hash (equivalent to a modulo operation), so A/B/C/D/E/F all end up at index = 5.
  2. Assuming they are inserted in order, index 5 holds the list A -> B -> C -> D -> E -> F.
  3. When this HashMap is resized, we have an old array, oldTable[], with a capacity of 16, and a new array, newTable[], with a capacity of 32 (double the old capacity).
  4. When the traversal reaches index = 5 of the old array, we enter the list-processing code mentioned above. For each element of the list we compute hash & oldCap: for A/B/C (hash 5) the result is 0, so they go to the low list; for D/E/F (hash 21) the result is non-zero, so they go to the high list.
  5. Then the low list is placed at index = 5 of the new array, and the high list at index = 5 + 16 = 21 of the new array.

The red-black-tree-related operations, though coded differently, actually do the same thing: they separate the elements at the same position into the new table array according to that extra hash bit. A small sketch of the split calculation follows; I hope you can understand it here.
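This minimal sketch reproduces just the split calculation, using the hash values 5 and 21 from the example above (the class name is arbitrary):

    public class ResizeSplitDemo {
        public static void main(String[] args) {
            int oldCap = 16;
            int[] hashes = {5, 21}; // the hash values from the example above

            for (int h : hashes) {
                int oldIndex = h & (oldCap - 1);   // index in the old 16-slot table
                boolean stays = (h & oldCap) == 0; // the extra bit that resize() examines
                int newIndex = stays ? oldIndex : oldIndex + oldCap;
                System.out.println("hash=" + h + " old index=" + oldIndex + " new index=" + newIndex);
            }
            // hash=5  -> bit is 0, stays in the low list at index 5
            // hash=21 -> bit is 1, moves to the high list at index 5 + 16 = 21
        }
    }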

4.2 The HashMap dead-loop problem

Java 7's HashMap can run into an infinite loop. The main reason is that in Java 7, when a resize transfers the entries, the order of the linked list is inverted. If another thread modifies the reference relationships between nodes of the original list during the transfer, a circular linked list can form at some hash bucket. A subsequent get(key) whose hash maps to that bucket, even if the key is not actually present there, will then keep walking the circular list and the program enters an infinite loop.

Java 8 does not produce an infinite loop under the same conditions, because after a Java 8 resize the order of the list is unchanged and the original reference relationships between nodes are maintained.

For a detailed illustrated demonstration, see "Cartoon: HashMap with high concurrency".
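The practical takeaway is that, in any version, a HashMap should simply not be shared between writing threads. As a hedged sketch of the usual alternative (the key name and counts below are arbitrary), ConcurrentHashMap handles concurrent updates safely:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class SafeMapDemo {
        public static void main(String[] args) throws InterruptedException {
            // HashMap is not thread-safe; concurrent writes (especially during a resize) are undefined.
            // A thread-safe alternative:
            Map<String, Integer> map = new ConcurrentHashMap<>();

            Runnable writer = () -> {
                for (int i = 0; i < 10_000; i++) {
                    map.merge("counter", 1, Integer::sum); // atomic per-key update
                }
            };

            Thread t1 = new Thread(writer);
            Thread t2 = new Thread(writer);
            t1.start();
            t2.start();
            t1.join();
            t2.join();
            System.out.println(map.get("counter")); // 20000
        }
    }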

5. Java 8 versus Java 7

  1. When a hash conflict occurs, Java 7 inserts at the head of the linked list while Java 8 inserts at the tail.
  2. When data is transferred during expansion, Java 7 inverts the order of the linked list while Java 8 keeps the original order.
  3. Java 8 introduces red-black trees, which greatly optimizes the performance of HashMap.
  4. When a put reaches the threshold, Java 7 expands first and then adds the element, while Java 8 adds the element first and then expands.

6. Why use red-black trees?

A lot of people would probably answer "to improve search performance", but more specifically, red-black trees are used to improve HashMap's performance under extreme hash conflicts.

Here is a benchmark code to test HashMap performance:

import com.google.caliper.Param;
import com.google.caliper.Runner;
import com.google.caliper.SimpleBenchmark;
public class MapBenchmark extends SimpleBenchmark {
     private HashMap<Key, Integer> map;
     @Param
     private int mapSize;
     @Override
     protected void setUp() throws Exception {
          map = new HashMap<>(mapSize);
          for (int i = 0; i < mapSize; ++i) {
           map.put(Keys.of(i), i);
          }
     }
     public void timeMapGet(int reps) {
          for (int i = 0; i < reps; i++) {
           map.get(Keys.of(i % mapSize));
          }
     }
}
class Key implements Comparable<Key> {
    private final int value;
    Key(int value) {
        this.value = value;
    }
    @Override
    public int compareTo(Key o) {
        return Integer.compare(this.value, o.value);
    }
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass())
            return false;
        Key key = (Key) o;
        return value == key.value;
    }
    // @Override
    // public int hashCode() {
    //  return value;
    // }

    @Override
    public int hashCode() {
        // The hash values returned by key are the same
        return 0;
    }
}
public class Keys {
     public static final int MAX_KEY = 10_000_000;
     private static final Key[] KEYS_CACHE = new Key[MAX_KEY];
     static {
          for (int i = 0; i < MAX_KEY; ++i) {
           KEYS_CACHE[i] = new Key(i);
          }
     }
     public static Key of(int value) {
          return KEYS_CACHE[value];
     }
}

As you can see, the hash value returned by the Key objects has been fixed to the same value, in order to measure the query-performance gap between the Java 7 and Java 8 versions of HashMap under the same conditions, that is, under what we called extreme hash conflicts.

The results for Java 7 are predictable: the cost of HashMap.get() grows proportionally with the size of the HashMap itself. Since all key-value pairs sit in one huge linked list in the same bucket, looking up an entry requires traversing half of the list (of size n) on average, hence the O(n) complexity visible on the graph.

Java 8 is a different story: performance improves a lot. Under such catastrophic hash conflicts, the same benchmark executed on JDK 8 shows O(log n) worst-case performance.

This algorithmic optimization is described in JEP 180.

In addition, if the Key objects are not Comparable, there is no performance improvement under heavy hash conflicts (because the red-black tree implementation needs the compare method to determine ordering).

Some might ask: where would such an extreme hash conflict ever occur?

This is actually a security consideration; under normal circumstances it is rare to have that many conflicts. Imagine, however, that the keys come from an untrusted source (for example, HTTP header names received from a client). Since the hashing algorithm is well known, it is not difficult to deliberately craft keys that all produce the same hash value, and your HashMap would then face exactly this extreme hash conflict. If many query requests are then executed against this HashMap, you will find that queries become very slow, CPU usage climbs, and the program may even stop serving requests.

III. Conclusion

The HashMap-related knowledge ends here. It is very long, and it took me a long time to write.

My feeling from the whole learning process is that knowledge should interpenetrate and verify itself across topics. HashMap shares many implementation principles with other Maps. For example, ConcurrentHashMap is roughly the same as HashMap, except that the former uses segmented locks to ensure thread safety; the underlying principle of Hashtable is also the same as HashMap's, except that Hashtable uses synchronized for synchronization and has hardly been updated since its introduction, so it is considered an obsolete class; and both HashMap and TreeMap involve red-black trees. When you study them in contrast, you will find that once you have learned one of them, you have more or less understood the others, because many principles are applied across them.

The next chapter will cover other Map types and compare them.

Reference resources

  1. "Code Efficiency"
  2. https://dzone.com/articles/ha...
  3. https://stackoverflow.com/que...
  4. Cartoon: HashMap with high concurrency
  5. https://mp.weixin.qq.com/s/RI...

Posted by steanders on Mon, 29 Jul 2019 21:06:25 -0700