Algorithms Note 9 - Hash List


If all keys are small integers, you can use an array as an unordered symbol table: the key serves as the array index, and the value associated with the key is stored at that position. This gives fast access to any key. A hash table (hash list) builds on this idea but can handle more complex key types.
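As a minimal sketch of the array idea, assume every key is an integer between 0 and capacity-1. The class name `DirectAddressTable` and its methods are hypothetical, chosen only for illustration:

```java
// Sketch of an array used directly as a symbol table.
// Assumption: every key is an int in [0, capacity); other keys are rejected.
final class DirectAddressTable<Value> {
    private final Object[] vals; // vals[key] holds the value for key, or null

    DirectAddressTable(int capacity) {
        vals = new Object[capacity];
    }

    void put(int key, Value val) {
        checkKey(key);
        vals[key] = val;
    }

    @SuppressWarnings("unchecked")
    Value get(int key) {
        checkKey(key);
        return (Value) vals[key]; // null means the key is absent
    }

    private void checkKey(int key) {
        if (key < 0 || key >= vals.length)
            throw new IllegalArgumentException("key out of range: " + key);
    }
}
```

Both `put` and `get` cost a single array access, but the array needs one slot for every possible key, which is why this only works when the keys are small integers.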

Searching with hashing takes two steps. The first step uses a hash function to convert the key into an array index. Ideally, different keys would map to different indices, but in practice several keys can hash to the same index, which calls for the second step: collision resolution.

### Hash function

A hash function converts a key to an array index. If the capacity of the array is M, the hash function must map any key to an index in the range 0 to M-1. A good hash function satisfies the following conditions:

Consistency: equal keys must produce equal hash values;
Efficiency: the hash function should be cheap to compute;
Uniformity: the hash function should distribute all keys evenly.

In Java, every data type inherits a hashCode() method that returns a 32-bit integer. The hashCode() method of each data type must be consistent with equals(): if a.equals(b), then a.hashCode() == b.hashCode(). The converse does not hold: if a.hashCode() == b.hashCode(), a is not necessarily equal to b, because collisions are possible, so equals() is still needed to decide. The default hashCode() implementation returns a value derived from the object's identity (often described as its memory address), which is suitable only in rare cases, so Java overrides hashCode() for many common data types, such as String, Integer, Double, and File.
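The relationship between equals() and hashCode() can be sketched with a user-defined key type. `Point` and `HashCodeDemo` are hypothetical classes written for this illustration; `Objects.hash` is the standard library helper from java.util:

```java
import java.util.Objects;

// Hypothetical key type whose hashCode() is consistent with equals().
final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Point)) return false;
        Point that = (Point) other;
        return x == that.x && y == that.y;
    }

    @Override
    public int hashCode() {
        // Combines exactly the fields that equals() compares.
        return Objects.hash(x, y);
    }
}

final class HashCodeDemo {
    public static void main(String[] args) {
        Point a = new Point(1, 2);
        Point b = new Point(1, 2);
        System.out.println(a.equals(b));                        // true
        System.out.println(a.hashCode() == b.hashCode());       // true

        // Equal hash codes do not imply equal keys: "Aa" and "BB" collide.
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
        System.out.println("Aa".equals("BB"));                  // false
    }
}
```

The two strings collide because String.hashCode() computes 65*31 + 97 = 2112 for "Aa" and 66*31 + 66 = 2112 for "BB", which is why a lookup must still compare keys with equals().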

When implementing a hash table, keys must be mapped to array indices. The following hash() method is based on hashCode():


    private int hash(Key key) {
        return (key.hashCode() & 0x7fffffff) % m;
    }

hashCode() returns a 32-bit signed integer. The bitwise AND with 0x7fffffff masks off the sign bit, leaving a 31-bit non-negative integer, and taking the remainder modulo the array capacity m ensures that the result falls within the index range of the array. When m is a prime number, the hash values tend to be spread more evenly.
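One reason to mask the sign bit instead of calling Math.abs() is that Math.abs(Integer.MIN_VALUE) overflows and stays negative, which would yield a negative index. The following sketch shows the difference; `HashIndexDemo` and `index` are hypothetical names, and the table size 97 is an arbitrary prime picked for the demo:

```java
final class HashIndexDemo {
    // Same computation as hash() above, but taking the hash code and the
    // table size m as explicit parameters.
    static int index(int hashCode, int m) {
        return (hashCode & 0x7fffffff) % m;
    }

    public static void main(String[] args) {
        int m = 97; // a prime table size, chosen arbitrarily for the demo

        // Math.abs overflows for Integer.MIN_VALUE and returns a negative number,
        // so Math.abs(h) % m could be a negative (invalid) index.
        System.out.println(Math.abs(Integer.MIN_VALUE) < 0); // true

        // Masking the sign bit always yields an index in [0, m).
        System.out.println(index(Integer.MIN_VALUE, m));     // 0
        int i = index("hello".hashCode(), m);
        System.out.println(i >= 0 && i < m);                 // true
    }
}
```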

### Hash table based on separate chaining (zipper method)

##### Implementation

After the hash function converts the key to an array index, the next task is collision resolution. One approach is to have each array element point to a linked list in which every node stores a key-value pair whose key hashes to that element's index. This is separate chaining, also called the zipper method. To look up a key in a separate-chaining hash table, first use the hash value to find the corresponding linked list, then search that list sequentially for the key. To keep lookups efficient, the array capacity M should be large enough that the linked lists stay short on average.

The following is an implementation of a hash table based on separate chaining:


    // Depends on SequentialSearchST<Key, Value>, an unordered linked-list symbol
    // table that provides get(), put(), delete(), contains() and keys().
    public class SeparateChainingHashST<Key, Value> {
        private static final int INIT_CAPACITY = 4;

        private int n; // number of key-value pairs
        private int m; // number of chains (hash table size)
        private SequentialSearchST<Key, Value>[] st; // array of chains

        public SeparateChainingHashST() {
            this(INIT_CAPACITY);
        }

        @SuppressWarnings("unchecked")
        public SeparateChainingHashST(int m) {
            this.m = m;
            // Generic arrays cannot be created directly, so create a raw array and cast.
            st = (SequentialSearchST<Key, Value>[]) new SequentialSearchST[m];
            for (int i = 0; i < m; i++) {
                st[i] = new SequentialSearchST<Key, Value>();
            }
        }

        private int hash(Key key) {
            return (key.hashCode() & 0x7fffffff) % m;
        }

        public Value get(Key key) {
            if (key == null)
                throw new IllegalArgumentException("argument to get() is null");
            return st[hash(key)].get(key);
        }

        public void put(Key key, Value val) {
            if (key == null)
                throw new IllegalArgumentException("first argument to put() is null");

            // a null value means "remove this key"
            if (val == null) {
                delete(key);
                return;
            }

            // double table size if average length of list >= 10
            if (n >= 10 * m)
                resize(2 * m);

            int i = hash(key);
            if (!st[i].contains(key))
                n++;
            st[i].put(key, val);
        }

        public void delete(Key key) {
            if (key == null)
                throw new IllegalArgumentException("argument to delete() is null");

            int i = hash(key);
            if (st[i].contains(key))
                n--;
            st[i].delete(key);

            // halve table size if average length of list < 2
            if (m > INIT_CAPACITY && size() < 2 * m)
                resize(m / 2);
        }

        // rehash every key-value pair into a table with the given number of chains
        public void resize(int chains) {
            SeparateChainingHashST<Key, Value> temp = new SeparateChainingHashST<Key, Value>(chains);
            for (int i = 0; i < m; i++) {
                for (Key key : st[i].keys()) {
                    temp.put(key, st[i].get(key));
                }
            }
            this.m = temp.m;
            this.n = temp.n;
            this.st = temp.st;
        }

        public int size() {
            return n;
        }
    }

Each position of the array points to a linked list that is searched sequentially. The average length of a list is n/m. The shorter the average length, the faster the search, but the more memory is required. Here, the resize() method keeps n/m (that is, the average length of a linked list) between 2 and 10:


    public void put(Key key, Value val) {
        ...
        // double table size if average length of list >= 10
        if (n >= 10 * m)
            resize(2 * m);
        ...
    }

    public void delete(Key key) {
        ...
        if (m > INIT_CAPACITY && size() < 2 * m)
            resize(m / 2);
    }
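The SeparateChainingHashST class relies on a SequentialSearchST class that is not shown in this note. The following is a minimal sketch of what such an unordered linked-list symbol table might look like. It is an assumed stand-in that only provides the methods the hash table calls (get, put, delete, contains, keys), not necessarily the original class:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of an unordered linked-list symbol table; an assumed stand-in for the
// SequentialSearchST class that the hash table above depends on.
final class SequentialSearchST<Key, Value> {
    private Node first; // head of the linked list
    private int n;      // number of key-value pairs

    private final class Node {
        final Key key;
        Value val;
        Node next;

        Node(Key key, Value val, Node next) {
            this.key = key;
            this.val = val;
            this.next = next;
        }
    }

    public int size() {
        return n;
    }

    public boolean contains(Key key) {
        return get(key) != null;
    }

    // Sequential search: returns the value for key, or null if absent.
    public Value get(Key key) {
        for (Node x = first; x != null; x = x.next) {
            if (key.equals(x.key)) return x.val;
        }
        return null;
    }

    // Updates the value if the key exists, otherwise prepends a new node.
    public void put(Key key, Value val) {
        for (Node x = first; x != null; x = x.next) {
            if (key.equals(x.key)) {
                x.val = val;
                return;
            }
        }
        first = new Node(key, val, first);
        n++;
    }

    // Unlinks the node holding key, if any.
    public void delete(Key key) {
        Node prev = null;
        for (Node x = first; x != null; prev = x, x = x.next) {
            if (key.equals(x.key)) {
                if (prev == null) first = x.next;
                else prev.next = x.next;
                n--;
                return;
            }
        }
    }

    // All keys, in list order.
    public Iterable<Key> keys() {
        List<Key> result = new ArrayList<>();
        for (Node x = first; x != null; x = x.next) result.add(x.key);
        return result;
    }
}
```

Every operation walks the list from the head, so its cost grows with the length of the list, which is why the hash table keeps each chain short.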

##### Performance

Ideally, a hash function distributes all keys uniformly and independently over the indices 0 to M-1. No hash function satisfies this requirement completely in practice, but experiments show that the hash() method above produces a distribution very close to the ideal, so an analysis of hash table performance under this uniform hashing assumption is still of practical value.

Under the uniform hashing assumption, the average length of a linked list is n/m, and each list has the performance characteristics of an unordered linked list:

A search miss or the insertion of a new key requires about n/m comparisons;
A search hit requires about n/m comparisons in the worst case and, based on the earlier analysis of random search hits, comparisons with about half of the list on average.
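A rough way to check the uniformity assumption is to count how many keys fall into each chain. The sketch below is such an experiment under stated assumptions: the class `ChainLengthExperiment`, the generated keys "key0", "key1", and so on, and the sizes 10,000 and 997 are all made up for illustration, and the result depends on the key set used:

```java
final class ChainLengthExperiment {
    // Same index computation as the hash() method above.
    static int index(Object key, int m) {
        return (key.hashCode() & 0x7fffffff) % m;
    }

    // Returns how many of the keys "key0" .. "key(n-1)" land in each of m chains.
    static int[] chainLengths(int n, int m) {
        int[] counts = new int[m];
        for (int i = 0; i < n; i++) {
            counts[index("key" + i, m)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int n = 10_000, m = 997; // arbitrary demo sizes; m is prime
        int[] counts = chainLengths(n, m);
        int max = 0;
        for (int c : counts) max = Math.max(max, c);
        System.out.println("average chain length: " + (double) n / m);
        System.out.println("longest chain:        " + max);
    }
}
```

If the longest chain stays within a small multiple of the average n/m, the distribution is close enough to uniform for the n/m estimates above to be useful.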

### Hash table based on linear probing (linear detection)

##### Implementation

Another way to implement a hash table is to store N key-value pairs in an array of size M, where M > N, and use the empty positions in the array to resolve collisions. Methods based on this strategy are called open-addressing hash tables. Linear probing (linear detection) is the simplest of them: when a collision occurs, check the next position in the table. Concretely, a search starts at the index given by the hash value. If the key at that position equals the search key, the search hits; if the position is empty, the search misses; if the position holds a different key, the search continues with the next position.


    public class LinearProbingHashST<Key, Value> {
        private int n;        // number of key-value pairs
        private int m;        // size of the hash table
        private Key[] keys;
        private Value[] vals;

        public LinearProbingHashST() {
            this(16);
        }

        @SuppressWarnings("unchecked")
        public LinearProbingHashST(int capacity) {
            m = capacity; // keep m in sync with the length of the arrays
            keys = (Key[]) new Object[capacity];
            vals = (Value[]) new Object[capacity];
        }

        private int hash(Key key) {
            return (key.hashCode() & 0x7fffffff) % m;
        }

        public Value get(Key key) {
            if (key == null)
                throw new IllegalArgumentException("argument to get() is null");
            // probe until the key is found or an empty position is reached
            for (int i = hash(key); keys[i] != null; i = (i + 1) % m) {
                if (keys[i].equals(key)) {
                    return vals[i];
                }
            }
            return null;
        }

        public void put(Key key, Value val) {
            if (key == null)
                throw new IllegalArgumentException("first argument to put() is null");

            // double table size if the table is at least half full
            if (n >= m / 2)
                resize(2 * m);

            int i;
            for (i = hash(key); keys[i] != null; i = (i + 1) % m) {
                if (keys[i].equals(key)) {
                    vals[i] = val; // key already present: update its value
                    return;
                }
            }
            // i is now the first empty position after the key's hash index
            keys[i] = key;
            vals[i] = val;
            n++;
        }

        public void resize(int capacity) {
            LinearProbingHashST<Key, Value> st = new LinearProbingHashST<Key, Value>(capacity);
            for (int i = 0; i < m; i++) {
                if (keys[i] != null) {
                    st.put(keys[i], vals[i]);
                }
            }
            // adopt the new arrays only after every pair has been reinserted
            keys = st.keys;
            vals = st.vals;
            m = st.m;
        }

        public int size() {
            return n;
        }
    }

When probing the array, advancing the index with i = (i + 1) % m ensures that the index never leaves the array range: it wraps around to the beginning after reaching the end of the array.
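The wrap-around is easiest to see in a tiny trace. The sketch below is a simplified stand-in for the class above, not part of it: the hypothetical `ProbeTraceDemo` stores only non-negative int keys, uses key % m as the hash, and marks empty slots with -1:

```java
import java.util.Arrays;

final class ProbeTraceDemo {
    // Inserts non-negative int keys into a table of size m with linear probing
    // and returns the slot array (-1 marks an empty slot).
    // Assumption for the demo: fewer than m keys, so an empty slot always exists.
    static int[] insertAll(int[] keysToInsert, int m) {
        int[] slots = new int[m];
        Arrays.fill(slots, -1);
        for (int key : keysToInsert) {
            int i = key % m;               // hash of a small non-negative int
            while (slots[i] != -1 && slots[i] != key) {
                i = (i + 1) % m;           // wrap around at the end of the array
            }
            slots[i] = key;
        }
        return slots;
    }

    public static void main(String[] args) {
        // 4, 9 and 14 all hash to index 4 in a table of size 5.
        int[] slots = insertAll(new int[] {4, 9, 14}, 5);
        System.out.println(Arrays.toString(slots)); // [9, 14, -1, -1, 4]
    }
}
```

The key 4 takes slot 4; the key 9 collides there, wraps around, and lands in slot 0; the key 14 probes slots 4 and 0 before settling in slot 1.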

##### Performance

In separate chaining (the zipper method), n/m is the average length of a linked list and is generally greater than 1. In linear probing, n/m cannot exceed 1, because n must be less than or equal to m. You can think of n/m as the load factor (usage) of the hash table. Because linear probing detects a miss by reaching an empty position, the load factor must never reach 1; otherwise a search miss would loop forever. Even as the load factor merely approaches 1, performance becomes very poor. Analysis shows that when the load factor is below 0.5, a search needs on average at most about 1.5 probes for a hit and about 2.5 probes for a miss. In the implementation above, resize() is called when n >= m/2 to double the capacity of the array and keep the load factor low.
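The figures 1.5 and 2.5 match Knuth's classical estimates for linear probing under uniform hashing, which are not stated in this note: with load factor (usage) a = n/m, a search hit takes about (1 + 1/(1-a))/2 probes and a search miss about (1 + 1/(1-a)^2)/2. The sketch below only evaluates those two formulas; `ProbeEstimate` is a hypothetical helper class:

```java
final class ProbeEstimate {
    // Knuth's approximation for the average number of probes of a search hit.
    static double hit(double alpha) {
        return 0.5 * (1.0 + 1.0 / (1.0 - alpha));
    }

    // Knuth's approximation for a search miss (or an insertion).
    static double miss(double alpha) {
        double d = 1.0 - alpha;
        return 0.5 * (1.0 + 1.0 / (d * d));
    }

    public static void main(String[] args) {
        for (double alpha : new double[] {0.25, 0.5, 0.75, 0.9}) {
            System.out.printf("alpha=%.2f hit=%.2f miss=%.2f%n",
                    alpha, hit(alpha), miss(alpha));
        }
    }
}
```

At a = 0.5 the estimates are 1.5 and 2.5 probes, while at a = 0.9 the miss estimate is already about 50 probes, which is why the table is resized before it gets more than half full.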

### Summary

Either way, a hash table is a trade-off between time and space. With no memory limit, even a very large data set could use its keys directly as indices into a very large array, so that a lookup needs only one memory access; with no time limit, the data could be stored in an unordered array and searched sequentially. A hash table uses a moderate amount of both space and time, striking a balance between these two extremes, and you can tune the trade-off by adjusting the parameters of the hash table.

Posted by thechris on Sat, 25 Jan 2020 17:00:30 -0800
