Efficient implementation of a dictionary tree structure

Keywords: Java

Notes on the Dictionary Tree (Trie)

Common Dictionary Tree Implementation Method

class Node {
    int node;      // value stored at this node
    int[] next;    // one child slot per possible character code
}

Or something like this

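import java.util.Map;

class Node {
    int node;                  // value stored at this node
    Map<Integer, Node> next;   // only the children that actually exist, keyed by character code
}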

The first form guarantees lookup speed, but the child arrays of a dictionary tree are sparse, so space utilization is low. This is especially wasteful for Chinese and Japanese text, where the character set is large. For that reason some implementations choose the second form. It saves space, but at the expense of speed: a lookup in a tree-based map is O(log n), which is certainly not as fast as an O(1) array index.
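To make the trade-off concrete, here is a minimal sketch of the map-based variant. The class name MapTrie and its methods are made up for this illustration and are not from the paper or the test source code; TreeMap is chosen only because it has the O(log n) lookup mentioned above.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical map-based trie, used only to illustrate the second node layout.
class MapTrie {
    static final class Node {
        boolean terminal;                                   // true if a word ends at this node
        final Map<Character, Node> next = new TreeMap<>();  // sparse children, O(log n) lookup
    }

    private final Node root = new Node();

    void insert(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            // create the child on first use, otherwise follow the existing one
            cur = cur.next.computeIfAbsent(c, k -> new Node());
        }
        cur.terminal = true;
    }

    boolean contains(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.next.get(c);
            if (cur == null) return false;
        }
        return cur.terminal;
    }

    public static void main(String[] args) {
        MapTrie trie = new MapTrie();
        trie.insert("bachelor");
        trie.insert("badge");
        System.out.println(trie.contains("badge"));
        System.out.println(trie.contains("bad"));
    }
}
```

Each node only pays for the children it really has, which is where the space saving comes from; every step of a lookup pays for a map search instead of an array index.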

Another efficient implementation

Here is another implementation, which stores the dictionary tree in two arrays, base and check. It compresses the sparse child arrays without reducing lookup speed. This is the double-array dictionary tree (double-array trie). Its principle is very simple and comes down to the following transition rule:

if (check[base[pre] + string.charAt(i) ] == pre) 
   pre = base[pre] + string.charAt(i)

The author does not spell out the meaning of the base and check arrays, but in short they encode a parent-child relationship. base[pre] is the offset added to a character code to find the child of pre, and check stores the parent of the node that occupies a given slot.
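A small hand-built example may help here. The arrays below encode just the two words "ab" and "ac"; the character codes (a=1, b=2, c=3, ...) and the class name are assumptions made for this sketch, and the root sits at index 1 as in the rest of this article.

```java
// Hand-built double array for the words "ab" and "ac" (illustrative values only).
// Assumed character codes: a=1, b=2, c=3, ...; index 1 is the root.
class DoubleArrayWalk {
    //                            idx: 0  1  2  3  4  5  6  7
    static final int[] BASE  = {       0, 1, 1, 0, 0, 0, 0, 0 };
    static final int[] CHECK = {       0, 1, 1, 2, 2, 0, 0, 0 };

    // Follows the transition rule for every character.
    // Returns the index of the node reached, or -1 if a transition is missing.
    static int walk(String s) {
        int pre = 1;
        for (int i = 0; i < s.length(); i++) {
            int code = s.charAt(i) - 'a' + 1;
            int next = BASE[pre] + code;
            if (next < 0 || next >= CHECK.length || CHECK[next] != pre) {
                return -1;   // slot is empty or belongs to another parent
            }
            pre = next;
        }
        return pre;
    }

    public static void main(String[] args) {
        System.out.println(walk("ab"));  // prints 3
        System.out.println(walk("ac"));  // prints 4
        System.out.println(walk("ad"));  // prints -1
    }
}
```

Node a lives at index 2 because base[1] + 1 = 2, and check[2] = 1 confirms that its parent is the root. Its children b and c live at base[2] + 2 = 3 and base[2] + 3 = 4, both with check equal to 2.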

The paper I read that day improves on this structure; its title carries the subtitle Reduced Trie.
The most obvious difference from the original double-array dictionary tree is an additional tail array, which, as its English name suggests, holds the tails of the strings. In a reduced trie the base and check arrays store only the prefix part of each string; everything after the prefix is stored in the tail array.

So how do you locate a string's suffix in the tail array? In the base array, the base value of the last node of each string's prefix is the negated index of its suffix in tail. For example, base[10] = -12 means that the suffix starts at tail[12] and runs until the end character.
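As a rough sketch of that lookup, assuming '#' is the end character and tail is a plain char array (both are assumptions for this example, as is the helper name readTail used here):

```java
// Illustrative only: how a negative base value points into the tail array.
class TailLookup {
    static final char END = '#';   // assumed end-of-string marker

    // Reads the suffix that starts at tail[pos] and stops before the end marker.
    static String readTail(char[] tail, int pos) {
        StringBuilder sb = new StringBuilder();
        for (int i = pos; i < tail.length && tail[i] != END; i++) {
            sb.append(tail[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] base = new int[16];
        char[] tail = new char[32];
        "achelor#".getChars(0, 8, tail, 12);   // store the suffix starting at tail[12]
        base[10] = -12;                        // node 10 is an end-point; its suffix is at tail[12]
        System.out.println(readTail(tail, -base[10]));  // prints achelor
    }
}
```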

The advantages are obvious: it saves space in base and check, and it avoids the transition computation for the non-prefix part. It also has disadvantages. The biggest one concerns offline storage. A plain double-array dictionary tree is easy to compress and store as a file, because only the pair of arrays base and check is needed. The tail array is independent of base and check, so it requires a second file. Another disadvantage is that the tail array becomes heavily fragmented; the reason lies in how insertion works.

Concrete realization

In fact, once the role of the tail array is clear, this structure is quick to understand as a double-array dictionary tree plus tail.

insert

Insertion is the more complex part. It is divided into four cases, which the original paper describes as follows:

1. Insertion of the new word when the double-array is empty. 
2. Insertion of the new word without any collisions. 
3. Insertion of the new word with a collision; in this case, additional characters must be added to the BASE and characters must be removed from the TAIL array to resolve the collision, but nothing already in the BASE array must be removed. 
4. When insertion of the new word with a collision as in case 3 occurs, values in the BASE array must be moved.

In other words, roughly:

1. When the double array is empty, insert the new word directly.
2. Insertion without any conflict.
3. Insertion with a conflict: nothing already in the base array needs to be moved, but characters must be taken out of tail and added to base to resolve the conflict.
4. Insertion with a conflict similar to case 3, but values already in the base array must be moved.

If you have implemented a double-array dictionary tree, you will quickly see what the author means by a conflict. If not, then in plain language there are the following two kinds.

The first kind: base[pre] is negative (base[i] < 0 indicates that node i is an end-point) and a new node has to be inserted below it. To resolve this conflict we must compare against the tail array. Take bachelor and badge as an example. After bachelor is inserted, node b is in the tree and achelor is in the tail array. When badge is inserted, b matches and its base is negative, so we compare the a in badge with the a in tail. Because they are equal, we create one new node for a and continue.

What if the characters differ? In the same example, after walking past a we come to c and d, which are obviously not equal, so we need to create two new nodes, one for each character. The principle is the same as above.
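The bachelor/badge walk-through can be checked with a few lines of code. This is only a sketch of the comparison step, not of the insertion itself: the class and method names are invented here, and '#' stands in for the end character.

```java
// Illustrative only: how far a stored tail and the rest of a new key agree.
class TailSplit {
    // Number of leading characters shared by the stored tail suffix and the
    // remaining part of the key being inserted. Each shared character becomes
    // one new single-child node; the first differing pair becomes two new nodes.
    static int sharedPrefix(String tailSuffix, String rest) {
        int n = Math.min(tailSuffix.length(), rest.length());
        int i = 0;
        while (i < n && tailSuffix.charAt(i) == rest.charAt(i)) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String tail = "achelor#";   // what is left of "bachelor" after node b
        String rest = "adge#";      // what is left of "badge" after node b
        int k = sharedPrefix(tail, rest);
        // prints: 1 shared, split on c/d
        System.out.println(k + " shared, split on " + tail.charAt(k) + "/" + rest.charAt(k));
    }
}
```

After the split, the two new end-points would carry the remaining suffixes helor# and ge# in the tail array.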

The second kind: base[pre] > 0 and check[base[pre] + char] != pre. This conflict requires a base value to change. Its essence is that two nodes want the same slot, base[pre] + char, so the base value of one of them must be changed. The node whose base changes is either pre or check[base[pre] + char].

So which one do we change? Choose the node with fewer children, because changing a node's base value means that all of its children move to new slots. If those children have children of their own (call them grandchildren), then the check values of the grandchildren must be updated too, since their parent now sits at a new index. Once a child has been copied to its new slot, the old slot can be cleared to zero. After that the new node can be inserted.

But there is a big pitfall here: if the two conflicting nodes happen to be parent and child, the current node itself is among the nodes that move, so its index must be updated before the new child is attached to it.

Maybe the explanation above is still not clear enough, so here is the implementation code that resolves these two conflicts.

The code uses the following helpers:
xCheck(List list) searches upward from 1 for a value q such that every x in the list satisfies check[q+x] == 0.
moveTail(int x) shifts the tail characters from position x up to the end character one place to the left.
writeTail(int[] value, int x) writes the value array, starting from its position x, into the tail array.
put(int key, int value) records value in the cached list of children of node key.
At initialization time, base[1] = 1 and check[1] = 1.
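Since xCheck is only described and not shown, here is one way it could look. This is a sketch under assumptions: the check array is passed in explicitly, character codes are positive, and slots past the end of the array count as free (a real implementation would grow the arrays at that point).

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of xCheck; not the code from the test source.
class XCheckSketch {
    // Returns the smallest q >= 1 such that check[q + x] == 0 for every x in codes.
    // Slots beyond the end of the array are treated as free.
    static int xCheck(int[] check, List<Integer> codes) {
        for (int q = 1; ; q++) {
            boolean allFree = true;
            for (int x : codes) {
                int slot = q + x;
                if (slot < check.length && check[slot] != 0) {
                    allFree = false;   // slot already owned by some parent
                    break;
                }
            }
            if (allFree) return q;
        }
    }

    public static void main(String[] args) {
        int[] check = new int[16];
        check[1] = 1;
        check[3] = 1;
        check[4] = 2;
        // q=1 hits slot 3, q=2 hits slot 4, q=3 uses the free slots 5 and 6
        System.out.println(xCheck(check, Arrays.asList(2, 3)));  // prints 3
    }
}
```

The loop always terminates because every slot past the end of the array is considered free.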

The first conflict

if (base[pre] < 0) {
    // pre is an end-point: its suffix is stored in tail starting at -base[pre]
    int oldBase = base[pre];
    if (tail[-oldBase] == keyValue[i]) {
        // same character: move it out of tail into one new node and keep going
        base[pre] = xCheck(keyValue[i]);
        base[ base[pre]+keyValue[i] ] = oldBase;   // the new node inherits the tail pointer
        check[ base[pre]+keyValue[i] ] = pre;
        put(pre, keyValue[i]);
        moveTail(-oldBase);                        // drop the consumed character from tail
        pre = base[pre] + keyValue[i];
        continue;
    } else {
        // different characters: split into two new nodes
        List<Integer> list = new ArrayList<>();
        list.add(tail[-oldBase]); list.add(keyValue[i]);
        base[pre] = xCheck(list);
        base[ base[pre]+tail[-oldBase] ] = oldBase;     // old word keeps its tail
        base[ base[pre]+keyValue[i] ] = -position;      // position: presumably the next free slot in tail
        check[ base[pre]+tail[-oldBase] ] = check[ base[pre]+keyValue[i] ] = pre;
        writeTail(keyValue, i+1);                       // store the rest of the new key in tail
        put(pre, tail[-oldBase]);
        put(pre, keyValue[i]);
        moveTail(-oldBase);                             // drop the split character from the old tail
        break; // two new nodes created, insertion finished
    }
}

The second kind of conflict

@param node1 the position of the current node
@param node2 another existing node that conflicts with node1
@param newNodeValue the value of the node to be inserted

Please pay attention to the big pitfall marked in the code.

public int processConflict(int node1, int node2, int newNodeValue) {
    int node = (lists[node1].size()+1) < lists[node2].size() ? node1 : node2;
    int oldNodeBase = base[node];
    if (node == node1) {
        base[node] = xCheck(lists[node], newNodeValue);
    } else {
        base[node] = xCheck(lists[node]);
    }
    for (int i = 0; i < lists[node].size(); i++) {
        int oldNext = oldNodeBase + lists[node].get(i);
        int newNext = base[node] + lists[node].get(i);
        if (oldNext == node1) node1 = newNext; // the big pitfall: node1 itself was moved, so follow it
        base[newNext] = base[oldNext];
        check[newNext] = node;
        if (base[oldNext] > 0) {
            for (int j = 0; j < lists[oldNext].size(); j++) {
                check[ base[oldNext] + lists[oldNext].get(j) ] = newNext;
                put(newNext, lists[oldNext].get(j));
            }
            lists[oldNext] = null;
        }
        base[oldNext] = 0; check[oldNext] = 0;
    }
    base[ base[node1] + newNodeValue ] = -position;
    check[ base[node1] + newNodeValue ] = node1;
    put(node1, newNodeValue);
    return node;
}

If you still have doubts, I recommend reading the original or translated version of An Efficient Implementation of Trie Structures.

search

Search is very simple: it follows the transition rule from the beginning of the article and falls back to a comparison against tail once an end-point is reached.

public boolean search(int[] key) {
    int pre = 1;
    for (int i = 0; i < key.length; i++) {
        if (base[pre] < 0) {
            return compareTail(-base[pre], i, key);
        } else if (base[pre] > 0) {
            if (check[ base[pre] + key[i] ] == pre) {
                pre = base[pre] + key[i];
            } else {
                return false;
            }
        } else return false;
    }
    return true;
}
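The compareTail helper is not shown in the article. A minimal sketch of what it might do follows. It assumes that keys are int arrays ending in END_FLAG, that the value of END_FLAG is 1, and that tail is passed in explicitly (so the signature differs from the three-argument call above); all of these are assumptions for illustration.

```java
// Illustrative sketch of compareTail; not the code from the test source.
class CompareTailSketch {
    static final int END_FLAG = 1;   // assumed value of the end marker

    // Compares tail[tailPos..] with key[keyPos..] up to and including the end marker.
    static boolean compareTail(int[] tail, int tailPos, int keyPos, int[] key) {
        int t = tailPos, k = keyPos;
        while (t < tail.length && k < key.length) {
            if (tail[t] != key[k]) return false;
            if (key[k] == END_FLAG) return true;   // both reached the end marker together
            t++;
            k++;
        }
        return false;   // ran out of input before a matching end marker
    }

    // Turns a string into character codes followed by END_FLAG.
    static int[] encode(String s) {
        int[] out = new int[s.length() + 1];
        for (int i = 0; i < s.length(); i++) out[i] = s.charAt(i);
        out[s.length()] = END_FLAG;
        return out;
    }

    public static void main(String[] args) {
        int[] tail = new int[16];
        int[] suffix = encode("chelor");
        System.arraycopy(suffix, 0, tail, 5, suffix.length);   // suffix stored at tail[5]
        // "ba" was matched through nodes, so comparison starts at key index 2
        System.out.println(compareTail(tail, 5, 2, encode("bachelor")));
        System.out.println(compareTail(tail, 5, 2, encode("bachelors")));
    }
}
```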

delete

To delete a word, you just need to find its last node and release it.

public boolean delete(String key) {
    int []keyValue = string2IntArray(key);
    int pre = 1;
    int index = -1;
    int tempVal;
    int next;
    do {
        index++;
        tempVal = keyValue[index];
        next = base[pre] + tempVal;
        if (check[next] != pre)  {
            return false;
        }
        if (base[next] < 0) break;
        pre = next;
    } while (true);
    if (tempVal == END_FLAG || compareTail(-base[next], index+1, keyValue)) {
        for (int i = 0; i < lists[pre].size(); i++) {
            if (lists[pre].get(i) == tempVal) {
                lists[pre].remove(i);break;
            }
        }
        base[next] = 0; check[next] = 0;
        //info(String.format("%s next[%d] turn to 0",key, next));
        return true;
    }
    return false;
}

usage

The most common uses of a dictionary tree are prefix matching and prefix suggestions. Once the tree has been built, we walk to the node reached by the input prefix and collect every string that can be reached from that node. A depth-first traversal is used here.

private void find( int pre, StringBuilder builder, List<String> list) {
    int next;
    if (base[pre] < 0) {
        builder.append(readTail(-base[pre]));
        list.add(builder.toString());
        return;
    }
    for (int i = 0; i < lists[pre].size(); i++) {
        next = base[pre] + lists[pre].get(i);
        StringBuilder reserved = new StringBuilder(builder.toString());
        if (check[next] == pre) {
            if (lists[pre].get(i) == END_FLAG) {
                find(next, builder, list);
            } else {
                find(next, builder.append((char) lists[pre].get(i).intValue()), list);
            }
        }
        builder = reserved;
    }
}

summary

Remember that one of the drawbacks of this structure mentioned earlier is the severe fragmentation of the tail array during construction? The reason is that resolving the first kind of conflict keeps shifting the contents of tail. With bachelor and badge, for example, the block that originally held achelor is moved each time a character is pulled out of it, and the slot freed at its end is left unused.
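The effect is easy to see with a sketch of moveTail. This follows the description given earlier (shift everything from position x up to the end character one place to the left); the '#' end marker and the '_' used to mark the leftover slot are assumptions for this example.

```java
// Illustrative sketch of moveTail and the hole it leaves behind.
class MoveTailSketch {
    static final char END = '#';    // assumed end marker
    static final char HOLE = '_';   // marks a slot that is no longer used

    // Shifts tail[x+1 .. end marker] one place to the left, dropping tail[x].
    static void moveTail(char[] tail, int x) {
        if (tail[x] == END) return;   // nothing left to drop
        int i = x;
        while (tail[i] != END) {
            tail[i] = tail[i + 1];
            i++;
        }
        tail[i] = HOLE;   // the old position of the end marker is now a gap
    }

    public static void main(String[] args) {
        char[] tail = "achelor#".toCharArray();
        moveTail(tail, 0);                       // 'a' has become a node in base/check
        System.out.println(new String(tail));    // prints chelor#_
    }
}
```

Every character moved from tail into the tree leaves one such gap, and nothing in this scheme reuses it, so the gaps accumulate as more words are inserted.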

I saw a comment in a forum asking how practical this approach is when insertion runs into conflicts so often. In fact, it does not matter much if insertion is slow, because a dictionary tree is used mainly for its search efficiency. We can build the tree once and persist the value of every node, then on later runs read the arrays straight into memory and use them directly. The build only has to happen once.
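As a rough sketch of that build-once, load-later idea, the arrays can be written with a length prefix and read back in the same order. Everything here is an assumption for illustration: the class name, the int[] type of tail, and the choice to put all three arrays into one byte stream (they could just as well go into separate files, as described earlier).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative sketch: persist base, check and tail, then load them again.
class TriePersistence {
    private static void writeArray(DataOutputStream out, int[] a) throws IOException {
        out.writeInt(a.length);            // length prefix
        for (int v : a) out.writeInt(v);
    }

    private static int[] readArray(DataInputStream in) throws IOException {
        int[] a = new int[in.readInt()];
        for (int i = 0; i < a.length; i++) a[i] = in.readInt();
        return a;
    }

    static byte[] save(int[] base, int[] check, int[] tail) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            writeArray(out, base);
            writeArray(out, check);
            writeArray(out, tail);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns { base, check, tail } in the order they were written.
    static int[][] load(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int[] base = readArray(in);
            int[] check = readArray(in);
            int[] tail = readArray(in);
            return new int[][] { base, check, tail };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] stored = save(new int[] {0, 1, -2}, new int[] {0, 1, 1}, new int[] {99, 100});
        int[][] arrays = load(stored);
        System.out.println(arrays[0].length + " " + arrays[1].length + " " + arrays[2].length);
    }
}
```

Swapping the in-memory byte streams for file streams would give the offline storage described above.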

Test source code: ReducedTrie

Posted by YappyDog on Mon, 15 Apr 2019 13:54:33 -0700

Programmer Group