AC automaton based on a double-array trie (with an explanation of the Java code)

Keywords: data structure

AC Automaton Based on a Double-Array Trie

We covered the AC automaton in an earlier post. In practice, however, when the dictionary trie is very large, the original AC automaton becomes slow to query. An AC automaton built on a double-array trie makes up for this weakness.

Below, we use hankcs's AhoCorasickDoubleArrayTrie code to explain how to build an AC automaton on a double-array trie and how to query it.

Building the Double-Array Trie AC Automaton

Constructing a double-array trie AC automaton takes three steps:

  1. Build the trie
  2. Build the double array from the trie
  3. Build the fail and output tables that the AC automaton needs
public void build(Map<String, V> map)
        {
            // Save the values; l will hold the length of each keyword
            v = (V[]) map.values().toArray();
            l = new int[v.length];
            Set<String> keySet = map.keySet();
            // Step 1: build the ordinary (tree-shaped) trie
            addAllKeyword(keySet);
            // Step 2: build the double-array trie from the ordinary trie
            buildDoubleArrayTrie(keySet.size());
            used = null;
            // Step 3: build the failure table and merge the output table
            constructFailureStates();
            rootState = null;
            loseWeight();
        }

Building the trie

Here addAllKeyword(keySet) builds the trie in the same way as the ordinary AC automaton: each keyword must be recorded in emits at its tail node. However, the double-array trie AC automaton ultimately stores only arrays, not a tree structure. The values associated with the keywords are therefore stored in the array v, and what is added to emits is the keyword's index in v rather than the keyword itself. The array l records the length of each keyword.

private void addKeyword(String keyword, int index)
        {
            State currentState = this.rootState;
            for (Character character : keyword.toCharArray())
            {
                currentState = currentState.addState(character); // add (or reuse) the trie node for this character
            }
            currentState.addEmit(index); // record the keyword's index in emits at its tail node
            l[index] = keyword.length(); // remember the keyword's length
        }
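The idea of storing indexes rather than strings in emits can be sketched in isolation. The following is a minimal sketch, not the library's code: the class TrieSketch, its nested State, and the walk helper are hypothetical names, and for simplicity the array v holds the keywords themselves instead of the values of a map.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: a plain trie whose tail nodes store keyword indexes.
class TrieSketch {
    static class State {
        final Map<Character, State> success = new TreeMap<>();
        final Set<Integer> emits = new TreeSet<>();

        // Return the child for c, creating it if it does not exist yet.
        State addState(char c) {
            return success.computeIfAbsent(c, k -> new State());
        }

        void addEmit(int index) {
            emits.add(index);
        }
    }

    final State root = new State();
    final String[] v; // assumption: v holds the keywords themselves
    final int[] l;    // l[i] is the length of keyword i

    TrieSketch(List<String> keywords) {
        v = keywords.toArray(new String[0]);
        l = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            State current = root;
            for (char c : v[i].toCharArray()) {
                current = current.addState(c);
            }
            current.addEmit(i);   // store the index, not the string
            l[i] = v[i].length();
        }
    }

    // Follow s from the root; returns null if the path does not exist.
    State walk(String s) {
        State current = root;
        for (char c : s.toCharArray()) {
            current = current.success.get(c);
            if (current == null) return null;
        }
        return current;
    }

    public static void main(String[] args) {
        TrieSketch t = new TrieSketch(Arrays.asList("he", "she", "hers"));
        System.out.println(t.walk("she").emits); // indexes stored at the tail node of "she"
        System.out.println(t.v[1]);              // the keyword recovered through v
    }
}
```

Keeping only integers in the nodes is what later allows emits to be copied into a plain int array.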

Building the double array

  • As in the double-array trie, an outer loop first searches for a value begin that works for the current node tCurrent and all of its child nodes (siblings). It must satisfy:
  1. begin = base[tCurrent]
  2. For every child node sibling, the slot begin + code(sibling) is not occupied
outer:
            while (true)
            {
                pos++; // move on to the next candidate slot whenever the previous one caused a conflict

                if (allocSize <= pos)
                    resize(pos + 1);

                if (check[pos] != 0)
                {
                    nonzero_num++;
                    continue;
                }
                else if (first == 0)
                {
                    nextCheckPos = pos;
                    first = 1;
                }

                begin = pos - siblings.get(0).getKey(); // begin is the candidate base value for tCurrent
                if (allocSize <= (begin + siblings.get(siblings.size() - 1).getKey()))
                {
                    // progress can be zero; the +1 prevents a divide-by-zero error
                    double toSize = Math.max(1.05, 1.0 * keySize / (progress + 1)) * allocSize;
                    int maxSize = (int) (Integer.MAX_VALUE * 0.95);
                    if (allocSize >= maxSize) throw new RuntimeException("Double array trie is too big.");
                    else resize((int) Math.min(toSize, maxSize));
                }

                if (used[begin])
                    continue;

                for (int i = 1; i < siblings.size(); i++)
                    if (check[begin + siblings.get(i).getKey()] != 0) // nonzero: the slot begin + siblings.get(i).getKey() is already occupied
                        continue outer;

                break;
            }

Note: The code tests check[begin + siblings.get(i).getKey()] to decide whether the slot begin + code(sibling) is occupied. A nonzero value means the slot is already taken.
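The search for begin can be sketched on its own. This is a simplified sketch, not the library's loop: BeginSearchSketch, findBegin, and place are hypothetical names, the arrays have a fixed size (no resize), and the nextCheckPos and nonzero_num optimizations are left out.

```java
// Hypothetical sketch of the search for begin over fixed-size arrays.
class BeginSearchSketch {
    // codes must be sorted ascending; returns -1 if no begin fits in the arrays.
    static int findBegin(int[] check, boolean[] used, int[] codes, int startPos) {
        int last = codes[codes.length - 1];
        for (int pos = Math.max(codes[0] + 1, startPos); pos < check.length; pos++) {
            if (check[pos] != 0) continue;                  // slot for the first child is occupied
            int begin = pos - codes[0];
            if (begin + last >= check.length) return -1;    // the real code resizes here
            if (used[begin]) continue;                      // begin already belongs to another parent
            boolean free = true;
            for (int i = 1; i < codes.length; i++) {
                if (check[begin + codes[i]] != 0) {         // slot for another child is occupied
                    free = false;
                    break;
                }
            }
            if (free) return begin;
        }
        return -1;
    }

    // Find a begin, then claim it: mark it used and write check for every child.
    static int place(int[] check, boolean[] used, int[] codes) {
        int begin = findBegin(check, used, codes, 0);
        if (begin < 0) return -1;
        used[begin] = true;
        for (int code : codes) {
            check[begin + code] = begin;
        }
        return begin;
    }

    public static void main(String[] args) {
        int[] check = new int[16];
        boolean[] used = new boolean[16];
        System.out.println(place(check, used, new int[]{1, 2})); // children with codes 1 and 2
        System.out.println(place(check, used, new int[]{2}));    // one child with code 2
        System.out.println(place(check, used, new int[]{0}));    // one child with code 0
    }
}
```

With empty arrays, the three calls in main pick begin values 1, 2, and 5: the third call skips begin = 1 because it is already used, and skips slots 2 to 4 because they are occupied.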

  • Once the base value of the parent node tCurrent has been found, set the check value of every child node in siblings to that value (begin, which becomes base[tCurrent]):
for (Map.Entry<Integer, State> sibling : siblings)
     {
          check[begin + sibling.getKey()] = begin;
     }

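The double-array trie is usually described with check[child_index] = father_index. Here, the used[] test in the loop above keeps two parent nodes from sharing the same begin, so base[father_index] identifies the parent uniquely and check[child_index] can be set directly to base[father_index].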

  • Next, each child node in siblings is handled in one of two ways:
  1. If sibling has no children of its own (fetch() returns 0), it is a leaf: instead of a real base value, its base is set to the negative number -id - 1, where id is the value returned by getLargestValueId()
  2. If sibling has children, it is added to siblingQueue and the procedure above is repeated for it: its base value is found by checking that the slots of its own children do not conflict
            for (Map.Entry<Integer, State> sibling : siblings)
            {
                List<Map.Entry<Integer, State>> new_siblings = new ArrayList<Map.Entry<Integer, State>>(sibling.getValue().getSuccess().entrySet().size() + 1);

                if (fetch(sibling.getValue(), new_siblings) == 0) // no children: sibling is a leaf node
                {
                    base[begin + sibling.getKey()] = (-sibling.getValue().getLargestValueId() - 1);
                    progress++;
                }
                else // sibling has children: queue it so the loop above is repeated for it
                {
                    siblingQueue.add(new AbstractMap.SimpleEntry<Integer, List<Map.Entry<Integer, State>>>(begin + sibling.getKey(), new_siblings));
                }
                sibling.getValue().setIndex(begin + sibling.getKey()); // sibling's index is now fixed
            }

Note: At this point the index of every child of tCurrent is known, whether or not the child is a leaf. That is why sibling.getValue().setIndex(begin + sibling.getKey()) is called at the end of the loop body.
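The leaf encoding used above, base = -id - 1, is easy to check in isolation. The sketch below uses hypothetical helper names (encodeLeaf, isLeaf, decodeLeaf). The decoding step is simply the inverse arithmetic and is an assumption about how a reader of the array would recover the id.

```java
// Hypothetical helpers illustrating the leaf encoding base = -id - 1.
class LeafEncodingSketch {
    // Mirrors the expression (-getLargestValueId() - 1) from the code above.
    static int encodeLeaf(int valueId) {
        return -valueId - 1;
    }

    // A leaf is recognizable because its base is negative.
    static boolean isLeaf(int baseValue) {
        return baseValue < 0;
    }

    // Inverse of encodeLeaf.
    static int decodeLeaf(int baseValue) {
        return -baseValue - 1;
    }

    public static void main(String[] args) {
        int encoded = encodeLeaf(0);
        System.out.println(encoded);             // id 0 becomes -1, not 0
        System.out.println(decodeLeaf(encoded)); // back to 0
    }
}
```

The extra -1 matters for id 0: plain negation would give 0, which could not be told apart from an unset slot.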

  • The final step is to set the base value of tCurrent to begin:
	Integer parentBaseIndex = tCurrent.getKey();
	if (parentBaseIndex != null)
	{
    	base[parentBaseIndex] = begin;
	}

Build fail and output

The search for each node's fail node works the same way as in the ordinary AC automaton. The only difference is that both fail and output must be stored as arrays.

  • fail array:

Suppose a node has index i and its fail node has index j. Then:
fail[i] = j

    public void setFailure(State failState, int fail[])
    {
        this.failure = failState;
        fail[index] = failState.index;
    }
  • output array:

The output array stores the emits of the node at each index:
output[State.index] = State.emits
Since each State.emits is itself a one-dimensional collection of indexes, output is a two-dimensional array.

        private void constructOutput(State targetState)
        {
            Collection<Integer> emit = targetState.emit();
            if (emit == null || emit.size() == 0) return;
            int[] output = new int[emit.size()];
            Iterator<Integer> it = emit.iterator();
            for (int i = 0; i < output.length; ++i)
            {
                output[i] = it.next();
            }
            AhoCorasickDoubleArrayTrie.this.output[targetState.getIndex()] = output;
        }
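The two tables can be seen together in a small standalone sketch. It is not the library's code: FailOutputSketch is a hypothetical class, it assumes keywords made only of lowercase a-z, and it uses a dense transition table next in place of the double array so that the example stays short. State numbers are assigned in creation order, with the root as 0.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: fail[] and output[] built by breadth-first search.
class FailOutputSketch {
    static final int ALPHABET = 26;                        // assumption: lowercase a-z only
    final List<int[]> next = new ArrayList<>();            // next.get(s)[c] = child state or -1
    final List<List<Integer>> emits = new ArrayList<>();   // keyword indexes per state
    int[] fail;
    int[][] output;

    FailOutputSketch(String... keywords) {
        newState(); // root is state 0
        for (int i = 0; i < keywords.length; i++) {
            int s = 0;
            for (char ch : keywords[i].toCharArray()) {
                int c = ch - 'a';
                if (next.get(s)[c] == -1) next.get(s)[c] = newState();
                s = next.get(s)[c];
            }
            emits.get(s).add(i);
        }
        build();
    }

    private int newState() {
        int[] row = new int[ALPHABET];
        Arrays.fill(row, -1);
        next.add(row);
        emits.add(new ArrayList<>());
        return next.size() - 1;
    }

    private void build() {
        fail = new int[next.size()];
        output = new int[next.size()][];
        Deque<Integer> queue = new ArrayDeque<>();
        for (int c = 0; c < ALPHABET; c++) {               // depth-1 states fail to the root
            int child = next.get(0)[c];
            if (child != -1) queue.add(child);
        }
        while (!queue.isEmpty()) {
            int s = queue.poll();
            for (int c = 0; c < ALPHABET; c++) {
                int t = next.get(s)[c];
                if (t == -1) continue;
                int f = fail[s];
                while (f != 0 && next.get(f)[c] == -1) f = fail[f]; // climb the fail links
                int nf = next.get(f)[c];
                fail[t] = (nf == -1) ? 0 : nf;             // fail[i] = j
                emits.get(t).addAll(emits.get(fail[t]));   // merge the fail node's emits
                queue.add(t);
            }
        }
        for (int s = 0; s < output.length; s++) {          // output[index] = emits as int[]
            List<Integer> e = emits.get(s);
            if (!e.isEmpty()) output[s] = e.stream().mapToInt(Integer::intValue).toArray();
        }
    }

    public static void main(String[] args) {
        FailOutputSketch ac = new FailOutputSketch("he", "she", "his", "hers");
        System.out.println(Arrays.toString(ac.fail));
        System.out.println(Arrays.toString(ac.output[5])); // state 5 is the tail of "she"
    }
}
```

For the keywords he, she, his, hers, the tail node of "she" (state 5) fails to the tail node of "he" (state 2), so its output row holds both keyword indexes.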

Querying the Double-Array Trie AC Automaton

  • text is the text to query. Suppose the automaton is at node currentState and the text pointer is at position i. getState() first looks for text(i) among the children of currentState and, if it is found, returns the index of that child. If it is not found, getState() follows the fail link and looks among the children of the fail node, repeating until a transition succeeds:
    private int getState(int currentState, char character)
    {
        int newCurrentState = transitionWithRoot(currentState, character);  // first try the success transition
        while (newCurrentState == -1) // if that fails, follow the failure links
        {
            currentState = fail[currentState];
            newCurrentState = transitionWithRoot(currentState, character);
        }
        return newCurrentState;
    }
  • Because we now store arrays instead of a trie, looking for character c among the children of currentState starts by computing the target position p = base[currentState] + code(c), where code(c) = c + 1 in this implementation. The transition is valid only if check[p] = base[currentState]; in that case p is returned as the new state. Otherwise the function returns -1, except at the root (nodePos == 0), where it returns 0 so that the automaton stays at the root:
    protected int transitionWithRoot(int nodePos, char c)
    {
        int b = base[nodePos];
        int p;

        p = b + c + 1; // p is the index reached from the node at nodePos via character c
        if (b != check[p]) // the node at p must have the node at nodePos as its parent
        {
            if (nodePos == 0) return 0;
            return -1;
        }

        return p;
    }
  • With the index of the new state, the output[] array gives the emits of that node. Each emit is the index of a keyword, so the matched value can be recovered from the array v[], and the start of the hit is position - l[hit]:
    private void storeEmits(int position, int currentState, List<Hit<V>> collectedEmits)
    {
        int[] hitArray = output[currentState]; // emits of the current state, looked up through output
        if (hitArray != null)
        {
            for (int hit : hitArray)
            {
                collectedEmits.add(new Hit<V>(position - l[hit], position, v[hit])); // recover the value through array v
            }
        }
    }
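The three query functions can be traced end to end on a tiny hand-built automaton. The sketch below is illustrative only: QuerySketch and parseText are hypothetical names, the arrays were worked out by hand for the keywords "ab" and "b", and it assumes code(c) = c - 'a' + 1 (instead of c + 1) so that the arrays stay small. Slots 5 and 6 are the leaf slots with negative base values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: querying hand-built arrays for the keywords "ab" and "b".
class QuerySketch {
    // Assumption: code(c) = c - 'a' + 1, so 'a' -> 1 and 'b' -> 2.
    // States: 0 = root, 2 = "a", 3 = "b", 4 = "ab"; 5 and 6 are leaf slots.
    static final int[] base  = {1, 0, 2, 5, 6, -2, -1, 0, 0};
    static final int[] check = {0, 0, 1, 1, 2,  5,  6, 0, 0};
    static final int[] fail  = {0, 0, 0, 0, 3,  0,  0, 0, 0};
    static final int[][] output = {null, null, null, {1}, {0, 1}, null, null, null, null};
    static final String[] v = {"ab", "b"};
    static final int[] l = {2, 1};

    static int transitionWithRoot(int nodePos, char c) {
        int b = base[nodePos];
        int code = c - 'a' + 1;
        int p = b + code;
        // The range checks are additions for this sketch; the parent test is b == check[p].
        if (code < 1 || p >= check.length || b != check[p]) {
            return nodePos == 0 ? 0 : -1;
        }
        return p;
    }

    static int getState(int currentState, char c) {
        int next = transitionWithRoot(currentState, c);
        while (next == -1) {                   // follow fail links until a transition works
            currentState = fail[currentState];
            next = transitionWithRoot(currentState, c);
        }
        return next;
    }

    // Returns hits formatted as "begin:end=value".
    static List<String> parseText(String text) {
        List<String> hits = new ArrayList<>();
        int state = 0;
        for (int i = 0; i < text.length(); i++) {
            state = getState(state, text.charAt(i));
            int position = i + 1;
            int[] hitArray = output[state];
            if (hitArray == null) continue;
            for (int hit : hitArray) {
                hits.add((position - l[hit]) + ":" + position + "=" + v[hit]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(parseText("abb")); // [0:2=ab, 1:2=b, 2:3=b]
    }
}
```

On the text "abb", the third character has no transition from the state for "ab", so getState falls back through fail[4] = 3 and then to the root before it reaches the state for "b".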

Posted by zoobooboozoo on Mon, 06 Dec 2021 10:18:18 -0800