AC Automaton Based on a Double-Array Trie
We have already covered the AC (Aho-Corasick) automaton. In practice, however, when the dictionary trie to be built is very large, the original AC automaton becomes slow to query. An AC automaton built on a double-array trie makes up for this.
Below, we use the code of hankcs's AhoCorasickDoubleArrayTrie to explain how to build an AC automaton on a double-array trie and how to query it.
Building the Double-Array Trie AC Automaton
Constructing a double-array trie AC automaton takes three steps:
- Build the trie
- Construct the double array from the trie
- Build the fail and output tables that the AC automaton requires
    public void build(Map<String, V> map) {
        // Save the values
        v = (V[]) map.values().toArray();
        l = new int[v.length];
        Set<String> keySet = map.keySet();
        // Build the trie
        addAllKeyword(keySet);
        // Build the double-array trie from the trie
        buildDoubleArrayTrie(keySet.size());
        used = null;
        // Build the failure table and merge the output table
        constructFailureStates();
        rootState = null;
        loseWeight();
    }
Building the Trie
Here, addAllKeyword(keySet) builds a trie in the same way as the ordinary AC automaton: each keyword must be recorded in emits at its tail node. However, because the double-array trie AC automaton stores only arrays and not a tree structure, the values are kept in the array v, and what is added to emits is the keyword's index in v. The array l records the length of each keyword under the same index.
    private void addKeyword(String keyword, int index) {
        State currentState = this.rootState;
        for (Character character : keyword.toCharArray()) {
            currentState = currentState.addState(character); // add a node to the trie
        }
        currentState.addEmit(index); // the tail node records the keyword's index in emits
        l[index] = keyword.length();
    }
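The idea of storing an index rather than the keyword itself can be tried out in isolation. The following is a minimal sketch, not the library's State class: IndexTrieSketch, Node and find are hypothetical names, and the children are simply kept in a TreeMap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical stand-alone trie whose tail nodes store keyword indices only. */
final class IndexTrieSketch {
    static final class Node {
        final TreeMap<Character, Node> children = new TreeMap<>();
        final List<Integer> emits = new ArrayList<>(); // indices into the keyword list
    }

    final Node root = new Node();
    final int[] lengths; // lengths[i] = length of the i-th keyword, like l[] in the text

    IndexTrieSketch(List<String> keywords) {
        lengths = new int[keywords.size()];
        for (int i = 0; i < keywords.size(); i++) {
            add(keywords.get(i), i);
        }
    }

    private void add(String keyword, int index) {
        Node current = root;
        for (char c : keyword.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new Node());
        }
        current.emits.add(index);          // only the index is stored at the tail node
        lengths[index] = keyword.length(); // the length is kept in a parallel array
    }

    /** Returns the node reached by word, or null if the path does not exist. */
    Node find(String word) {
        Node current = root;
        for (char c : word.toCharArray()) {
            current = current.children.get(c);
            if (current == null) return null;
        }
        return current;
    }
}
```

With the keywords "he", "she" and "hers", the node reached by "she" carries the single emit 1, while the node reached by "her" has no emits because no keyword ends there.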
Building the Double Array
- As described in the double-array trie article, an outer loop first finds a begin that satisfies the following for the current node tCurrent and all of its children, siblings:
  - begin = base[tCurrent]
  - For every child sibling, the slot begin + code(sibling) is not occupied
    outer:
    while (true) {
        pos++; // pos advances by one whenever some slot begin + code(sibling) is occupied
        if (allocSize <= pos)
            resize(pos + 1);
        if (check[pos] != 0) {
            nonzero_num++;
            continue;
        } else if (first == 0) {
            nextCheckPos = pos;
            first = 1;
        }
        begin = pos - siblings.get(0).getKey(); // begin is the candidate base value of tCurrent
        if (allocSize <= (begin + siblings.get(siblings.size() - 1).getKey())) {
            // progress can be zero, so add 1 to avoid a divide-by-zero error
            double toSize = Math.max(1.05, 1.0 * keySize / (progress + 1)) * allocSize;
            int maxSize = (int) (Integer.MAX_VALUE * 0.95);
            if (allocSize >= maxSize) throw new RuntimeException("Double array trie is too big.");
            else resize((int) Math.min(toSize, maxSize));
        }
        if (used[begin])
            continue;
        for (int i = 1; i < siblings.size(); i++)
            if (check[begin + siblings.get(i).getKey()] != 0) // slot begin + code(sibling) is already occupied
                continue outer;
        break;
    }
Note: The code tests check[begin + siblings.get(i).getKey()] to decide whether the slot begin + code(sibling) is occupied. A nonzero value means the slot is taken.
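The search for begin can be reduced to its core as follows. This is a simplified sketch under stated assumptions, not the library's loop: BeginSearchSketch and findBegin are hypothetical names, the arrays are never resized (slots past the end count as free), and the used[begin] bookkeeping and the nextCheckPos optimization are left out.

```java
/** Hypothetical, simplified version of the begin search. */
final class BeginSearchSketch {
    /**
     * Returns the smallest begin >= 1 such that check[begin + code] == 0
     * for every child code, i.e. no child slot is occupied.
     */
    static int findBegin(int[] check, int[] codes) {
        for (int begin = 1; ; begin++) {
            boolean free = true;
            for (int code : codes) {
                int slot = begin + code;
                // Assumption: slots beyond the array are free; the real code resizes instead.
                if (slot < check.length && check[slot] != 0) {
                    free = false; // conflict, try the next begin
                    break;
                }
            }
            if (free) return begin;
        }
    }
}
```

For example, with child codes {1, 2} and only slot 3 occupied, begin = 1 and begin = 2 both collide with slot 3, so the search settles on begin = 3, which places the children in slots 4 and 5.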
- Once the base value of the parent node tCurrent has been found, set the check value of every child in siblings to base[tCurrent]:
    for (Map.Entry<Integer, State> sibling : siblings) {
        check[begin + sibling.getKey()] = begin;
    }
Although the double-array trie article describes check[child_index] = father_index, base is monotonic (and the used[begin] test in the loop above skips any begin that is already taken), so check[child_index] can be set directly to base[father_index].
- Next, each child in siblings is handled in one of two ways:
  - If sibling is a leaf (it has no children of its own), its base value is set to -getLargestValueId() - 1, a negative number that encodes the index of the keyword ending there
  - If sibling is not a leaf, it is added to siblingQueue and the procedure above is repeated for it: its base value is determined by checking whether the slots of its own children conflict
    for (Map.Entry<Integer, State> sibling : siblings) {
        List<Map.Entry<Integer, State>> new_siblings =
                new ArrayList<Map.Entry<Integer, State>>(sibling.getValue().getSuccess().entrySet().size() + 1);
        if (fetch(sibling.getValue(), new_siblings) == 0) {
            // sibling has no children, so it is a leaf node
            base[begin + sibling.getKey()] = (-sibling.getValue().getLargestValueId() - 1);
            progress++;
        } else {
            // sibling has children: queue it so that the loop above is repeated for it
            siblingQueue.add(new AbstractMap.SimpleEntry<Integer, List<Map.Entry<Integer, State>>>(begin + sibling.getKey(), new_siblings));
        }
        sibling.getValue().setIndex(begin + sibling.getKey()); // the sibling's index is now fixed
    }
Note: The index of every child of tCurrent is known at this point, whether or not the child is a leaf, which is why the loop ends with sibling.getValue().setIndex(begin + sibling.getKey()).
- The final step is to set the base value of tCurrent to begin:
    Integer parentBaseIndex = tCurrent.getKey();
    if (parentBaseIndex != null) {
        base[parentBaseIndex] = begin;
    }
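Taken together, the check and base assignments for one parent node can be sketched as below. PlaceSiblingsSketch and place are hypothetical names, the child codes are small illustrative integers rather than real character codes, and begin is assumed to have been found conflict-free beforehand.

```java
/** Hypothetical sketch: record one parent's children in base/check. */
final class PlaceSiblingsSketch {
    /**
     * Writes the children of the node at parentIndex into the arrays.
     * Returns the slot assigned to each child, in the order of childCodes.
     */
    static int[] place(int[] base, int[] check, int parentIndex, int begin, int[] childCodes) {
        int[] childIndex = new int[childCodes.length];
        for (int i = 0; i < childCodes.length; i++) {
            childIndex[i] = begin + childCodes[i];
            check[childIndex[i]] = begin; // each child remembers its parent's base value
        }
        base[parentIndex] = begin; // the parent's base is written last
        return childIndex;
    }
}
```

After place(base, check, 0, 1, {1, 3}), the root's children sit in slots 2 and 4. Placing a child with code 2 under slot 2 with begin = 4 puts it in slot 6, and the lookup condition check[base[2] + 2] == base[2] then holds.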
Building fail and output
The fail node of each state is found in the same way as in the ordinary AC automaton. The only difference is that both fail and output must be stored as arrays.
- fail array:
Suppose a node has index i and its fail node has index j. Then:
$$\text{fail}[i] = j$$
    public void setFailure(State failState, int[] fail) {
        this.failure = failState;
        fail[index] = failState.index;
    }
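To see where the values of j come from, here is a minimal sketch of the standard breadth-first fail computation on a plain trie whose nodes are identified by integer indices. It is an illustration only: FailArraySketch is a hypothetical class, and its node numbering (insertion order, root = 0) is not the double-array index the library uses.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical index-based trie that produces fail[i] = j. */
final class FailArraySketch {
    // children.get(i) maps a character to the index of that child of node i; node 0 is the root.
    final List<Map<Character, Integer>> children = new ArrayList<>();

    FailArraySketch() {
        children.add(new HashMap<>()); // root
    }

    void add(String word) {
        int node = 0;
        for (char c : word.toCharArray()) {
            Integer next = children.get(node).get(c);
            if (next == null) {
                next = children.size();
                children.add(new HashMap<>());
                children.get(node).put(c, next);
            }
            node = next;
        }
    }

    /** Breadth-first construction; the root and its direct children fail to the root. */
    int[] buildFail() {
        int[] fail = new int[children.size()];
        Queue<Integer> queue = new ArrayDeque<>(children.get(0).values());
        while (!queue.isEmpty()) {
            int node = queue.remove();
            for (Map.Entry<Character, Integer> e : children.get(node).entrySet()) {
                char c = e.getKey();
                int child = e.getValue();
                int f = fail[node];
                // Follow fail links until a state with a c-transition is found, or the root is reached.
                while (f != 0 && !children.get(f).containsKey(c)) {
                    f = fail[f];
                }
                Integer target = children.get(f).get(c);
                fail[child] = (target == null) ? 0 : target;
                queue.add(child);
            }
        }
        return fail;
    }
}
```

With the keywords "he", "she" and "hers" added in that order, the node for "sh" (index 4) fails to "h" (index 1), "she" (index 5) fails to "he" (index 2), and "hers" (index 7) fails to "s" (index 3).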
- output array:
The output array holds the emits of the node at each index:
$$\text{output}[\text{State.index}] = \text{State.emits}$$
Since State.emits itself is a one-dimensional array, output is a two-dimensional array.
    private void constructOutput(State targetState) {
        Collection<Integer> emit = targetState.emit();
        if (emit == null || emit.size() == 0) return;
        int[] output = new int[emit.size()];
        Iterator<Integer> it = emit.iterator();
        for (int i = 0; i < output.length; ++i) {
            output[i] = it.next();
        }
        AhoCorasickDoubleArrayTrie.this.output[targetState.getIndex()] = output;
    }
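The build method's comment mentions merging the output table while the failure table is built. A minimal sketch of that merge, assuming the fail array and a breadth-first node order are already available, might look like this. OutputArraySketch, buildOutput and bfsOrder are hypothetical names, and storing each row in ascending order is an assumption of the sketch.

```java
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch: merge emits along fail links into a jagged int[][]. */
final class OutputArraySketch {
    /**
     * emits.get(i) holds the keyword indices that end exactly at node i.
     * bfsOrder lists the non-root nodes breadth-first, so fail[n] is complete before n.
     */
    static int[][] buildOutput(List<TreeSet<Integer>> emits, int[] fail, int[] bfsOrder) {
        int[][] output = new int[emits.size()][];
        for (int node : bfsOrder) {
            // A keyword ending at the fail node is a suffix of this node's path, so it matches here too.
            emits.get(node).addAll(emits.get(fail[node]));
            if (emits.get(node).isEmpty()) continue; // row stays null when there is nothing to emit
            int[] row = new int[emits.get(node).size()];
            int k = 0;
            for (int id : emits.get(node)) row[k++] = id;
            output[node] = row;
        }
        return output;
    }
}
```

For the keywords "he" (0), "she" (1) and "hers" (2), the state for "she" ends up with the row {0, 1}, because "he" is a suffix of "she". States without any emits keep a null row, which is why the query code has to test for null before iterating.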
Querying the Double-Array Trie AC Automaton
- Let text be the text to search. Suppose the automaton is in state currentState and has reached position i of the text. getState() first checks whether currentState has a child for the character text(i); if so, it returns that child's index. Otherwise, it moves to the fail node of the current state and looks among that node's children, repeating until a transition succeeds:
    private int getState(int currentState, char character) {
        int newCurrentState = transitionWithRoot(currentState, character); // try the success transition first
        while (newCurrentState == -1) { // if it fails, follow the failure links
            currentState = fail[currentState];
            newCurrentState = transitionWithRoot(currentState, character);
        }
        return newCurrentState;
    }
- Because we now store arrays instead of a trie, looking up the character c among the children of currentState starts by computing the target slot base[currentState] + code(c). The transition is valid only if check[base[currentState] + code(c)] = base[currentState]; if it is, the slot base[currentState] + code(c) is returned. In the code, code(c) is c + 1, and a failed transition from the root (nodePos == 0) returns the root itself instead of -1, which guarantees that the loop in getState() terminates:
    protected int transitionWithRoot(int nodePos, char c) {
        int b = base[nodePos];
        int p;
        p = b + c + 1; // p is the slot reached from the node at nodePos by character c
        if (b != check[p]) { // the node at p is not a child of the node at nodePos
            if (nodePos == 0) return 0;
            return -1;
        }
        return p;
    }
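The base/check lookup can be traced on a tiny hand-built pair of arrays. The sketch below is illustrative only: TransitionSketch is a hypothetical class, the slot numbers were picked by hand for the single keyword "ab", and a bounds check that the library code does not have is added so the sketch cannot index outside its small arrays.

```java
/** Hypothetical hand-built double array for the single keyword "ab". */
final class TransitionSketch {
    final int[] base = new int[128];
    final int[] check = new int[128];

    TransitionSketch() {
        base[0] = 1;             // root, begin = 1
        check[1 + 'a' + 1] = 1;  // slot 99 is the 'a' child of the root
        base[1 + 'a' + 1] = 2;   // the 'a' node gets its own begin = 2
        check[2 + 'b' + 1] = 2;  // slot 101 is the 'b' child of slot 99
        base[2 + 'b' + 1] = -1;  // leaf: -(value id 0) - 1, as in the build step
    }

    /** Returns the slot reached from nodePos by c, or -1 if there is no such child. */
    int transition(int nodePos, char c) {
        int b = base[nodePos];
        int p = b + c + 1;
        // Bounds check added for the sketch; the validity test itself is b == check[p].
        if (p < 0 || p >= check.length || b != check[p]) return -1;
        return p;
    }
}
```

Here transition(0, 'a') yields 99 and transition(99, 'b') yields 101, while transition(0, 'b') fails because check[100] is 0 rather than base[0]. The 'a' node needs a begin different from the root's: if both used begin = 1, slot 100 would pass the check for either parent.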
- Given the index of the state that was reached, the output[] array yields that state's emits. Each emit is a keyword index, so the matched entry can be recovered from the array v[], and its start position follows from the length stored in l[]:
    private void storeEmits(int position, int currentState, List<Hit<V>> collectedEmits) {
        int[] hitArray = output[currentState]; // emits of the current state, looked up through output
        if (hitArray != null) {
            for (int hit : hitArray) {
                collectedEmits.add(new Hit<V>(position - l[hit], position, v[hit])); // recover the value from v
            }
        }
    }
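The arithmetic position - l[hit] is easy to check by hand. The following sketch reproduces it outside the library: HitSketch and collect are hypothetical names, hits are rendered as plain strings instead of Hit objects, and position is assumed to be the exclusive end offset of the match.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: turn keyword ids into begin:end=value strings. */
final class HitSketch {
    static List<String> collect(int end, int[] hitIds, int[] lengths, String[] values) {
        List<String> hits = new ArrayList<>();
        if (hitIds == null) return hits; // the state has no emits
        for (int id : hitIds) {
            int begin = end - lengths[id]; // the match started lengths[id] characters earlier
            hits.add(begin + ":" + end + "=" + values[id]);
        }
        return hits;
    }
}
```

In the text "ushers", the state reached after the fourth character emits the ids of "he" and "she". With lengths {2, 3, 4}, the call collect(4, {0, 1}, ...) gives 2:4=he and 1:4=she, which agree with "ushers".substring(2, 4) and "ushers".substring(1, 4).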