Master keyword matching and you are a qualified algorithm engineer

Keywords: Database

While building a medical Q&A system over the past two days, I ran into an interesting algorithm that pleasantly surprised me, so I am writing it down. The goal here is to build an AC tree (actree) for the medical domain and use it to pre-process the text of a question before querying the neo4j database. Doing this with find() hardly feels like the work of an algorithm engineer: it is inefficient, and once you are processing thousands of tasks at the same time, the CPU becomes the bottleneck. Implementing the matching with ahocorasick can effectively reduce CPU consumption.
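For comparison, here is a minimal sketch of the find()-based baseline described above. The function name, keyword list, and sample text are made up for illustration and are not from the original project. The point is that the text is rescanned once per keyword, so the work grows with the size of the dictionary.

```python
def find_keywords_naive(text, keywords):
    """Slow baseline: scan the whole text once for every keyword."""
    hits = []
    for word in keywords:
        start = text.find(word)
        while start != -1:
            # Record (start_index, end_index, word), end index inclusive
            hits.append((start, start + len(word) - 1, word))
            start = text.find(word, start + 1)
    return hits


# Hypothetical sample data, only for illustration
keywords = ["fever", "cough", "headache"]
text = "patient reports cough and fever, then more cough"
hits = find_keywords_naive(text, keywords)
print(hits)
```

With a few keywords this is fine; with tens of thousands of medical entities and many concurrent questions, the repeated scans are what an AC automaton avoids.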

Part of the original project's code is shown below (the project is confidential, so only some of the code appears here).

def build_actree(self, wordlist):
    """
    Build an actree to speed up filtering.
    :param wordlist: list of words to add to the automaton
    :return: the finished ahocorasick.Automaton
    """
    actree = ahocorasick.Automaton()
    # Add each word to the tree
    for index, word in enumerate(wordlist):
        actree.add_word(word, (index, word))
    # Build the automaton once, after all words have been added
    actree.make_automaton()
    return actree
  
# Build one actree per entity type in the domain
self.disease_tree = self.build_actree(list(set(self.disease_entities)))
self.alias_tree = self.build_actree(list(set(self.alias_entities)))
self.symptom_tree = self.build_actree(list(set(self.symptom_entities)))
self.complication_tree = self.build_actree(list(set(self.complication_entities)))

self.symptom_qwds = ['What symptoms', 'What are the symptoms', 'What are the symptoms', 'What are the symptoms', 'What token', 'What are the representations', 'What is representation',
                     'What phenomenon', 'What are the phenomena', 'What are the phenomena', 'symptom', 'What is the performance', 'What is the performance', 'What are the performances',
                     'What behavior', 'What behaviors', 'What are the behaviors', "What's the situation", 'What is the situation', 'What are the conditions', 'What is the phenomenon',
                     'What is the performance', 'What is behavior']  # Ask for symptoms
self.cureway_qwds = ['drug', 'drugs', 'Medication', 'capsule', 'oral liquid', 'Inflamed tablet', 'What medicine to take', 'What kind of medicine to use', 'What should I do?',
                     'What medicine to buy', 'How to treat', 'How to treat', 'How to treat', 'How to treat', 'How to treat', 'How to treat',
                     'Method of treatment', 'therapy', 'How to treat', 'how', 'How to treat', 'therapeutic method']  # Ask for treatment

Get to know AC

The AC automaton is a classic data structure for multi-pattern matching. Like KMP, it relies on fail pointers; the difference is that the AC automaton builds them on a Trie of all the patterns rather than on a single pattern, while the underlying idea is the same. I won't go into the theory of AC here. Interested readers can look it up for themselves.

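The Aho-Corasick algorithm is called the AC algorithm for short. It preprocesses the pattern strings into a finite state automaton, so the text can be scanned in a single pass. Matching takes O(n) time in the length n of the text, plus the number of matches reported, so it does not depend on the number or length of the pattern strings (building the automaton beforehand takes time proportional to the total length of the patterns). A comparable multi-pattern algorithm is the Wu-Manber algorithm (proposed by Dr. Sun Wu and Udi Manber).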

Main ideas

The main idea of the AC algorithm is to construct a finite state automaton and then match the pattern strings by feeding the input through it. The automaton changes state as each character is read. There are three kinds of state transition:

  1. Success state: the automaton moves directly to the next state along an existing edge for the input character (no jump);

  2. Failure state: the automaton has no direct edge for the input character from the current state, so it jumps to another path (for example, the root node is the failure state of all of its first-level children);

  3. Output state: a pattern has been successfully matched against a segment of the input.

Step logic

The three kinds of transition above correspond to three steps in the algorithm:

  1. Build the pattern tree, which is the basis of the automaton: simply build a "tree" (a Trie) from the pattern strings;

  2. Establish the failure states: add a failure link to every node except the root, marking the path to jump to when the input matched so far cannot be extended from that node;

  3. Scan the text: whenever an output state is reached, a pattern has been matched successfully.
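The three steps can be sketched in plain Python as follows. This is a teaching sketch of the textbook algorithm, not the implementation inside pyahocorasick; the names `build_automaton` and `search` and the sample patterns are my own.

```python
from collections import deque


def build_automaton(patterns):
    # Step 1: build the Trie. State 0 is the root.
    goto = [{}]      # goto[state][char] -> next state (success transition)
    fail = [0]       # fail[state] -> state to fall back to on a mismatch
    output = [[]]    # output[state] -> patterns that end at this state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pattern)

    # Step 2: set failure links breadth-first.
    # First-level children keep the root (0) as their failure state.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(ch, 0)
            # Inherit patterns that end at the failure state (suffix matches)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def search(text, automaton):
    # Step 3: scan the text once, following success or failure transitions.
    goto, fail, output = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in output[state]:
            matches.append((i - len(pattern) + 1, i, pattern))
    return matches


automaton = build_automaton(["he", "she", "his", "hers"])
matches = search("ushers", automaton)
print(matches)
```

Each character of the text is consumed once; a mismatch follows failure links instead of restarting the scan, which is where the single-pass behavior comes from.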

Visual example

Now let's get a feel for it directly through code.

import ahocorasick
A = ahocorasick.Automaton()
for idx, key in enumerate('I often read articles in WeChat official account dragon club.'.split()):
    A.add_word(key, (idx, key))
# Convert the trie into an automaton; A.iter() cannot be used before this call
A.make_automaton()

sentence = 'I love to read the official account of WeChat public number dragon club in Beijing Tiananmen.'
for end_index, (insert_order, original_value) in A.iter(sentence):
    start_index = end_index - len(original_value) + 1
    print((start_index, end_index, (insert_order, original_value)))
    assert sentence[start_index:start_index + len(original_value)] == original_value

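Each printed tuple has the form (start_index, end_index, (insert_order, original_value)). The output below appears to come from the original Chinese version of this example, so its words and offsets do not line up with the translated strings above: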

(0, 0, (0, 'I'))
(4, 4, (2, 'stay'))
(15, 15, (6, 'see'))
(17, 18, (3, 'WeChat'))
(20, 22, (4, 'official account'))
(24, 26, (5, 'Malong Society'))
(30, 31, (7, 'article'))

Official installation link: https://pypi.org/project/pyahocorasick/


Posted by Arsench on Tue, 23 Jun 2020 22:37:37 -0700