I. Introduction to DEA algorithm
DFA is the only good algorithm to implement text filtering.
The full name of DFA is Deterministic Finite Automaton, that is, Deterministic Finite Automaton. It is characterized by a finite set of States and some edges from one state to another. Each edge is marked with a symbol, one of which is the initial state and some of which is the final state. However, unlike the uncertain finite automata, there will not be two edge markers from the same state in DFA with the same symbol.
Simply put, it is to get the next state through event and current state, that is, event + state= nextstate. It is understood that there are multiple nodes in the system, which determine which route to take to another node by passing the incoming event, while the nodes are limited.
II. DEA algorithm practice sensitive word filtering
1. Construction of sensitive Thesaurus
In this paper, two sensitive words, Wang Badan and Wang babangzi, are described. First, a sensitive thesaurus is constructed, which is named sensitive map. The binary tree of these two words is constructed as follows:
The hash table is constructed as follows:
{ "king":{ "isEnd":"0", "Eight":{ "Lambs":{ "son":{ "isEnd":"1" }, "isEnd":"0" }, "isEnd":"0", "egg":{ "isEnd":"1" } } } }
How to implement this data structure with code?
/** * Read the sensitive words, put the sensitive words into the HashSet, and build a DFA algorithm model * * @param keyWordSet Sensitive Thesaurus */ public Map<String, Object> addSensitiveWordToHashMap(Set<String> keyWordSet) { //Initialize sensitive word container to reduce expansion operation Map<String, Object> map = new HashMap(Math.max((int) (keyWordSet.size() / .75f) + 1, 16)); //Iteration keyWordSet for (String aKeyWordSet : keyWordSet) { Map nowMap = map; for (int i = 0; i < aKeyWordSet.length(); i++) { //Convert to char char keyChar = aKeyWordSet.charAt(i); //Obtain Object wordMap = nowMap.get(keyChar); //If the key exists, assign it directly if (wordMap != null) { nowMap = (Map) wordMap; } else { //If not, build a map and set isEnd to 0 Map<String, String> newWorMap = new HashMap<>(3); newWorMap.put("isEnd", "0"); nowMap.put(keyChar, newWorMap); nowMap = newWorMap; } //Judge the last if (i == aKeyWordSet.length() - 1) { nowMap.put("isEnd", "1"); } } } return map; }
2. Sensitive word filtering
Use the sensitive map constructed by the above example as the sensitive thesaurus for illustration. Suppose that the key words entered here are: Wang Ba is not good, and the flow chart is as follows:
How to implement the flow chart logic with code?
/** * Find if the string contains sensitive characters * * @param txt String entered * @return If it exists, the sensitive string is returned; if it does not exist, the empty string is returned */ public static String findSensitiveWord(String txt) { SensitiveWordInit sensitiveWordInit = SpringContextHolder.getBean(SensitiveWordInit.class); Map<String, Object> sensitiveWordMap = sensitiveWordInit.getSensitiveWordMap(); StringBuilder sensitiveWord = new StringBuilder(); // End marker of sensitive words, indicating that they are matched to the last digit boolean flag = false; for (int i = 0; i < txt.length(); i++) { char word = txt.charAt(i); // Get the specified key sensitiveWordMap = (Map) sensitiveWordMap.get(word); // No, no sensitive words are returned directly if (sensitiveWordMap == null) { break; } //Exist, store the sensitive word and judge whether it is the last sensitiveWord.append(word); //End loop if last match rule if ("1".equals(sensitiveWordMap.get("isEnd"))) { flag = true; break; } } // The expression matches to the whole sensitive word if (flag == true) { return sensitiveWord.toString(); } return ""; }
III. optimization ideas
For words like "Wang * Ba & & Dan", meaningless characters are filled in the middle to confuse. When we do sensitive word search, we should also do a meaningless word filter. When we cycle to such meaningless characters, we should skip them to avoid interference.