Using the DFA algorithm to implement text filtering

Keywords: Java

I. Introduction to the DFA algorithm

A DFA is a common and efficient way to implement text filtering.

DFA stands for Deterministic Finite Automaton. It consists of a finite set of states and a set of edges that lead from one state to another, each edge labeled with a symbol. One state is the initial state and some states are final states. Unlike a nondeterministic finite automaton, a DFA never has two edges that leave the same state with the same symbol.

Put simply, the next state is determined by the current state and the incoming event, that is, event + state = nextstate. Think of the system as a finite number of nodes: the incoming event decides which edge to follow from the current node to the next one.
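The relation event + state = nextstate can be sketched as a lookup table. The following is only an illustration and not part of the original code: the class name TransitionDemo, the state names and the accepted string "ab" are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: a tiny DFA that accepts exactly the string "ab".
class TransitionDemo {
    // state -> (event -> next state); a missing entry means "no edge"
    private static final Map<String, Map<Character, String>> TRANSITIONS = new HashMap<>();

    static {
        Map<Character, String> fromStart = new HashMap<>();
        fromStart.put('a', "sawA");
        TRANSITIONS.put("start", fromStart);

        Map<Character, String> fromSawA = new HashMap<>();
        fromSawA.put('b', "accepted");
        TRANSITIONS.put("sawA", fromSawA);
    }

    // Feeds the input to the automaton one event at a time and reports
    // whether it ends in the final state.
    static boolean accepts(String input) {
        String state = "start";
        for (char event : input.toCharArray()) {
            Map<Character, String> edges = TRANSITIONS.get(state);
            state = (edges == null) ? null : edges.get(event);
            if (state == null) {
                return false; // no edge for this event from the current state
            }
        }
        return "accepted".equals(state);
    }

    public static void main(String[] args) {
        System.out.println(accepts("ab")); // true
        System.out.println(accepts("aa")); // false
    }
}
```

Because each (state, event) pair maps to at most one next state, the lookup never has to choose between alternatives, which is what makes the automaton deterministic.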

II. Sensitive word filtering with the DFA algorithm

1. Building the sensitive word thesaurus

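This article uses two sensitive words, Wang Badan and Wang babangzi, as the example. First, a sensitive word thesaurus is built, named the sensitive map. The two words share their first two characters, so they form a tree that branches after the shared prefix (a prefix tree rather than a strict binary tree).

As a nested hash table, the structure looks like this. Each key is a single character of a word, shown here as an English gloss: "king", "Eight" and "egg" are the three characters of Wang Badan. The "isEnd" entry marks whether a sensitive word ends at that node: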

{
    "king":{
        "isEnd":"0",
        "Eight":{
            "Lambs":{
                "son":{
                    "isEnd":"1"
                },
                "isEnd":"0"
            },
            "isEnd":"0",
            "egg":{
                "isEnd":"1"
            }
        }
    }
}

How can this data structure be implemented in code?

    /**
     * Reads the sensitive words and builds the nested HashMap (the DFA model).
     * Each level is keyed by a single Character, plus the String key "isEnd",
     * so the nested levels are handled as raw Maps.
     *
     * @param keyWordSet the set of sensitive words
     * @return the root of the sensitive map
     */
    @SuppressWarnings({"rawtypes", "unchecked"})
    public Map<String, Object> addSensitiveWordToHashMap(Set<String> keyWordSet) {
        // Pre-size the container to reduce resize operations
        Map<String, Object> map = new HashMap<>(Math.max((int) (keyWordSet.size() / .75f) + 1, 16));
        for (String keyWord : keyWordSet) {
            // Every word starts at the root
            Map nowMap = map;
            for (int i = 0; i < keyWord.length(); i++) {
                char keyChar = keyWord.charAt(i);
                // Look up the child node for this character
                Object wordMap = nowMap.get(keyChar);
                if (wordMap != null) {
                    // The node already exists: descend into it
                    nowMap = (Map) wordMap;
                } else {
                    // Otherwise create a new node with isEnd set to 0
                    Map<Object, Object> newWordMap = new HashMap<>(3);
                    newWordMap.put("isEnd", "0");
                    nowMap.put(keyChar, newWordMap);
                    nowMap = newWordMap;
                }
                // Mark the node of the last character as the end of a word
                if (i == keyWord.length() - 1) {
                    nowMap.put("isEnd", "1");
                }
            }
        }
        return map;
    }
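The nested map mixes Character keys with the String key "isEnd", which forces raw types and casts. As an alternative sketch, and not the code this article uses, the same tree can be held in a small typed node class. The class name TrieNode and the stand-in words "abc" and "abde" (chosen only because they share a prefix, like the two example words) are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical typed variant of the nested-map structure.
class TrieNode {
    // One child per next character
    final Map<Character, TrieNode> children = new HashMap<>();
    // Plays the role of "isEnd": true if a sensitive word ends at this node
    boolean end;

    // Builds the tree from the given words and returns its root.
    static TrieNode build(Iterable<String> words) {
        TrieNode root = new TrieNode();
        for (String word : words) {
            TrieNode node = root;
            for (char c : word.toCharArray()) {
                // Reuse the existing child or create it on first use
                node = node.children.computeIfAbsent(c, k -> new TrieNode());
            }
            if (!word.isEmpty()) {
                node.end = true;
            }
        }
        return root;
    }

    public static void main(String[] args) {
        // Stand-in words that share the prefix "ab"
        List<String> words = Arrays.asList("abc", "abde");
        TrieNode root = build(words);
        TrieNode ab = root.children.get('a').children.get('b');
        System.out.println(ab.children.size()); // 2 branches after the shared prefix
        System.out.println(ab.end);             // false: "ab" itself is not a word
    }
}
```

The boolean field replaces the "0"/"1" strings, and the compiler can check every lookup, at the cost of no longer matching the JSON-like layout shown earlier.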

2. Sensitive word filtering

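Take the sensitive map built above as the thesaurus and suppose the input text is "Wang Ba is not good". Matching walks the map one character at a time: each character is looked up in the current node, the search moves to the child node if the character is found, and it stops either when a node with isEnd = 1 is reached (a sensitive word has been matched) or when the character is missing (no match).

How can this logic be implemented in code?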

    /**
     * Checks whether the text starts with a sensitive word.
     * Matching begins at the first character of txt and stops at the first
     * character that is not in the sensitive map.
     *
     * @param txt the input string
     * @return the matched sensitive word, or an empty string if there is no match
     */
    @SuppressWarnings({"rawtypes", "unchecked"})
    public static String findSensitiveWord(String txt) {
        SensitiveWordInit sensitiveWordInit = SpringContextHolder.getBean(SensitiveWordInit.class);
        Map<String, Object> sensitiveWordMap = sensitiveWordInit.getSensitiveWordMap();
        StringBuilder sensitiveWord = new StringBuilder();
        // Set to true once the last character of a sensitive word has been matched
        boolean flag = false;
        for (int i = 0; i < txt.length(); i++) {
            char word = txt.charAt(i);
            // Look up the child node for the current character
            sensitiveWordMap = (Map) sensitiveWordMap.get(word);
            // No child node: the text does not continue any sensitive word
            if (sensitiveWordMap == null) {
                break;
            }
            // The character is part of a candidate word: record it
            sensitiveWord.append(word);
            // Stop as soon as the end of a sensitive word is reached
            if ("1".equals(sensitiveWordMap.get("isEnd"))) {
                flag = true;
                break;
            }
        }
        // Only return the collected characters if a whole word was matched
        if (flag) {
            return sensitiveWord.toString();
        }
        return "";
    }
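The method above only reports a word that begins at the first character of the text. To search the whole text, the same walk can be restarted from every position. The sketch below is self-contained and makes several assumptions: the class name SensitiveWordScanner, the helper names build and findFirst, and the stand-in words "abc" and "abde" are hypothetical, and the map is built inline instead of being fetched from SensitiveWordInit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the map layout mirrors the article's nested HashMap.
@SuppressWarnings("unchecked")
class SensitiveWordScanner {

    // Compact builder: Character keys for edges, "isEnd" -> "0"/"1" as end marker.
    static Map<Object, Object> build(String... words) {
        Map<Object, Object> root = new HashMap<>();
        for (String word : words) {
            Map<Object, Object> node = root;
            for (int i = 0; i < word.length(); i++) {
                Map<Object, Object> next = (Map<Object, Object>) node.get(word.charAt(i));
                if (next == null) {
                    next = new HashMap<>();
                    next.put("isEnd", "0");
                    node.put(word.charAt(i), next);
                }
                node = next;
            }
            if (node != root) {
                node.put("isEnd", "1");
            }
        }
        return root;
    }

    // Returns the first sensitive word found anywhere in txt, or "" if none.
    static String findFirst(Map<Object, Object> root, String txt) {
        for (int start = 0; start < txt.length(); start++) {
            // Restart at the root for every start position
            Map<Object, Object> node = root;
            for (int i = start; i < txt.length(); i++) {
                Map<Object, Object> next = (Map<Object, Object>) node.get(txt.charAt(i));
                if (next == null) {
                    break; // no word continues here; try the next start position
                }
                node = next;
                if ("1".equals(node.get("isEnd"))) {
                    return txt.substring(start, i + 1);
                }
            }
        }
        return "";
    }

    public static void main(String[] args) {
        Map<Object, Object> root = build("abc", "abde");
        System.out.println(findFirst(root, "xxabdeyy")); // abde
        System.out.println(findFirst(root, "xxabdy").isEmpty()); // true
    }
}
```

In the worst case this costs the text length times the longest word length, which is usually acceptable because sensitive words are short.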

III. Optimization ideas

Some texts insert meaningless characters between the characters of a sensitive word to disguise it, for example "Wang * Ba & & Dan". The sensitive word search should therefore also filter meaningless characters: when the loop reaches such a character, it should skip it instead of treating it as a mismatch.
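One way to sketch this idea is to skip filler characters inside the matching loop without changing the current node. The code below is an illustration under stated assumptions, not the article's implementation: the class name NoiseTolerantFilter, the helpers isNoise and matchLength, and the rule that any character that is not a letter or digit counts as filler are all hypothetical choices. The stand-in word "abc" is used instead of the example words.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of noise-tolerant matching on the article's map layout.
@SuppressWarnings("unchecked")
class NoiseTolerantFilter {

    // Compact builder: Character keys for edges, "isEnd" -> "0"/"1" as end marker.
    static Map<Object, Object> build(String... words) {
        Map<Object, Object> root = new HashMap<>();
        for (String word : words) {
            Map<Object, Object> node = root;
            for (char c : word.toCharArray()) {
                Map<Object, Object> next = (Map<Object, Object>) node.get(c);
                if (next == null) {
                    next = new HashMap<>();
                    next.put("isEnd", "0");
                    node.put(c, next);
                }
                node = next;
            }
            if (node != root) {
                node.put("isEnd", "1");
            }
        }
        return root;
    }

    // Assumption: anything that is not a letter or digit is filler.
    static boolean isNoise(char c) {
        return !Character.isLetterOrDigit(c);
    }

    // Length of the sensitive span starting at 'start' (filler included),
    // or 0 if no sensitive word starts there.
    static int matchLength(Map<Object, Object> root, String txt, int start) {
        Map<Object, Object> node = root;
        for (int i = start; i < txt.length(); i++) {
            char c = txt.charAt(i);
            // Inside a candidate match, filler is skipped and the state is kept
            if (i > start && isNoise(c)) {
                continue;
            }
            Map<Object, Object> next = (Map<Object, Object>) node.get(c);
            if (next == null) {
                return 0;
            }
            node = next;
            if ("1".equals(node.get("isEnd"))) {
                return i - start + 1;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        Map<Object, Object> root = build("abc");
        System.out.println(matchLength(root, "a*b&&c", 0)); // 6
        System.out.println(matchLength(root, "a*b&&x", 0)); // 0
    }
}
```

Character.isLetterOrDigit treats Chinese characters as letters, so the same rule would apply to the original example words. A production filter would more likely use an explicit list of ignorable characters, since punctuation can be meaningful in some sensitive words.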

Posted by Gorf on Tue, 26 Nov 2019 22:40:44 -0800