Serverless in Practice: Sensitive Word Filtering for Text in 3 Minutes

Keywords: github encoding JSON Python

Sensitive word filtering is a technique that emerged alongside online communities to help prevent cybercrime and online abuse. By screening text for keywords that may indicate such behavior, we can catch problems early and stop serious consequences before they develop.

With the popularity of social platforms, sensitive word filtering has become a very important feature. So how can sensitive word filtering be implemented with Python on a Serverless architecture? Can we build a sensitive word filtering API in the simplest way possible?

Several Methods of Sensitive Word Filtering

Replace method

Sensitive word filtering is essentially text replacement. In Python, word replacement immediately brings replace to mind. We can prepare a sensitive lexicon and substitute each entry with replace:

def worldFilter(keywords, text):
    for eve in keywords:
        text = text.replace(eve, "***")
    return text
keywords = ("keyword 1", "keyword 2", "keyword 3")  # lowercase so they match the text below
content = "This is an example of keyword substitution, involving keyword 1, keyword 2, and finally keyword 3."
print(worldFilter(keywords, content))

However, this method has serious performance problems when the text and the sensitive lexicon are very large. For example, we can modify the code to run a basic performance test:

import time

def worldFilter(keywords, text):
    for eve in keywords:
        text = text.replace(eve, "***")
    return text
keywords = ["Key word" + str(i) for i in range(0, 10000)]
content = "This is an example of keyword substitution, involving keyword 1, keyword 2, and finally keyword 3." * 1000
startTime = time.time()
worldFilter(keywords, content)
print(time.time()-startTime)

The output is 0.12426114082336426 (seconds), which shows that performance is poor.
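A single time.time() measurement can be noisy, so the figure above is only a rough indication. As a hedged sketch, the standard timeit module can repeat the run and keep the fastest time. The helper name best_of and the smaller test sizes are my own assumptions, not part of the original test:

```python
import timeit


def worldFilter(keywords, text):
    # Same replace-based filter as in the listing above.
    for eve in keywords:
        text = text.replace(eve, "***")
    return text


def best_of(func, *args, repeat=3):
    """Run func(*args) `repeat` times and return the fastest time in seconds.

    best_of is a hypothetical helper used for illustration only.
    """
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=1))


# Smaller sizes than the original test so the sketch runs quickly.
keywords = ["Key word" + str(i) for i in range(0, 1000)]
content = "This is an example of keyword substitution." * 100
elapsed = best_of(worldFilter, keywords, content)
print("best of 3: %.6f s" % elapsed)
```

Taking the minimum of several runs reduces the influence of other processes on the machine; the absolute numbers will still differ from one environment to another.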

Regular expression

Compared with replace, using a regular expression with re.sub can be faster:

import time
import re
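def worldFilter(keywords, text):
    # Escape each keyword so characters such as "." or "*" are matched literally
    return re.sub("|".join(re.escape(k) for k in keywords), "***", text)
keywords = ["Key word" + str(i) for i in range(0, 10000)]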
content = "This is an example of keyword substitution, involving keyword 1, keyword 2, and finally keyword 3." * 1000
startTime = time.time()
worldFilter(keywords, content)
print(time.time()-startTime)

Running the same performance test gives 0.24773502349853516 seconds. In this example the result is not impressive, but as the amount of text grows, the regular expression approach is expected to perform much better.
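When the same lexicon is applied to many texts, the pattern can be compiled once and reused. The sketch below also sorts keywords longest first, because alternation picks the first branch that matches. The names build_pattern and regex_filter are illustrative assumptions, not part of the original code:

```python
import re


def build_pattern(keywords):
    """Compile one alternation pattern for all keywords (hypothetical helper)."""
    # Longest first, so "keyword 10" is tried before its prefix "keyword 1".
    ordered = sorted(set(keywords), key=len, reverse=True)
    # re.escape keeps characters such as "." or "*" literal.
    return re.compile("|".join(re.escape(k) for k in ordered))


def regex_filter(pattern, text, repl="***"):
    # Replace every match of the precompiled pattern.
    return pattern.sub(repl, text)


pattern = build_pattern(["keyword 1", "keyword 10", "a.b"])
print(regex_filter(pattern, "keyword 10, keyword 1, a.b and axb"))
# prints: ***, ***, *** and axb
```

Note that an empty keyword list would produce an empty pattern that matches everywhere, so a real service should check for that case before compiling.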

Filtering Sensitive Words with a DFA

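This method is relatively more efficient. For example, if we treat "bad egg", "bad people", and "bad child" as sensitive words, they share the prefix "bad" and can be arranged as a tree. (The keys below are translated character by character from the original words, which is why the child entry nests one level deeper.)

Expressed as a DFA dictionary: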

{
    'bad': {
        'egg': {
            '\x00': 0
        }, 
        'people': {
            '\x00': 0
        }, 
        'Child': {
            'son': {
                '\x00': 0
            }
        }
    }
}
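As a minimal sketch of how a nested dictionary like this can be built from a word list, assuming one dictionary level per character and the same '\x00' end marker (build_chains is a hypothetical helper, simpler than the parse method shown later):

```python
def build_chains(words, delimit="\x00"):
    """Build a nested-dict trie, one level per character (illustrative helper)."""
    root = {}
    for word in words:
        level = root
        for char in word.strip().lower():
            # setdefault reuses an existing branch or creates a new one.
            level = level.setdefault(char, {})
        if level is not root:
            level[delimit] = 0  # mark the end of a complete word
    return root


chains = build_chains(["ab", "ac", "ade"])
print(chains)
# prints: {'a': {'b': {'\x00': 0}, 'c': {'\x00': 0}, 'd': {'e': {'\x00': 0}}}}
```

The three words share the first level 'a', just as the three sensitive words above share their first key.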

The main benefit of this tree representation is that it reduces the number of lookups and so improves retrieval efficiency. A basic implementation:

import time

class DFAFilter(object):
    def __init__(self):
        self.keyword_chains = {}  # Nested dict of keyword chains (the trie)
        self.delimit = '\x00'  # End-of-word marker

    def parse(self, path):
        with open(path, encoding='utf-8') as f:
            for keyword in f:
                chars = str(keyword).strip().lower()  # Normalize keywords to lowercase
                if not chars:  # Skip empty lines instead of aborting the whole load
                    continue
                level = self.keyword_chains
                for i in range(len(chars)):
                    if chars[i] in level:
                        level = level[chars[i]]
                    else:
                        if not isinstance(level, dict):
                            break
                        for j in range(i, len(chars)):
                            level[chars[j]] = {}
                            last_level, last_char = level, chars[j]
                            level = level[chars[j]]
                        last_level[last_char] = {self.delimit: 0}
                        break
                if i == len(chars) - 1:
                    level[self.delimit] = 0

    def filter(self, message, repl="*"):
        message = message.lower()
        ret = []
        start = 0
        while start < len(message):
            level = self.keyword_chains
            step_ins = 0
            for char in message[start:]:
                if char in level:
                    step_ins += 1
                    if self.delimit not in level[char]:
                        level = level[char]
                    else:
                        ret.append(repl * step_ins)
                        start += step_ins - 1
                        break
                else:
                    ret.append(message[start])
                    break
            else:
                ret.append(message[start])
            start += 1

        return ''.join(ret)



gfw = DFAFilter()
gfw.parse("./sensitive_words")
content = "This is an example of keyword substitution, involving keyword 1, keyword 2, and finally keyword 3." * 1000
startTime = time.time()
result = gfw.filter(content)
print(time.time()-startTime)

Here the lexicon file is generated as follows:

with open("./sensitive_words", 'w', encoding='utf-8') as f:
    f.write("\n".join(["Key word" + str(i) for i in range(0, 10000)]))

Execution results:

0.06450581550598145

This is a further improvement in performance.

Filtering Sensitive Words with an AC Automaton

Next, let's look at filtering sensitive words with an AC (Aho-Corasick) automaton.

A classic use of the AC automaton: given n words and an article of m characters, find out how many of the words appear in the article.

Simply put, an AC automaton is a dictionary tree (trie) combined with the KMP idea of failure (mismatch) pointers.

Code implementation:

import time
class Node(object):
    def __init__(self):
        self.next = {}
        self.fail = None
        self.isWord = False
        self.word = ""


class AcAutomation(object):

    def __init__(self):
        self.root = Node()

    # Find Sensitive Word Function
    def search(self, content):
        p = self.root
        result = []
        currentposition = 0

        while currentposition < len(content):
            word = content[currentposition]
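            # Follow failure links on a mismatch. parse() never sets them, so
            # fail is None here and the search restarts from the root.
            while word not in p.next and p is not self.root:
                p = p.fail if p.fail is not None else self.root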

            if word in p.next:
                p = p.next[word]
            else:
                p = self.root

            if p.isWord:
                result.append(p.word)
                p = self.root
            currentposition += 1
        return result

    # Load Sensitive Lexicon Function
    def parse(self, path):
        with open(path, encoding='utf-8') as f:
            for keyword in f:
                temp_root = self.root
                for char in str(keyword).strip():
                    if char not in temp_root.next:
                        temp_root.next[char] = Node()
                    temp_root = temp_root.next[char]
                temp_root.isWord = True
                temp_root.word = str(keyword).strip()

    # Sensitive Word Substitution Function
    def wordsFilter(self, text):
        """
        :param text: text to filter
        :return: text with sensitive words replaced by asterisks
        """
        result = list(set(self.search(text)))
        for x in result:
            m = text.replace(x, '*' * len(x))
            text = m
        return text


acAutomation = AcAutomation()
acAutomation.parse('./sensitive_words')
startTime = time.time()
print(acAutomation.wordsFilter("This is an example of keyword substitution, involving keyword 1, keyword 2, and finally keyword 3."*1000))
print(time.time()-startTime)

The lexicon is the same:

with open("./sensitive_words", 'w', encoding='utf-8') as f:
    f.write("\n".join(["Key word" + str(i) for i in range(0, 10000)]))

With this approach, the test result is 0.017391204833984375 seconds.
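In the listing above, parse never assigns Node.fail, so the search behaves like a plain trie scan. As a hedged sketch of how failure pointers are commonly built in an Aho-Corasick automaton, a breadth-first pass over the trie can set them. The function names build_trie, build_fail_links, and search, and the words list on each node, are my own illustrative choices rather than the author's code:

```python
from collections import deque


class Node:
    def __init__(self):
        self.next = {}    # char -> child Node
        self.fail = None  # failure link, filled in by build_fail_links
        self.words = []   # keywords that end at this node


def build_trie(keywords):
    root = Node()
    for keyword in keywords:
        if not keyword:
            continue  # ignore empty entries
        node = root
        for char in keyword:
            node = node.next.setdefault(char, Node())
        node.words.append(keyword)
    return root


def build_fail_links(root):
    """Breadth-first pass that sets node.fail for every node below the root."""
    queue = deque()
    for child in root.next.values():
        child.fail = root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for char, child in node.next.items():
            fail = node.fail
            while fail is not root and char not in fail.next:
                fail = fail.fail
            child.fail = fail.next.get(char, root)
            # Inherit keywords that end at the failure target (suffix matches).
            child.words = child.words + child.fail.words
            queue.append(child)


def search(root, text):
    """Return (start_index, keyword) for every occurrence, overlaps included."""
    found = []
    node = root
    for index, char in enumerate(text):
        while node is not root and char not in node.next:
            node = node.fail
        node = node.next.get(char, root)
        for word in node.words:
            found.append((index - len(word) + 1, word))
    return found


root = build_trie(["he", "she", "hers"])
build_fail_links(root)
print(search(root, "ushers"))
# prints: [(1, 'she'), (2, 'he'), (2, 'hers')]
```

With failure links in place, the text is scanned once and overlapping keywords such as "she" and "he" are both reported, which a scan that resets to the root after each match would miss.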

Summary of Sensitive Word Filtering Methods

Among the approaches above, the DFA and AC automaton filters are clearly faster than replace and regular expressions in these tests. Between those two, neither is always better: the DFA filter performs well, and in some cases, as in the test above, the AC automaton is faster. In production, either of the two can be used, and the choice should follow your specific business needs.

Implementing a Sensitive Word Filtering API

To deploy the code on a Serverless architecture, we can combine an API gateway with a cloud function. Taking the AC automaton filter as an example, we only need to add a few lines of code. The complete code is as follows:

# -*- coding:utf-8 -*-

import json, uuid


class Node(object):
    def __init__(self):
        self.next = {}
        self.fail = None
        self.isWord = False
        self.word = ""


class AcAutomation(object):

    def __init__(self):
        self.root = Node()

    # Find Sensitive Word Function
    def search(self, content):
        p = self.root
        result = []
        currentposition = 0

        while currentposition < len(content):
            word = content[currentposition]
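            # Follow failure links on a mismatch. parse() never sets them, so
            # fail is None here and the search restarts from the root.
            while word not in p.next and p is not self.root:
                p = p.fail if p.fail is not None else self.root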

            if word in p.next:
                p = p.next[word]
            else:
                p = self.root

            if p.isWord:
                result.append(p.word)
                p = self.root
            currentposition += 1
        return result

    # Load Sensitive Lexicon Function
    def parse(self, path):
        with open(path, encoding='utf-8') as f:
            for keyword in f:
                temp_root = self.root
                for char in str(keyword).strip():
                    if char not in temp_root.next:
                        temp_root.next[char] = Node()
                    temp_root = temp_root.next[char]
                temp_root.isWord = True
                temp_root.word = str(keyword).strip()

    # Sensitive Word Substitution Function
    def wordsFilter(self, text):
        """
        :param text: text to filter
        :return: text with sensitive words replaced by asterisks
        """
        result = list(set(self.search(text)))
        for x in result:
            m = text.replace(x, '*' * len(x))
            text = m
        return text


def response(msg, error=False):
    return_data = {
        "uuid": str(uuid.uuid1()),
        "error": error,
        "message": msg
    }
    print(return_data)
    return return_data


acAutomation = AcAutomation()
path = './sensitive_words'
acAutomation.parse(path)


def main_handler(event, context):
    try:
        sourceContent = json.loads(event["body"])["content"]
        return response({
            "sourceContent": sourceContent,
            "filtedContent": acAutomation.wordsFilter(sourceContent)
        })
    except Exception as e:
        return response(str(e), True)
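The handler above reports any failure through str(e), so a missing field shows up as a bare key name. As a hedged sketch, the body could be validated first so the error message is readable. The helper extract_content is hypothetical, and it assumes the event carries the request body as a JSON string under "body", as in the handler above:

```python
import json


def extract_content(event):
    """Return the "content" string from an API gateway style event.

    extract_content is a hypothetical helper; it raises ValueError with a
    readable message instead of a bare KeyError or JSON error.
    """
    body = event.get("body")
    if not body:
        raise ValueError("request body is empty")
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ValueError("request body is not valid JSON") from exc
    if not isinstance(data, dict) or not isinstance(data.get("content"), str):
        raise ValueError('request body must be a JSON object with a string "content" field')
    return data["content"]


print(extract_content({"body": "{\"content\": \"hello\"}"}))
# prints: hello
```

Inside main_handler, the line that reads event["body"] could call such a helper instead; the existing except branch would then return the clearer message.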

Finally, to facilitate local testing, we can add:

def test():
    event = {
        "requestContext": {
            "serviceId": "service-f94sy04v",
            "path": "/test/{path}",
            "httpMethod": "POST",
            "requestId": "c6af9ac6-7b61-11e6-9a41-93e8deadbeef",
            "identity": {
                "secretId": "abdcdxxxxxxxsdfs"
            },
            "sourceIp": "14.17.22.34",
            "stage": "release"
        },
        "headers": {
            "Accept-Language": "en-US,en,cn",
            "Accept": "text/html,application/xml,application/json",
            "Host": "service-3ei3tii4-251000691.ap-guangzhou.apigateway.myqloud.com",
            "User-Agent": "User Agent String"
        },
        "body": "{\"content\":\"This is a test text, so I will.\"}",
        "pathParameters": {
            "path": "value"
        },
        "queryStringParameters": {
            "foo": "bar"
        },
        "headerParameters": {
            "Refer": "10.0.2.14"
        },
        "stageVariables": {
            "stage": "release"
        },
        "path": "/test/value",
        "queryString": {
            "foo": "bar",
            "bob": "alice"
        },
        "httpMethod": "POST"
    }
    print(main_handler(event, None))


if __name__ == "__main__":
    test()

Once this is done, we can run a test. For example, my lexicon is:

Ha-ha
test

Results after execution:

{'uuid': '9961ae2a-5cfc-11ea-a7c2-acde48001122', 'error': False, 'message': {'sourceContent': 'This is a test text, so I will.', 'filtedContent': 'This is a **** text, so I will.'}}

Next, we deploy the code to the cloud. First, create a new serverless.yaml:

sensitive_word_filtering:
  component: "@serverless/tencent-scf"
  inputs:
    name: sensitive_word_filtering
    codeUri: ./
    exclude:
      - .gitignore
      - .git/**
      - .serverless
      - .env
    handler: index.main_handler
    runtime: Python3.6
    region: ap-beijing
    description: Sensitive Word Filtering
    memorySize: 64
    timeout: 2
    events:
      - apigw:
          name: serverless
          parameters:
            environment: release
            endpoints:
              - path: /sensitive_word_filtering
                description: Sensitive Word Filtering
                method: POST
                enableCORS: true
                param:
                  - name: content
                    position: BODY
                    required: 'FALSE'
                    type: string
                    desc: Sentences to be filtered

Then deploy with sls --debug and check the deployment result in its output.

Finally, test the API with Postman.
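The same request can also be prepared from Python with the standard library. This is a sketch under assumptions: API_URL is a placeholder, not the real address, and must be replaced with the URL printed by your own deployment. The JSON body mirrors what main_handler reads:

```python
import json
import urllib.request

# Placeholder URL (an assumption): replace it with the address of your deployment.
API_URL = "https://example.com/release/sensitive_word_filtering"


def build_request(content, url=API_URL):
    """Build (but do not send) a POST request carrying {"content": ...}."""
    payload = json.dumps({"content": content}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request("This is a test text, so I will.")
print(request.get_method(), request.full_url)

# To actually call the API (requires network access and a real URL):
# with urllib.request.urlopen(request, timeout=5) as resp:
#     print(json.loads(resp.read().decode("utf-8")))
```

The request is only built here, so the sketch runs without network access; uncomment the last lines once the URL points at a real endpoint.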

Summary

Sensitive word filtering is a very common requirement today. It can reduce the appearance of malicious or illegal speech to a certain extent. Two points are worth noting from the practice above:

  • Obtaining a sensitive lexicon: there are many sensitive lexicons on GitHub that you can search for and download. Because they contain many sensitive words, I cannot include one here directly, so you will need to find one on GitHub yourself;
  • Usage scenarios for this API: it can be placed in a community posting system, comment system, or blog publishing system to block sensitive words and avoid unnecessary trouble.
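If several downloaded lexicons are combined, they can be merged into the single ./sensitive_words file that parse expects. The sketch below assumes each source file holds one word per line; merge_lexicons and the temporary demo files are hypothetical and stand in for real downloads:

```python
import os
import tempfile


def merge_lexicons(paths, output_path):
    """Merge several one-word-per-line files into one deduplicated lexicon.

    merge_lexicons is a hypothetical helper; the file layout is assumed.
    """
    seen = set()
    words = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                word = line.strip()
                # Skip blank lines and words already collected.
                if word and word not in seen:
                    seen.add(word)
                    words.append(word)
    with open(output_path, "w", encoding="utf-8") as f:
        f.write("\n".join(words))
    return words


# Demo with two small temporary files standing in for downloaded lexicons.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.txt")
    second = os.path.join(tmp, "b.txt")
    with open(first, "w", encoding="utf-8") as f:
        f.write("Ha-ha\n test\n\n")
    with open(second, "w", encoding="utf-8") as f:
        f.write("test\nspam\n")
    merged = merge_lexicons([first, second], os.path.join(tmp, "sensitive_words"))
print(merged)
# prints: ['Ha-ha', 'test', 'spam']
```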

Serverless Framework 30-day Trial Plan

We invite you to experience the most convenient way to develop and deploy Serverless applications. During the trial period, the associated products and services provide free resources and professional technical support to help your business adopt Serverless quickly and easily!

For details, see: Serverless Framework Trial Plan

One More Thing

What can you do in 3 seconds? Take a sip of water, read an email, or deploy a complete Serverless application?

Copy the link and open it in a PC browser: https://serverless.cloud.tencent.com/deploy/express

Deploy in 3 seconds and experience the fastest Serverless HTTP development ever!

Links:

Welcome to the Serverless Chinese Network, where you can explore Best Practices for more Serverless application development!

Recommended reading: Serverless Architecture: From Principles, Design to Project Practice

Posted by Molarmite on Tue, 12 May 2020 09:49:08 -0700