Python Core Technology and Practice: Object-Oriented Case Analysis

Keywords: Python, programming, database

Today, through an object-oriented case study, we will simulate the iterative development process used in agile development and consolidate object-oriented programming ideas.

We start with the simplest possible search and optimize it step by step. First, we need to know the structure of a search engine: crawler, indexer, retriever, and user interface. The crawler fetches a large amount of content from websites across the Internet and sends it to the indexer. After the indexer receives the pages and their content, it processes them into an index, stores the index in an internal database, and waits for retrieval requests. The user interface is the front end, a web page or an app: users send queries to the search engine through the interface, the query is parsed and handed to the retriever, and after retrieval the results are returned to the users.

Instead of focusing on the crawler, let's assume that our corpus consists of five files on a local disk.

# 1.txt 
I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.

# 2.txt
I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.

# 3.txt
I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

# 4.txt
This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .

# 5.txt
And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"

Let's first define a base class

class SearchEngineBase(object):
    def __init__(self):
        pass

    def add_corpus(self, file_path):     # Read the contents of the specified file
        with open(file_path, 'r') as fin:
            text = fin.read()
        self.process_corpus(file_path, text)

    # The following two methods raise an error if they are not overridden in a subclass
    def process_corpus(self, id, text):
        raise Exception('process_corpus not implemented.')

    def search(self, query):
        raise Exception('search not implemented.')


def main(search_engine):

    # First specify the files to be indexed
    for file_path in ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt']:
        search_engine.add_corpus(file_path)

    while True:
        query = input(">>>")
        results = search_engine.search(query)
        print('found {} result(s):'.format(len(results)))
        for result in results:
            print(result)      # Only file names are printed as results

SearchEngineBase is a base class that can be inherited by engines using different algorithms. Each engine must implement the two methods process_corpus() and search(), which correspond to the indexer and the retriever mentioned above. The main() function feeds the corpus to the engine and provides the user interface, giving us a simple, complete wrapper.

Let's analyze the code:

add_corpus() reads the contents of a file, takes the file path as the ID, and passes both to process_corpus().

process_corpus() processes the content and stores the processed content under the file path (the ID). The processed content is called the index.

search() takes a query, processes it, searches the index, and returns the matching results.

Now let's build the simplest possible search engine (it only needs to implement the two required methods).

class SimpleEngine(SearchEngineBase):
    def __init__(self):
        super(SimpleEngine, self).__init__()
        self.__id_to_texts = {}

    def process_corpus(self, id, text):
        self.__id_to_texts[id] = text    # Build a dictionary: key = file name, value = file content

    def search(self, query):             # Brute-force retrieval
        results = []
        for id, text in self.__id_to_texts.items():
            if query in text:            # Traverse the dictionary
                results.append(id)
        return results                   # Don't forget the return value, or main() will keep failing

search_engine = SimpleEngine()
main(search_engine)

Output:

>>>a
found 4 result(s):
1.txt
2.txt
3.txt
5.txt
>>>

When we enter a character, we get the corresponding output. Let's walk through the code.

SimpleEngine is a subclass of SearchEngineBase: it implements the process_corpus() and search() interfaces and also inherits the add_corpus() method (which could also be overridden), so we can call it directly in main().

In our new constructor

super(SimpleEngine, self).__init__()   # Call the parent class's constructor
self.__id_to_texts = {}                # Initialize a new attribute

The newly initialized dictionary maps file names to file contents.

process_corpus() simply inserts the file content into the dictionary. Note that the ID must be unique here; otherwise a later entry with the same ID will overwrite the old content.

search() simply enumerates the dictionary looking for the query string, and appends the ID to the result list whenever the string is found.

Let's draw a dividing line here and look at a slightly more complex search engine. The first version is the simplest approach, but it is obviously very inefficient: it uses a lot of space, because process_corpus() does essentially no processing; and every search takes a lot of time, because all the files in the index have to be scanned again for each query. If the size of the corpus is n, then both the time and space complexity are O(n).

Another problem is that the query can only be a single word or several consecutive words. If you want to search for multiple words that are scattered in different places in an article, the simple engine above will not work.

The most direct fix is to split the corpus word by word, so that for each article we only need to keep the set of its words. According to Zipf's law, in a natural-language corpus the frequency of a word is inversely proportional to its rank in the frequency table, following a power-law distribution. Therefore, splitting the corpus into words can greatly improve our storage and search efficiency.
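As a quick illustration of my own (not part of the original code, and assuming 1.txt through 5.txt are in the working directory), you can count word frequencies across the five sample files and see that a handful of words dominate, which is exactly the skew that Zipf's law describes:

# Count word frequencies across the five files using the same simple tokenizer
# that is used later in parse_text_to_words().
import re
from collections import Counter

counter = Counter()
for file_path in ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt']:
    with open(file_path, 'r') as fin:
        text = fin.read().lower()
    counter.update(re.sub(r'[^\w]', ' ', text).split())

print(counter.most_common(10))   # a few very frequent words dominate the counts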

First, we implement a search engine based on the Bag of Words model (word bag model).

import re

class BOWEngine(SearchEngineBase):
    def __init__(self):
        super(BOWEngine, self).__init__()
        self.__id_to_words = {}

    def process_corpus(self, id, text):
        self.__id_to_words[id] = self.parse_text_to_words(text)

    def search(self, query):
        query_words = self.parse_text_to_words(query)
        result = []
        for id, words in self.__id_to_words.items():
            if self.query_match(query_words, words):
                result.append(id)
        return result

    @staticmethod
    def query_match(query_words, words):
        for query_word in query_words:
            if query_word not in words:
                return False
        return True

    @staticmethod
    def parse_text_to_words(text):
        text = re.sub(r'[^\w]', ' ', text)         # Use a regular expression to remove punctuation and newlines
        text = text.lower()                        # Convert to lowercase
        word_list = text.split(' ')                # Split into words
        word_list = filter(None, word_list)        # Remove empty strings
        return set(word_list)                      # Return the set of words

search = BOWEngine()
main(search)

Output:

>>>will to join
found 2 result(s):
2.txt
5.txt
>>>will Free god
found 1 result(s):
5.txt
>>>

Let's first understand a concept: the BOW model (Bag of Words model), one of the most common and simplest models in the NLP field. It treats a text, regardless of grammar, syntax, paragraphs, or the order in which words appear, simply as a collection of its words. Accordingly, we replace id_to_texts with id_to_words, so we only need to store these words rather than the whole article, and we no longer care about their order.

The process_corpus() function calls the static method parse_text_to_words() to break the article into a bag of words, put it into a set, and store the set in the dictionary.
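For instance (a quick illustration of my own, using the class defined above), a short sentence becomes a plain set of lowercase words, with duplicates and word order discarded:

words = BOWEngine.parse_text_to_words("I have a dream today. I have a dream!")
print(words)   # e.g. {'i', 'have', 'a', 'dream', 'today'}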

The search() function is slightly more complicated. We assume that all the words we want to search for must appear in the same article. We break the query into a set and check, for every article in the index, whether every word in that set is present; the static method query_match() is responsible for this check. Both methods are static: they do not touch any attributes of the object, and the same input always produces the same output, so making them static also lets other classes reuse them easily.

However, each query still has to traverse all the IDs. Although the bag-of-words model already saves a lot compared to the simple engine, the cost of traversing hundreds of millions of pages on the Internet is still far too high. So how do we optimize? Notice that a query usually contains only a few words, rarely more than a dozen, so we can start from there. Also, the bag-of-words model ignores the order of words, but some users want the words to appear in order, or want the matched words to be close to each other in the text; for these cases the bag-of-words model is powerless. How do we address these two points? Here's the code.

import re

class BOWInvertedIndexEngine(SearchEngineBase):
    def __init__(self):
        super(BOWInvertedIndexEngine, self).__init__()
        self.inverted_index = {}

    def process_corpus(self, id, text):
        words = self.parse_text_to_words(text)
        for word in words:
            if word not in self.inverted_index:
                self.inverted_index[word] = []
            self.inverted_index[word].append(id)

    def search(self, query):
        query_words = list(self.parse_text_to_words(query))
        query_words_index = list()
        for query_word in query_words:
            query_words_index.append(0)

        # If any query word has no inverted list, return an empty result immediately
        for query_word in query_words:
            if query_word not in self.inverted_index:
                return []

        result = []
        while True:
            # First, read the element each inverted list currently points to
            current_ids = []
            for idx, query_word in enumerate(query_words):
                current_index = query_words_index[idx]
                current_inverted_list = self.inverted_index[query_word]

                # If we have reached the end of any inverted list, the search is over
                if current_index >= len(current_inverted_list):
                    return result
                current_ids.append(current_inverted_list[current_index])

            # If all elements of current_ids are the same, every query word appears in that document
            if all(x == current_ids[0] for x in current_ids):
                result.append(current_ids[0])
                query_words_index = [x + 1 for x in query_words_index]
                continue

            # Otherwise, advance the pointer of the smallest element by 1
            min_val = min(current_ids)
            min_val_pos = current_ids.index(min_val)
            query_words_index[min_val_pos] += 1

    @staticmethod
    def parse_text_to_words(text):
        text = re.sub(r'[^\w]', ' ', text)         # Use a regular expression to remove punctuation and newlines
        text = text.lower()                        # Convert to lowercase
        word_list = text.split(' ')                # Split into words
        word_list = filter(None, word_list)        # Remove empty strings
        return set(word_list)                      # Return the set of words


search_engine = BOWInvertedIndexEngine()
main(search_engine)

First of all, the code is relatively straightforward. You don't need to understand the algorithm fully this time; the point of this example is to show how object-oriented programming isolates the complexity of the algorithm while leaving the other interfaces unchanged. You can see that the new engine keeps using the previous interface and only modifies the three methods __init__(), process_corpus() and search().

This is also how teamwork is organized in large companies: after a reasonable layered design, the logic of each layer only needs to deal with its own concerns. Throughout the iterative upgrades of our search engine core, the main() function and the user interface have not changed at all.

Continuing with the code, notice the "InvertedIndex" at the start of the class name. This is a new model, the Inverted Index model, a very famous search engine technique.

An inverted index means that this time we store things the other way around: a dictionary of the form word -> [IDs of the documents containing it]. When searching, we only need to take the inverted lists of the query words separately and find their common elements; those common IDs are exactly the results we want. This avoids the embarrassment of scanning the whole index for every query.
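As a quick illustration of my own (a few entries the sample files would produce, not the complete index), the dictionary built by process_corpus() looks like this:

# word -> sorted list of document IDs that contain it
inverted_index = {
    'dream':   ['1.txt', '2.txt', '3.txt'],
    'alabama': ['2.txt'],
    'freedom': ['4.txt', '5.txt'],
}
# A query for "dream alabama" intersects the lists for 'dream' and 'alabama',
# giving ['2.txt'].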

The search() function first fetches the inverted lists for all the query_words; if one of them does not exist, some query word appears in no article and we return an empty list right away. Once we have the lists, we run an algorithm that merges K sorted arrays to obtain the IDs we want. The algorithm used here is not optimal; the best approach is to use a min-heap to keep track of the indexes. If you are interested you can look it up; it is not covered in detail here.
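As a rough sketch of that idea (my own illustration, not the author's implementation), you can let heapq.merge() do the heap-based K-way merge and keep the IDs that occur once in every list. This assumes what holds in our engine: each inverted list is sorted and contains each document ID at most once.

import heapq

def intersect_sorted_lists(lists):
    # Merge the K sorted lists with a min-heap and keep IDs that appear
    # exactly K times, i.e. once in every list.
    k = len(lists)
    result = []
    count, prev = 0, None
    for doc_id in heapq.merge(*lists):
        if doc_id == prev:
            count += 1
        else:
            prev, count = doc_id, 1
        if count == k:
            result.append(doc_id)
    return result

print(intersect_sorted_lists([['1.txt', '2.txt', '3.txt'], ['2.txt', '5.txt']]))  # ['2.txt']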

The second question: what if we want the search words to appear in order, or want the matched words to be close to each other in the text?

In that case the inverted index also needs to keep, for each article, the positions of each word, so that the merge step can do some additional checks. A sketch of this idea follows.
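Here is a minimal sketch of that idea (my own illustration with assumed names, not the original code): the index maps each word to a list of (document ID, position) pairs, which the merge step could then use for phrase or proximity checks.

import re

def build_positional_index(docs):
    # docs: dict of id -> text. Returns word -> list of (id, position) pairs.
    index = {}
    for doc_id, text in docs.items():
        words = re.sub(r'[^\w]', ' ', text).lower().split()
        for position, word in enumerate(words):
            index.setdefault(word, []).append((doc_id, position))
    return index

index = build_positional_index({'1.txt': 'I have a dream today'})
print(index['dream'])   # [('1.txt', 3)] -- 'dream' is the word at position 3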

Finally, let's talk about LRU and multiple inheritance.

At this point, our search engine could go online, but as traffic (QPS) grows, the server becomes a bit overwhelmed. After a while we find that a large amount of repeated searches account for more than 90% of the traffic, so we decide to add a heavy weapon to this search engine: a cache.

import pylru

class LRUCache(object):
    def __init__(self, size=32):
        self.cache = pylru.lrucache(size)

    def has(self, key):
        return key in self.cache

    def get(self, key):
        return self.cache[key]

    def set(self, key, value):
        self.cache[key] = value

class BOWInvertedIndexEngineWithCache(BOWInvertedIndexEngine, LRUCache):
    def __init__(self):
        super(BOWInvertedIndexEngineWithCache, self).__init__()
        LRUCache.__init__(self)

    def search(self, query):
        if self.has(query):
            return self.get(query)

        result = super(BOWInvertedIndexEngineWithCache, self).search(query)
        self.set(query, result)

        return result

search_engine = BOWInvertedIndexEngineWithCache()
main(search_engine)

We start by defining a cache class, LRUCache, and make its methods available by inheriting from it. The LRU cache is a very classic kind of cache; for simplicity we call the pylru package directly. It follows the principle of locality: it keeps the most recently used objects and gradually evicts objects that have not been used for a long time. So in the search() function we first use has() to check whether the query is in the cache; if it is, we call get() to return the cached result directly; if not, we run the real search, put the result into the cache, and return it.
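If you don't want to depend on pylru, a minimal sketch of the same has/get/set interface can be built on collections.OrderedDict (my own illustration, not the article's code):

from collections import OrderedDict

class SimpleLRUCache(object):
    # Hypothetical drop-in alternative to the pylru-based LRUCache above.
    def __init__(self, size=32):
        self.size = size
        self.cache = OrderedDict()

    def has(self, key):
        return key in self.cache

    def get(self, key):
        self.cache.move_to_end(key)          # mark as most recently used
        return self.cache[key]

    def set(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.size:
            self.cache.popitem(last=False)   # evict the least recently used entry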

The BOWInvertedIndexEngineWithCache class inherits from two classes; this is multiple inheritance. There are two points to note about initialization under multiple inheritance.

First, the following line directly initializes the first parent class of the class:

super(BOWInvertedIndexEngineWithCache, self).__init__()

However, this approach requires the top-level parent class of the inheritance chain to inherit from object.

As an aside, I recall that in Python 3 this is no longer necessary (it relates to old-style vs. new-style classes; you can look that up), and you can drop the class name and simply write:

super().__init__()

Second, with multiple inheritance, if there are multiple parent constructors that need to be called, we must call the other parent classes' constructors explicitly in the traditional way:

LRUCache.__init__(self)
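
If you want to see which constructor super() would pick, you can inspect the class's method resolution order (a quick illustration of my own):

# Print the MRO: the chain Python walks when resolving super() calls.
for cls in BOWInvertedIndexEngineWithCache.__mro__:
    print(cls.__name__)
# BOWInvertedIndexEngineWithCache, BOWInvertedIndexEngine, SearchEngineBase,
# LRUCache, object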

Finally, we can also call a parent class's method explicitly. We have overridden search() in the subclass, but we still want to call the parent class's search(), which we can do like this:

result = super(BOWInvertedIndexEngineWithCache, self).search(query)

Finally, here is a question to leave you with: can private variables be inherited?

class A():
    def __init__(self):
        self.__a = 'A private variable a'
        self.b = 'b'

    def fun(self):
        return self.__a   # Return the value of the private variable through a method

class B(A):
    def __init__(self):
        super().__init__()
        print(self.b)
        self.data = self.fun()   # Get the value of the private variable indirectly through the inherited method
        print(self.data)

b = B()
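
A hint (my own note, not from the original post): the private attribute is stored under a name-mangled key, so the subclass cannot read self.__a directly, but the mangled name still exists on the instance:

print(b.__dict__)    # the key is '_A__a' (name mangling), not '__a'
print(b._A__a)       # works, but accessing the mangled name directly is bad style
# print(b.__a)       # would raise AttributeError -- __a is not inherited as such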

That's all for today!
