Python Core Technology and Practice: Object-Oriented Case Analysis

Keywords: Python, programming, database

Today, through an object-oriented case study, we will simulate the iterative development process used in agile development and consolidate object-oriented programming ideas.

We start with the simplest possible search and optimize it step by step. First, we need to know the structure of a search engine: crawler, indexer, retriever, and user interface. The crawler fetches a large amount of content from websites across the Internet and sends it to the indexer. After the indexer receives the pages and their content, it processes them into an index, stores the index in an internal database, and waits for retrieval requests. The user interface is the front end, a web page or an app: users send queries to the search engine through the interface, the query is parsed and handed to the retriever, and after retrieval the results are returned to the users.

Instead of focusing on the crawler, let's assume that our corpus consists of five files on a local disk.

# 1.txt 
I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.

# 2.txt
I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.

# 3.txt
I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

# 4.txt
This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .

# 5.txt
And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"

Let's first define a base class

class SearchEngineBase(object):
    def __init__(self):
        pass

    def add_corpus(self, file_path):     # Read the contents of the specified file
        with open(file_path, 'r') as fin:
            text = fin.read()
        self.process_corpus(file_path, text)

    # The following two methods raise an error if they are not overridden in a subclass
    def process_corpus(self, id, text):
        raise Exception('process_corpus not implemented.')

    def search(self, query):
        raise Exception('search not implemented.')


def main(search_engine):

    # First specify the files to be indexed
    for file_path in ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt']:
        search_engine.add_corpus(file_path)

    while True:
        query = input(">>>")
        results = search_engine.search(query)
        print('found {} result(s):'.format(len(results)))
        for result in results:
            print(result)      # Only file names are printed as results

SearchEngineBase is a base class that can be inherited by engines using different algorithms. Each engine must implement the two methods process_corpus() and search(), which correspond to the indexer and the retriever mentioned above. The main() function feeds the corpus to the engine and provides the user interface, giving us a simple, complete wrapper.

Let's analyze the code:

add_corpus() reads the contents of a file, takes the file path as the ID, and passes both to process_corpus().

process_corpus() processes the content and stores the processed content under the file path (the ID). The processed content is called the index.

search() takes a query, processes it, searches the index, and returns the matching results.

Now let's build the simplest possible search engine (it only needs to implement the two required methods).

class SimpleEngine(SearchEngineBase):
    def __init__(self):
        super(SimpleEngine, self).__init__()
        self.__id_to_texts = {}

    def process_corpus(self, id, text):
        self.__id_to_texts[id] = text    # Build a dictionary: key = file name, value = file content

    def search(self, query):             # Brute-force retrieval
        results = []
        for id, text in self.__id_to_texts.items():
            if query in text:            # Traverse the dictionary
                results.append(id)
        return results                   # Don't forget the return value, or main() will keep failing

search_engine = SimpleEngine()
main(search_engine)

Output:

>>>a
found 4 result(s):
1.txt
2.txt
3.txt
5.txt
>>>

When we enter a character, we get the corresponding output. Let's walk through the code.

SimpleEngine is a subclass of SearchEngineBase: it implements the process_corpus() and search() interfaces and also inherits the add_corpus() method (which could also be overridden), so we can call it directly in main().

In our new constructor

super(SimpleEngine, self).__init__()   # Call the parent class's constructor
self.__id_to_texts = {}                # Initialize a new attribute

The newly initialized dictionary maps file names to file contents.

process_corpus() simply inserts the file content into the dictionary. Note that the ID must be unique here; otherwise a later entry with the same ID will overwrite the old content.

search() simply enumerates the dictionary looking for the query string, and appends the ID to the result list whenever the string is found.

Let's draw a dividing line here and look at a slightly more complex search engine. The first version is the simplest approach, but it is obviously very inefficient: it uses a lot of space, because process_corpus() does essentially no processing; and every search takes a lot of time, because all the files in the index have to be scanned again for each query. If the size of the corpus is n, then both the time and space complexity are O(n).

Another problem is that the query can only be a single word or several consecutive words. If you want to search for multiple words that are scattered in different places in an article, the simple engine above will not work.

The most direct fix is to split the corpus word by word, so that for each article we only need to keep the set of its words. According to Zipf's law, in a natural-language corpus the frequency of a word is inversely proportional to its rank in the frequency table, following a power-law distribution. Therefore, splitting the corpus into words can greatly improve our storage and search efficiency.
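As a quick illustration of my own (not part of the original code, and assuming 1.txt through 5.txt are in the working directory), you can count word frequencies across the five sample files and see that a handful of words dominate, which is exactly the skew that Zipf's law describes:

# Count word frequencies across the five files using the same simple tokenizer
# that is used later in parse_text_to_words().
import re
from collections import Counter

counter = Counter()
for file_path in ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt']:
    with open(file_path, 'r') as fin:
        text = fin.read().lower()
    counter.update(re.sub(r'[^\w]', ' ', text).split())

print(counter.most_common(10))   # a few very frequent words dominate the counts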

First, we implement a search engine based on the Bag of Words model (word bag model).

import re

class BOWEngine(SearchEngineBase):
    def __init__(self):
        super(BOWEngine, self).__init__()
        self.__id_to_words = {}

    def process_corpus(self, id, text):
        self.__id_to_words[id] = self.parse_text_to_words(text)

    def search(self, query):
        query_words = self.parse_text_to_words(query)
        result = []
        for id, words in self.__id_to_words.items():
            if self.query_match(query_words, words):
                result.append(id)
        return result

    @staticmethod
    def query_match(query_words, words):
        for query_word in query_words:
            if query_word not in words:
                return False
        return True

    @staticmethod
    def parse_text_to_words(text):
        text = re.sub(r'[^\w]', ' ', text)         # Use a regular expression to remove punctuation and newlines
        text = text.lower()                        # Convert to lowercase
        word_list = text.split(' ')                # Split into words
        word_list = filter(None, word_list)        # Remove empty strings
        return set(word_list)                      # Return the set of words

search = BOWEngine()
main(search)

Output:

>>>will to join
found 2 result(s):
2.txt
5.txt
>>>will Free god
found 1 result(s):
5.txt
>>>

Let's first understand a concept: the BOW model (Bag of Words model), one of the most common and simplest models in the NLP field. It treats a text, regardless of grammar, syntax, paragraphs, or the order in which words appear, simply as a collection of its words. Accordingly, we replace id_to_texts with id_to_words, so we only need to store these words rather than the whole article, and we no longer care about their order.

The process_corpus() function calls the static method parse_text_to_words() to break the article into a bag of words, put it into a set, and store the set in the dictionary.
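For instance (a quick illustration of my own, using the class defined above), a short sentence becomes a plain set of lowercase words, with duplicates and word order discarded:

words = BOWEngine.parse_text_to_words("I have a dream today. I have a dream!")
print(words)   # e.g. {'i', 'have', 'a', 'dream', 'today'}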

The search() function is slightly more complicated. We assume that all the words we want to search for must appear in the same article. We break the query into a set and check, for every article in the index, whether every word in that set is present; the static method query_match() is responsible for this check. Both methods are static: they do not touch any attributes of the object, and the same input always produces the same output, so making them static also lets other classes reuse them easily.

However, each query still has to traverse all the IDs. Although the bag-of-words model already saves a lot compared to the simple engine, the cost of traversing hundreds of millions of pages on the Internet is still far too high. So how do we optimize? Notice that a query usually contains only a few words, rarely more than a dozen, so we can start from there. Also, the bag-of-words model ignores the order of words, but some users want the words to appear in order, or want the matched words to be close to each other in the text; for these cases the bag-of-words model is powerless. How do we address these two points? Here's the code.

import re

class BOWInvertedIndexEngine(SearchEngineBase):
    def __init__(self):
        super(BOWInvertedIndexEngine, self).__init__()
        self.inverted_index = {}

    def process_corpus(self, id, text):
        words = self.parse_text_to_words(text)
        for word in words:
            if word not in self.inverted_index:
                self.inverted_index[word] = []
            self.inverted_index[word].append(id)

    def search(self, query):
        query_words = list(self.parse_text_to_words(query))
        query_words_index = list()
        for query_word in query_words:
            query_words_index.append(0)

        # If any query word has no inverted list, return an empty result immediately
        for query_word in query_words:
            if query_word not in self.inverted_index:
                return []

        result = []
        while True:
            # First, read the element each inverted list currently points to
            current_ids = []
            for idx, query_word in enumerate(query_words):
                current_index = query_words_index[idx]
                current_inverted_list = self.inverted_index[query_word]

                # If we have reached the end of any inverted list, the search is over
                if current_index >= len(current_inverted_list):
                    return result
                current_ids.append(current_inverted_list[current_index])

            # If all elements of current_ids are the same, every query word appears in that document
            if all(x == current_ids[0] for x in current_ids):
                result.append(current_ids[0])
                query_words_index = [x + 1 for x in query_words_index]
                continue

            # Otherwise, advance the pointer of the smallest element by 1
            min_val = min(current_ids)
            min_val_pos = current_ids.index(min_val)
            query_words_index[min_val_pos] += 1

    @staticmethod
    def parse_text_to_words(text):
        text = re.sub(r'[^\w]', ' ', text)         # Use a regular expression to remove punctuation and newlines
        text = text.lower()                        # Convert to lowercase
        word_list = text.split(' ')                # Split into words
        word_list = filter(None, word_list)        # Remove empty strings
        return set(word_list)                      # Return the set of words


search_engine = BOWInvertedIndexEngine()
main(search_engine)

First of all, the code is relatively straightforward. You don't need to understand the algorithm fully this time; the point of this example is to show how object-oriented programming isolates the complexity of the algorithm while leaving the other interfaces unchanged. You can see that the new engine keeps using the previous interface and only modifies the three methods __init__(), process_corpus() and search().

This is also how teamwork is organized in large companies: after a reasonable layered design, the logic of each layer only needs to deal with its own concerns. Throughout the iterative upgrades of our search engine core, the main() function and the user interface have not changed at all.

Continuing with the code, notice the "InvertedIndex" at the start of the class name. This is a new model, the Inverted Index model, a very famous search engine technique.

An inverted index means that this time we store things the other way around: a dictionary of the form word -> [IDs of the documents containing it]. When searching, we only need to take the inverted lists of the query words separately and find their common elements; those common IDs are exactly the results we want. This avoids the embarrassment of scanning the whole index for every query.
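As a quick illustration of my own (a few entries the sample files would produce, not the complete index), the dictionary built by process_corpus() looks like this:

# word -> sorted list of document IDs that contain it
inverted_index = {
    'dream':   ['1.txt', '2.txt', '3.txt'],
    'alabama': ['2.txt'],
    'freedom': ['4.txt', '5.txt'],
}
# A query for "dream alabama" intersects the lists for 'dream' and 'alabama',
# giving ['2.txt'].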

The search() function first fetches the inverted lists for all the query_words; if one of them does not exist, some query word appears in no article and we return an empty list right away. Once we have the lists, we run an algorithm that merges K sorted arrays to obtain the IDs we want. The algorithm used here is not optimal; the best approach is to use a min-heap to keep track of the indexes. If you are interested you can look it up; it is not covered in detail here.
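As a rough sketch of that idea (my own illustration, not the author's implementation), you can let heapq.merge() do the heap-based K-way merge and keep the IDs that occur once in every list. This assumes what holds in our engine: each inverted list is sorted and contains each document ID at most once.

import heapq

def intersect_sorted_lists(lists):
    # Merge the K sorted lists with a min-heap and keep IDs that appear
    # exactly K times, i.e. once in every list.
    k = len(lists)
    result = []
    count, prev = 0, None
    for doc_id in heapq.merge(*lists):
        if doc_id == prev:
            count += 1
        else:
            prev, count = doc_id, 1
        if count == k:
            result.append(doc_id)
    return result

print(intersect_sorted_lists([['1.txt', '2.txt', '3.txt'], ['2.txt', '5.txt']]))  # ['2.txt']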

The second question: what if we want the search words to appear in order, or want the matched words to be close to each other in the text?

In that case the inverted index also needs to keep, for each article, the positions of each word, so that the merge step can do some additional checks. A sketch of this idea follows.
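Here is a minimal sketch of that idea (my own illustration with assumed names, not the original code): the index maps each word to a list of (document ID, position) pairs, which the merge step could then use for phrase or proximity checks.

import re

def build_positional_index(docs):
    # docs: dict of id -> text. Returns word -> list of (id, position) pairs.
    index = {}
    for doc_id, text in docs.items():
        words = re.sub(r'[^\w]', ' ', text).lower().split()
        for position, word in enumerate(words):
            index.setdefault(word, []).append((doc_id, position))
    return index

index = build_positional_index({'1.txt': 'I have a dream today'})
print(index['dream'])   # [('1.txt', 3)] -- 'dream' is the word at position 3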

Finally, let's talk about LRU and multiple inheritance.

At this point, our search engine could go online, but as traffic (QPS) grows, the server becomes a bit overwhelmed. After a while we find that a large amount of repeated searches account for more than 90% of the traffic, so we decide to add a heavy weapon to this search engine: a cache.

import pylru

class LRUCache(object):
    def __init__(self, size=32):
        self.cache = pylru.lrucache(size)

    def has(self, key):
        return key in self.cache

    def get(self, key):
        return self.cache[key]

    def set(self, key, value):
        self.cache[key] = value

class BOWInvertedIndexEngineWithCache(BOWInvertedIndexEngine, LRUCache):
    def __init__(self):
        super(BOWInvertedIndexEngineWithCache, self).__init__()
        LRUCache.__init__(self)

    def search(self, query):
        if self.has(query):
            return self.get(query)

        result = super(BOWInvertedIndexEngineWithCache, self).search(query)
        self.set(query, result)

        return result

search_engine = BOWInvertedIndexEngineWithCache()
main(search_engine)

We start by defining a cache class, LRUCache, and make its methods available by inheriting from it. The LRU cache is a very classic kind of cache; for simplicity we call the pylru package directly. It follows the principle of locality: it keeps the most recently used objects and gradually evicts objects that have not been used for a long time. So in the search() function we first use has() to check whether the query is in the cache; if it is, we call get() to return the cached result directly; if not, we run the real search, put the result into the cache, and return it.
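If you don't want to depend on pylru, a minimal sketch of the same has/get/set interface can be built on collections.OrderedDict (my own illustration, not the article's code):

from collections import OrderedDict

class SimpleLRUCache(object):
    # Hypothetical drop-in alternative to the pylru-based LRUCache above.
    def __init__(self, size=32):
        self.size = size
        self.cache = OrderedDict()

    def has(self, key):
        return key in self.cache

    def get(self, key):
        self.cache.move_to_end(key)          # mark as most recently used
        return self.cache[key]

    def set(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.size:
            self.cache.popitem(last=False)   # evict the least recently used entry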

The BOWInvertedIndexEngineWithCache class inherits from two classes; this is multiple inheritance. There are two points to note about initialization under multiple inheritance.

First, the following line directly initializes the first parent class of the class:

super(BOWInvertedIndexEngineWithCache, self).__init__()

However, this approach requires the top-level parent class of the inheritance chain to inherit from object.

As an aside, I recall that in Python 3 this is no longer necessary (it relates to old-style vs. new-style classes; you can look that up), and you can drop the class name and simply write:

super().__init__()

Second, with multiple inheritance, if there are multiple parent constructors that need to be called, we must call the other parent classes' constructors explicitly in the traditional way:

LRUCache.__init__(self)
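
If you want to see which constructor super() would pick, you can inspect the class's method resolution order (a quick illustration of my own):

# Print the MRO: the chain Python walks when resolving super() calls.
for cls in BOWInvertedIndexEngineWithCache.__mro__:
    print(cls.__name__)
# BOWInvertedIndexEngineWithCache, BOWInvertedIndexEngine, SearchEngineBase,
# LRUCache, object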

Finally, we can also call a parent class's method explicitly. We have overridden search() in the subclass, but we still want to call the parent class's search(), which we can do like this:

result = super(BOWInvertedIndexEngineWithCache, self).search(query)

Finally, here is a question to leave you with: can private variables be inherited?

class A():
    def __init__(self):
        self.__a = 'A private variable a'
        self.b = 'b'

    def fun(self):
        return self.__a   # Return the value of the private variable through a method

class B(A):
    def __init__(self):
        super().__init__()
        print(self.b)
        self.data = self.fun()   # Get the value of the private variable indirectly through the inherited method
        print(self.data)

b = B()
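
A hint (my own note, not from the original post): the private attribute is stored under a name-mangled key, so the subclass cannot read self.__a directly, but the mangled name still exists on the instance:

print(b.__dict__)    # the key is '_A__a' (name mangling), not '__a'
print(b._A__a)       # works, but accessing the mangled name directly is bad style
# print(b.__a)       # would raise AttributeError -- __a is not inherited as such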

That's all for today!
