DL based on Python Day 2: text preprocessing


1. Text preprocessing

Text is a kind of sequence data. An article can be regarded as a sequence of characters or words. Text preprocessing usually includes four steps:
1. Read in text
2. Tokenization
3. Establish a dictionary to map each word to a unique index
4. Transform the text from the sequence of words to the sequence of indexes to facilitate the input of models

Read in text

We use an English novel, The Time Machine by H. G. Wells, as an example to show the concrete steps of text preprocessing.

import collections
import re

def read_time_machine():
    with open('/home/kesci/input/timemachine7163/timemachine.txt', 'r') as f:
        # Lowercase each line and replace every run of non-letter characters with a space
        lines = [re.sub('[^a-z]+', ' ', line.strip().lower()) for line in f]
    return lines

lines = read_time_machine()
print('# sentences %d' % len(lines))
# sentences 3221
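To see what the regular expression in `read_time_machine` does, here is a short sketch on a hypothetical raw line: every run of characters that are not lowercase letters (punctuation, digits, extra spaces) is replaced by a single space.

```python
import re

line = "The Time Machine, by H. G. Wells [1898]"  # a made-up sample line
cleaned = re.sub('[^a-z]+', ' ', line.strip().lower())
print(cleaned)
# the time machine by h g wells 
```

Note the trailing space: splitting such a line on spaces later yields an empty-string token, which is why `''` appears in the tokenized output below.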


Each sentence is split into several words (tokens) and converted into a sequence of words.

def tokenize(sentences, token='word'):
    """Split sentences into word or char tokens"""
    if token == 'word':
        return [sentence.split(' ') for sentence in sentences]
    elif token == 'char':
        return [list(sentence) for sentence in sentences]
    else:
        print('ERROR: unknown token type ' + token)

tokens = tokenize(lines)
print(tokens[0:2])
#[['the', 'time', 'machine', 'by', 'h', 'g', 'wells', ''], ['']]
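With `token='char'`, each sentence is instead split into its individual characters (spaces included); the core operation is just `list(sentence)`:

```python
sentence = 'the time'  # a made-up sample sentence
print(list(sentence))
# ['t', 'h', 'e', ' ', 't', 'i', 'm', 'e']
```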

Build a dictionary

To facilitate model processing, we need to convert strings to numbers. So we need to build a vocabulary first, mapping each word to a unique index number.

class Vocab(object):
    def __init__(self, tokens, min_freq=0, use_special_tokens=False):
        # tokens: a list of tokenized sentences (each a list of words)
        # min_freq: words occurring fewer than min_freq times in the corpus are dropped
        # use_special_tokens: whether to reserve special tokens
        counter = count_corpus(tokens)  # dict mapping each token to its frequency
        self.token_freqs = list(counter.items())
        self.idx_to_token = []
        if use_special_tokens:
            # When training the model, each batch is a two-dimensional matrix, so every
            # row must contain the same number of elements; shorter rows are filled up
            # with the padding token.
            # <bos> / <eos> mark the beginning and end of a sentence when needed.
            # Words that never appeared in the corpus are "out-of-vocabulary" words,
            # represented by the unknown token.
            # padding, begin of sentence, end of sentence, unknown
            self.pad, self.bos, self.eos, self.unk = (0, 1, 2, 3)
            self.idx_to_token += ['<pad>', '<bos>', '<eos>', '<unk>']
        else:
            self.unk = 0
            self.idx_to_token += ['<unk>']
        self.idx_to_token += [token for token, freq in self.token_freqs
                              if freq >= min_freq and token not in self.idx_to_token]
        self.token_to_idx = dict()
        for idx, token in enumerate(self.idx_to_token):
            self.token_to_idx[token] = idx

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        '''Word to index mapping'''
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        '''Given index return word'''
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

def count_corpus(sentences):
    tokens = [tk for st in sentences for tk in st]
    return collections.Counter(tokens)  # Returns a dictionary that records the number of occurrences of each word
vocab = Vocab(tokens)
print(list(vocab.token_to_idx.items())[0:10])
#[('<unk>', 0), ('the', 1), ('time', 2), ('machine', 3), ('by', 4), ('h', 5), ('g', 6), ('wells', 7), ('i', 8), ('traveller', 9)]
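At its core, building the vocabulary is just a frequency count followed by index assignment. A minimal self-contained sketch of what `Vocab(tokens)` computes, on a made-up two-sentence corpus with `min_freq=0` and no special tokens:

```python
import collections

tokens = [['the', 'time', 'machine'], ['the', 'time', 'traveller']]  # toy corpus
counter = collections.Counter(tk for st in tokens for tk in st)
# Reserve index 0 for the unknown token, then index the remaining
# words in order of first appearance
idx_to_token = ['<unk>'] + [tk for tk, freq in counter.items() if freq >= 0]
token_to_idx = {tk: idx for idx, tk in enumerate(idx_to_token)}
print(token_to_idx)
# {'<unk>': 0, 'the': 1, 'time': 2, 'machine': 3, 'traveller': 4}
```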

Turn words into indexes

Using the vocabulary, we can convert sentences in the original text from sequences of words to sequences of indices.

for i in range(8, 10):
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])
#words: ['the', 'time', 'traveller', 'for', 'so', 'it', 'will', 'be', 'convenient', 'to', 'speak', 'of', 'him', '']
#indices: [1, 2, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 0]
#words: ['was', 'expounding', 'a', 'recondite', 'matter', 'to', 'us', 'his', 'grey', 'eyes', 'shone', 'and']
#indices: [20, 21, 22, 23, 24, 16, 25, 26, 27, 28, 29, 30]
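The reverse mapping (`to_tokens`) recovers words from indices; out-of-vocabulary words collapse to the unknown token. A small self-contained sketch ('flux' is a made-up out-of-vocabulary example):

```python
idx_to_token = ['<unk>', 'the', 'time', 'machine']  # toy vocabulary
token_to_idx = {tk: idx for idx, tk in enumerate(idx_to_token)}

words = ['the', 'time', 'machine', 'flux']         # 'flux' is not in the vocabulary
indices = [token_to_idx.get(w, 0) for w in words]  # unknown words map to index 0
print(indices)
# [1, 2, 3, 0]
print([idx_to_token[i] for i in indices])
# ['the', 'time', 'machine', '<unk>']
```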

2. Tokenization with existing tools

The simple split-based tokenization above has drawbacks: it discards punctuation, and it cannot handle contractions such as "doesn't" or abbreviations such as "Mr.". Existing NLP libraries like spaCy and NLTK handle these cases.


import spacy
nlp = spacy.load('en_core_web_sm')
text = "Mr. Chen doesn't agree with my suggestion."
doc = nlp(text)
print([token.text for token in doc])
#['Mr.', 'Chen', 'does', "n't", 'agree', 'with', 'my', 'suggestion', '.']


from nltk.tokenize import word_tokenize
# nltk.download('punkt') may be required on first use
text = "Mr. Chen doesn't agree with my suggestion."
print(word_tokenize(text))
#['Mr.', 'Chen', 'does', "n't", 'agree', 'with', 'my', 'suggestion', '.']

Posted by X74SY on Fri, 14 Feb 2020 06:30:27 -0800