1. Text preprocessing
Text is a kind of sequence data: an article can be regarded as a sequence of characters or words. Text preprocessing usually includes four steps (a compact sketch of the whole pipeline follows this list):
1. Read in the text
2. Tokenize it
3. Build a dictionary that maps each word to a unique index
4. Convert the text from a sequence of words to a sequence of indices, so it can be fed into models
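Before walking through each step in detail, here is a minimal, self-contained sketch of all four steps on a toy two-line corpus, using only the standard library; the variable names here are ours for illustration and do not appear in the sections below.

import collections

raw_text = ['the time machine', 'the time traveller']             # 1. read in text
toy_tokens = [line.split(' ') for line in raw_text]               # 2. tokenize
counter = collections.Counter(tk for line in toy_tokens for tk in line)
idx_to_token = list(counter)                                      # 3. build a dictionary
token_to_idx = {tk: idx for idx, tk in enumerate(idx_to_token)}
indices = [[token_to_idx[tk] for tk in line] for line in toy_tokens]  # 4. words -> indices
print(indices)
# [[0, 1, 2], [0, 1, 3]]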
Read in text
We use an English novel, The Time Machine by H. G. Wells, as an example to show the specific process of text preprocessing.
import collections
import re

def read_time_machine():
    with open('/home/kesci/input/timemachine7163/timemachine.txt', 'r') as f:
        # keep only lowercase letters; every other character becomes a space
        lines = [re.sub('[^a-z]+', ' ', line.strip().lower()) for line in f]
    return lines

lines = read_time_machine()
print('# sentences %d' % len(lines))
# sentences 3221
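To see what the cleaning step does, here is the same regex applied by hand to the (already lowercased and stripped) first line of the novel; every run of characters that are not lowercase letters collapses into a single space:

import re

re.sub('[^a-z]+', ' ', 'the time machine, by h. g. wells [1898]')
# 'the time machine by h g wells '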
Tokenization
Each sentence is split into several tokens (words or characters), converting it into a sequence of words.
def tokenize(sentences, token='word'):
    """Split sentences into word or char tokens."""
    if token == 'word':
        return [sentence.split(' ') for sentence in sentences]
    elif token == 'char':
        return [list(sentence) for sentence in sentences]
    else:
        print('ERROR: unknown token type ' + token)

tokens = tokenize(lines)
tokens[0:2]
# Out:
# [['the', 'time', 'machine', 'by', 'h', 'g', 'wells', ''], ['']]
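The 'char' mode is useful for character-level language models. A quick usage example on the same lines (the output is derivable from the first sentence shown above):

chars = tokenize(lines, token='char')
chars[0][:20]
# ['t', 'h', 'e', ' ', 't', 'i', 'm', 'e', ' ', 'm', 'a', 'c', 'h', 'i', 'n', 'e', ' ', 'b', 'y', ' ']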
Build a dictionary
To facilitate model processing, we need to convert strings to numbers, so we first build a vocabulary that maps each word to a unique index number.
class Vocab(object):
    def __init__(self, tokens, min_freq=0, use_special_tokens=False):
        # tokens: tokenized sentences, i.e. a list of lists of words
        # min_freq: minimum number of occurrences in the corpus; rarer words are dropped
        # use_special_tokens: whether the special tokens below are needed
        counter = count_corpus(tokens)  # <word, frequency> pairs
        self.token_freqs = list(counter.items())
        self.idx_to_token = []
        if use_special_tokens:
            # padding: when training the model, each batch is a two-dimensional
            #   matrix, so every row must contain the same number of elements;
            #   shorter rows are filled with the padding token to match the rest
            # begin of sentence / end of sentence: mark where a sentence starts and ends
            # unknown: words that never appeared in the corpus are
            #   "out-of-vocabulary words", represented by the unknown token
            self.pad, self.bos, self.eos, self.unk = (0, 1, 2, 3)
            self.idx_to_token += ['<pad>', '<bos>', '<eos>', '<unk>']
        else:
            self.unk = 0
            self.idx_to_token += ['<unk>']
        # words below min_freq and empty strings (left over from splitting) are skipped
        self.idx_to_token += [token for token, freq in self.token_freqs
                              if freq >= min_freq and token != ''
                              and token not in self.idx_to_token]
        self.token_to_idx = dict()
        for idx, token in enumerate(self.idx_to_token):
            self.token_to_idx[token] = idx

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        # word-to-index mapping; unknown words map to self.unk
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        # index-to-word mapping, the inverse of __getitem__
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

def count_corpus(sentences):
    tokens = [tk for st in sentences for tk in st]
    return collections.Counter(tokens)  # a dictionary recording the number of occurrences of each word

# Example:
vocab = Vocab(tokens)
print(list(vocab.token_to_idx.items())[0:10])
# [('<unk>', 0), ('the', 1), ('time', 2), ('machine', 3), ('by', 4), ('h', 5), ('g', 6), ('wells', 7), ('i', 8), ('traveller', 9)]
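As a quick check of the special-token branch, here is a sketch that builds a second vocabulary with use_special_tokens=True (it assumes the Vocab class and tokens above; with min_freq=2, words occurring only once are dropped):

vocab_special = Vocab(tokens, min_freq=2, use_special_tokens=True)
print(vocab_special.to_tokens([0, 1, 2, 3]))
# ['<pad>', '<bos>', '<eos>', '<unk>']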
Turn words into indices
Using the dictionary, we can convert sentences in the original text from sequences of words to sequences of indices.
for i in range(8, 10):
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])
# words: ['the', 'time', 'traveller', 'for', 'so', 'it', 'will', 'be', 'convenient', 'to', 'speak', 'of', 'him', '']
# indices: [1, 2, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 0]
# words: ['was', 'expounding', 'a', 'recondite', 'matter', 'to', 'us', 'his', 'grey', 'eyes', 'shone', 'and']
# indices: [20, 21, 22, 23, 24, 16, 25, 26, 27, 28, 29, 30]
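Since to_tokens is the inverse of the word-to-index lookup, we can also map indices back to words. Note that the empty string at the end of sentence 8 is not in the vocabulary, so the round trip returns '<unk>' in its place (a quick check, assuming the vocab built above):

indices = vocab[tokens[8]]
print(vocab.to_tokens(indices))
# ['the', 'time', 'traveller', 'for', 'so', 'it', 'will', 'be', 'convenient', 'to', 'speak', 'of', 'him', '<unk>']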
2. Tokenization with tools
The simple tokenization above has at least two drawbacks: it throws away punctuation, and it mishandles abbreviations and contractions such as "Mr." and "doesn't". Existing tools handle these cases well; two popular ones are spaCy and NLTK.
spaCy:
import spacy

nlp = spacy.load('en_core_web_sm')
text = "Mr. Chen doesn't agree with my suggestion."
doc = nlp(text)
print([token.text for token in doc])
# ['Mr.', 'Chen', 'does', "n't", 'agree', 'with', 'my', 'suggestion', '.']
NLTK:
from nltk.tokenize import word_tokenize
from nltk import data

data.path.append('/home/kesci/input/nltk_data3784/nltk_data')
text = "Mr. Chen doesn't agree with my suggestion."
print(word_tokenize(text))
# ['Mr.', 'Chen', 'does', "n't", 'agree', 'with', 'my', 'suggestion', '.']
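For comparison, the naive whitespace split from Section 1 keeps the period glued to "suggestion" and leaves the contraction unsplit, which is exactly what the tools above fix:

text = "Mr. Chen doesn't agree with my suggestion."
print(text.split(' '))
# ['Mr.', 'Chen', "doesn't", 'agree', 'with', 'my', 'suggestion.']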