Text is a kind of sequence data. An article can be regarded as a sequence of characters or words.
Text preprocessing usually consists of four steps (a minimal end-to-end sketch follows this list):
- Read in text
- Tokenization
- Build a dictionary to map each word to a unique index
- Transform the text from a sequence of words into a sequence of indices, so it can be fed into a model
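As a preview, here is a minimal end-to-end sketch of the four steps on a toy two-line corpus. The toy strings and the throwaway dictionary comprehension are purely illustrative; the proper implementations follow in the rest of this section.

```python
# Illustrative preview only; the real read_time_machine, tokenize and Vocab are defined below
corpus = ["the time machine", "by h g wells"]                 # 1. read in text (toy data)
tokens = [line.split(' ') for line in corpus]                 # 2. tokenization
vocab = {tk: idx for idx, tk                                  # 3. map each word to a unique index
         in enumerate(sorted({tk for line in tokens for tk in line}))}
indices = [[vocab[tk] for tk in line] for line in tokens]     # 4. words -> indices
print(indices)
```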
Read in text
```python
import collections  # used for word-frequency statistics
import re           # regular expressions

def read_time_machine():
    with open('**File path+file name**.txt', 'r') as f:  # open the file
        # strip each line, convert it to lowercase, and replace runs of non-letter characters with a space
        lines = [re.sub('[^a-z]+', ' ', line.strip().lower()) for line in f]
    return lines  # return the list of cleaned lines

lines = read_time_machine()
print('# sentences %d' % len(lines))  # number of lines (sentences) read in
```
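To see what the cleanup line does, here is a small illustrative check on a single hand-written line: the text is stripped and lowercased, and every run of non-letter characters is replaced by one space.

```python
import re

# Illustrative only: the same cleanup applied to one hand-written line
line = "  The Time Traveller (for so it will be convenient to speak of him)  "
print(re.sub('[^a-z]+', ' ', line.strip().lower()))
# -> 'the time traveller for so it will be convenient to speak of him '
```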
Tokenization
Tokenization splits each sentence into individual words (tokens), turning it into a sequence of words.
```python
def tokenize(sentences, token='word'):  # tokenize by words by default
    if token == 'word':  # split into words
        return [sentence.split(' ') for sentence in sentences]
    elif token == 'char':  # split into characters
        return [list(sentence) for sentence in sentences]
    else:
        print('ERROR: unknown token type ' + token)

tokens = tokenize(lines)
tokens[0:2]
```
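For comparison, passing token='char' splits at the character level; a small illustrative call on a made-up sentence:

```python
# Illustrative only: character-level tokenization of a short made-up sentence
print(tokenize(['the time machine'], token='char')[0][:8])
# -> ['t', 'h', 'e', ' ', 't', 'i', 'm', 'e']
```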
Build a dictionary
To facilitate model processing, you need to convert strings to numbers. So you need to build a vocabulary first, mapping each word to a unique index number.
```python
class Vocab(object):
    def __init__(self, tokens, min_freq=0, use_special_tokens=False):
        counter = count_corpus(tokens)  # dictionary of word frequencies
        self.token_freqs = list(counter.items())  # list of (word, frequency) pairs
        # index-to-word mapping: the position in the list is the index itself,
        # so unlike the word-to-index mapping it does not need a dictionary
        self.idx_to_token = []
        if use_special_tokens:
            # padding, begin of sentence, end of sentence, unknown
            self.pad, self.bos, self.eos, self.unk = (0, 1, 2, 3)
            self.idx_to_token += ['<pad>', '<bos>', '<eos>', '<unk>']
        else:
            self.unk = 0
            self.idx_to_token += ['<unk>']
        # keep a word if its frequency reaches the threshold (min_freq) and it is not
        # already in the list; the word's position in the list is its index,
        # which gives the index-to-word mapping
        self.idx_to_token += [token for token, freq in self.token_freqs
                              if freq >= min_freq and token not in self.idx_to_token]
        self.token_to_idx = dict()  # word-to-index dictionary
        for idx, token in enumerate(self.idx_to_token):  # words as keys, indices as values
            self.token_to_idx[token] = idx

    def __len__(self):  # size of the vocabulary
        return len(self.idx_to_token)

    def __getitem__(self, tokens):  # word-to-index mapping
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):  # index-to-word mapping
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

def count_corpus(sentences):  # word-frequency statistics
    # sentences is a 2D list; each inner list holds the words of one sentence
    tokens = [tk for st in sentences for tk in st]  # flatten all words into a 1D list
    return collections.Counter(tokens)  # a dict-like Counter of occurrence counts
```
Here we build a dictionary using the file we read in above as the corpus.
```python
vocab = Vocab(tokens)
# view the first ten word-to-index mappings
print(list(vocab.token_to_idx.items())[0:10])
```
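The other constructor arguments can be exercised as well; the call below is an assumed usage showing min_freq and use_special_tokens (the exact output depends on the corpus):

```python
# Assumed usage: reserve the special tokens and drop words that appear fewer than twice
vocab_special = Vocab(tokens, min_freq=2, use_special_tokens=True)
print(list(vocab_special.token_to_idx.items())[0:6])  # '<pad>', '<bos>', '<eos>', '<unk>' come first
print(len(vocab), len(vocab_special))  # the min_freq filter usually makes the second vocabulary smaller
```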
Turn words into indices
Using the dictionary, we can convert each sentence in the original text from a sequence of words into a sequence of indices.
```python
# pick two sentences to view the mapping
for i in range(8, 10):
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])
```
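The mapping also works in the other direction through to_tokens; a short illustrative round trip (the out-of-vocabulary word is made up):

```python
# Round trip: words -> indices -> words; unseen words fall back to the unknown token
indices = vocab[tokens[8]]
print(vocab.to_tokens(indices))           # should reproduce tokens[8]
print(vocab['a-word-not-in-the-corpus'])  # maps to vocab.unk, i.e. index 0
```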
Tokenization with existing tools
The simple tokenization above has at least the following drawbacks (illustrated concretely by the sketch after this list):
- Punctuation usually provides semantic information, but this method discards it directly
- Contractions such as "shouldn't" and "doesn't" may be split incorrectly
- Abbreviations such as "Mr." and "Dr." will also be mishandled
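To make these drawbacks concrete, here is what the earlier regex cleanup plus splitting on spaces does to the example sentence used below (illustrative only):

```python
# Illustrative only: the simple cleanup mangles "Mr." and "doesn't"
sample = "Mr. Chen doesn't agree with my suggestion."
cleaned = re.sub('[^a-z]+', ' ', sample.strip().lower())
print(cleaned.split(' '))
# -> ['mr', 'chen', 'doesn', 't', 'agree', 'with', 'my', 'suggestion', '']
```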
These problems can be addressed by adding more complex rules, but in practice there are existing tools that tokenize text well, such as spaCy and NLTK.
Examples of these two tools:
text = "Mr. Chen doesn't agree with my suggestion." # Using spacy import spacy nlp = spacy.load('en_core_web_sm') # Choose English: 'en core web SM' doc = nlp(text) print([token.text for token in doc]) # Using nltk from nltk.tokenize import word_tokenize from nltk import data data.path.append('File path') print(word_tokenize(text))