Datawhale Team Learning Camp, Task 4: Text Preprocessing

Text is a form of sequence data: an article can be viewed as a sequence of characters or a sequence of words.
Text preprocessing usually consists of four steps:
  1. Read in the text
  2. Tokenize it
  3. Build a vocabulary that maps each word to a unique index
  4. Convert the text from a sequence of words to a sequence of indices, so it can be fed into a model
Read in text
import collections    # Package used for word frequency statistics
import re             # Package used for regular expressions

def read_time_machine():
    with open('**File path+file name**.txt', 'r') as f:   # Open the corpus file
        lines = [re.sub('[^a-z]+', ' ', line.strip().lower()) for line in f]
        # For each line: strip surrounding whitespace, lowercase it, and replace
        # every run of non-letter characters with a single space
    return lines  # Return a list of cleaned lines, one string per line of the file

lines = read_time_machine()
print('# sentences %d' % len(lines))   # Print the number of lines (sentences) read
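
As a quick check of the cleaning step: the regular expression '[^a-z]+' matches any run of characters that is not a lowercase letter and replaces it with a single space (the line has already been lowercased). A minimal illustration on a made-up string:

sample = 'The Time Machine, by H. G. Wells [1898]'
print(re.sub('[^a-z]+', ' ', sample.strip().lower()))
# Prints: 'the time machine by h g wells ' -- punctuation and digits become spaces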
Tokenization

Tokenization splits each sentence into individual words (tokens), turning a string into a sequence of tokens.

def tokenize(sentences, token='word'):      # Tokenize by words (default) or by characters
    if token == 'word':                     # Split on spaces to get words
        return [sentence.split(' ') for sentence in sentences]
    elif token == 'char':                   # Split into individual characters
        return [list(sentence) for sentence in sentences]
    else:
        print('ERROR: unknown token type ' + token)

tokens = tokenize(lines)
tokens[0:2]
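
The same function also supports character-level tokenization; a quick usage example:

chars = tokenize(lines, token='char')
chars[0][0:10]   # The first ten characters of the first sentence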
Build a vocabulary

To facilitate model processing, you need to convert strings to numbers. So you need to build a vocabulary first, mapping each word to a unique index number.

class Vocab(object):
    def __init__(self, tokens, min_freq=0, use_special_tokens=False):
        counter = count_corpus(tokens)             # Counter of word frequencies
        self.token_freqs = list(counter.items())   # List of (word, frequency) pairs
        self.idx_to_token = []                     # List mapping index -> word; the list position itself is the index, so no dictionary is needed in this direction
        if use_special_tokens:
            # padding, begin of sentence, end of sentence, unknown
            self.pad, self.bos, self.eos, self.unk = (0, 1, 2, 3)
            self.idx_to_token += ['<pad>', '<bos>', '<eos>', '<unk>']
        else:
            self.unk = 0
            self.idx_to_token += ['<unk>']

        self.idx_to_token += [token for token, freq in self.token_freqs
                              if freq >= min_freq and token not in self.idx_to_token]
        # Keep a word only if its frequency reaches the threshold (min_freq)
        # and it has not already been added; a word's position in the list is its
        # index, which gives the index-to-word mapping

        self.token_to_idx = dict()      # Dictionary mapping word -> index
        for idx, token in enumerate(self.idx_to_token):
            # Key is the word, value is its index
            self.token_to_idx[token] = idx

    def __len__(self):
        # Return the size of the vocabulary
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        # Map a word (or a list/tuple of words) to its index; unknown words map to self.unk
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        # Map an index (or a list/tuple of indices) back to its word
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

def count_corpus(sentences):
    # Word frequency statistics
    # sentences is a 2D list: each inner list holds the words of one sentence
    tokens = [tk for st in sentences for tk in st]  # Flatten all words into a 1D list
    return collections.Counter(tokens)              # Counter recording the number of occurrences of each word
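
A quick check of count_corpus on a toy corpus (the two sentences here are made up for illustration):

toy = [['the', 'time', 'machine'], ['the', 'time']]
print(count_corpus(toy))   # Counter({'the': 2, 'time': 2, 'machine': 1})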

Here we build a vocabulary using the file we read in above as the corpus:

vocab = Vocab(tokens)
# View word to index results
print(list(vocab.token_to_idx.items())[0:10])
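
Passing use_special_tokens=True reserves the first four indices for the special tokens, and raising min_freq drops rare words; a small sketch (the variable name vocab_st is just for illustration):

vocab_st = Vocab(tokens, min_freq=5, use_special_tokens=True)
print(vocab_st.idx_to_token[0:4])   # ['<pad>', '<bos>', '<eos>', '<unk>']
print(len(vocab_st))                # The vocabulary shrinks as min_freq grows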
Convert words to indices

With the vocabulary, sentences in the original text can be converted from sequences of words to sequences of indices.

# Pick two sentences to view the word-to-index mapping
for i in range(8, 10):
    print('words:', tokens[i])
    print('indices:', vocab[tokens[i]])
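
to_tokens gives the reverse mapping, so an index sequence can be converted back into words (with the default min_freq=0 every word is in the vocabulary, so the round trip is exact):

indices = vocab[tokens[8]]
print(vocab.to_tokens(indices))   # Recovers the words of the sentence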
Tokenization with existing tools

The simple space-based tokenization above has at least the following disadvantages:

  1. Punctuation usually carries semantic information, but this method discards it entirely
  2. Contractions like "shouldn't" and "doesn't" are mishandled (see the example after this list)
  3. Abbreviations such as "Mr." and "Dr." are mishandled
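
For example, pushing a contraction through the clean-and-split pipeline above shows problem 2 concretely:

print(re.sub('[^a-z]+', ' ', "shouldn't".lower()).split(' '))
# ['shouldn', 't'] -- the apostrophe is discarded and the word is torn apart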

These problems could be mitigated by adding more complex rules, but in practice existing tools such as spaCy and NLTK already handle tokenization well.

Examples of these two tools:

text = "Mr. Chen doesn't agree with my suggestion."

# Using spacy
import spacy
nlp = spacy.load('en_core_web_sm')			# Load spaCy's small English model
doc = nlp(text)
print([token.text for token in doc])

# Using nltk
from nltk.tokenize import word_tokenize
from nltk import data
data.path.append('File path')               # Add the local NLTK data directory to the search path (placeholder)
print(word_tokenize(text))
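
Both tools need their language resources downloaded once beforehand; assuming a standard installation, something like:

# Shell command to download spaCy's small English model:
#   python -m spacy download en_core_web_sm
# NLTK's word_tokenize relies on the Punkt models, which can be fetched with:
import nltk
nltk.download('punkt')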