Implementing Word2Vec in PyTorch

Keywords: network

How do you implement the Word2Vec algorithm in PyTorch using the skip-gram architecture? Word embeddings map the words used in natural language processing into dense vectors that capture concepts, and they are very useful for tasks such as machine translation.

Word Embedding

When dealing with words in text, you need to handle many thousands of word categories; each word in the vocabulary corresponds to one category. Encoding these words as one-hot vectors is very inefficient, because most of the values in a one-hot vector are zero. If a one-hot input vector is multiplied by the first hidden layer's weight matrix, almost all of the multiplications involve zeros. To solve this problem and improve the efficiency of the network, we use an embedding. An embedding is really just a fully connected layer, like the layers you have seen before. This layer is called the embedding layer and its weights are the embedding weights. We can skip the multiplication entirely and read the hidden layer values directly from the weight matrix, because multiplying a one-hot vector by a matrix simply returns the matrix row corresponding to the index of the "on" input unit.
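Because an embedding layer is just a lookup into its weight matrix, we can check this equivalence directly. Below is a minimal sketch (not part of the original tutorial); the vocabulary size, embedding dimension, and the index 3 are arbitrary values chosen for illustration.

import torch
import torch.nn as nn

# A toy vocabulary of 10 words and 4-dimensional embeddings
vocab_size, embed_dim = 10, 4
embedding = nn.Embedding(vocab_size, embed_dim)

# One-hot vector with the 3rd unit "on"
one_hot = torch.zeros(1, vocab_size)
one_hot[0, 3] = 1.0

lookup = embedding(torch.LongTensor([3]))   # direct row lookup by index
matmul = one_hot @ embedding.weight         # one-hot vector times weight matrix
print(torch.allclose(lookup, matmul))       # True: both give the same row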

Word2Vec

The Word2Vec algorithm finds vectors that represent words, giving a much more efficient representation. These vectors also carry semantic information about the words. Words that appear in similar contexts, such as "coffee", "tea" and "water", will have vectors that lie close to each other, while the vectors of unrelated words are farther apart; distances in the vector space can therefore represent relationships between words. We will use the skip-gram architecture with negative sampling, because skip-gram performs better than CBOW and negative sampling makes training faster. In the skip-gram architecture, a word is passed in and the network tries to predict the words surrounding it in the text. In this way, we can train the network to learn representations for words that appear in similar contexts.
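As a toy illustration of the (input word, context word) pairs that skip-gram trains on (this snippet is not part of the original code; the sentence and window size are made up for illustration):

# Enumerate skip-gram (input word, context word) pairs for a toy sentence
sentence = "the quick brown fox jumps".split()
window = 2
pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]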

Loading data

Load the data and place it in the data directory.

# read in the extracted text file      
with open('data/text8/text8') as f:
    text = f.read()

# print out the first 100 characters
print(text[:100])

Preprocessing

Pre-processing text makes the training process more convenient. The preprocess function in the utils.py file performs the following operations:

  • All punctuation is converted to tokens, so "." becomes <PERIOD>. Although this data set has no punctuation, this step is useful for other NLP problems.
  • Words that appear five times or fewer in the data set are removed. This significantly reduces noise in the data and improves the quality of the vector representations.
  • Returns a list of words in the text.

utils.py 

import re
from collections import Counter

def preprocess(text):

    # Replace punctuation with tokens so we can use them in our model
    text = text.lower()
    text = text.replace('.', ' <PERIOD> ')
    text = text.replace(',', ' <COMMA> ')
    text = text.replace('"', ' <QUOTATION_MARK> ')
    text = text.replace(';', ' <SEMICOLON> ')
    text = text.replace('!', ' <EXCLAMATION_MARK> ')
    text = text.replace('?', ' <QUESTION_MARK> ')
    text = text.replace('(', ' <LEFT_PAREN> ')
    text = text.replace(')', ' <RIGHT_PAREN> ')
    text = text.replace('--', ' <HYPHENS> ')
    # text = text.replace('\n', ' <NEW_LINE> ')
    text = text.replace(':', ' <COLON> ')
    words = text.split()
    
    # Remove all words with 5 or fewer occurrences
    word_counts = Counter(words)
    trimmed_words = [word for word in words if word_counts[word] > 5]

    return trimmed_words


def create_lookup_tables(words):
    """
    Create lookup tables for vocabulary
    :param words: Input list of words
    :return: Two dictionaries, vocab_to_int, int_to_vocab
    """
    word_counts = Counter(words)
    # sorting the words from most to least frequent in text occurrence
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    # create int_to_vocab dictionaries
    int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
    vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}

    return vocab_to_int, int_to_vocab

import utils

# get list of words
words = utils.preprocess(text)
print(words[:30])
# print some stats about this word data
print("Total words in text: {}".format(len(words)))
print("Unique words: {}".format(len(set(words)))) # `set` removes any duplicate words

Dictionaries

Two dictionaries will be created: one mapping words to integers and the other mapping integers back to words. Again, a function in the utils.py file completes this step: create_lookup_tables takes a list of text words and returns the two dictionaries.

  • Integers are assigned in descending order of frequency, so the most common word ("the") gets the integer 0, the second most common gets 1, and so on.

After the dictionary is created, the words are converted to integers and stored in the int_words list.

vocab_to_int, int_to_vocab = utils.create_lookup_tables(words)
int_words = [vocab_to_int[word] for word in words]

print(int_words[:30])

Subsampling

Frequent words such as "the", "of" and "for" do not provide much contextual information to nearby words. If we discard some of them, we can remove some of the noise from the data and get faster training and better representations. Mikolov calls this process subsampling. For each word w_i in the training set, it is discarded with probability

P(w_i) = 1 - sqrt(t / f(w_i))

where t is a threshold parameter and f(w_i) is the frequency of word w_i in the total data set.
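For example, with t = 1e-5 and a hypothetical word that makes up 5% of the corpus (f(w_i) = 0.05), the discard probability is 1 - sqrt(1e-5 / 0.05) ≈ 0.986, so that word is dropped almost every time it appears, while a word whose frequency is at or below the threshold is never dropped.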

We subsample the words in int_words. That is, we go through int_words and discard each word with the probability given above. Note that P(w_i) is the probability that a word is discarded. The subsampled data is assigned to train_words.

from collections import Counter
import random
import numpy as np

threshold = 1e-5
word_counts = Counter(int_words)
#print(list(word_counts.items())[0])  # dictionary of int_words, how many times they appear

total_count = len(int_words)
freqs = {word: count/total_count for word, count in word_counts.items()}
p_drop = {word: 1 - np.sqrt(threshold/freqs[word]) for word in word_counts}
# discard some frequent words, according to the subsampling equation
# create a new list of words for training
train_words = [word for word in int_words if random.random() < (1 - p_drop[word])]

print(train_words[:30])
print(len(Counter(train_words)))

Create batches

Once the data is ready, it needs to be batched before it can be passed to the network. With the skip-gram architecture, for each word in the text we define a surrounding context window (of size window_size) and grab all the words inside it. The get_target function below picks a random range R between 1 and window_size, so words closer to the input word are sampled more often than distant ones.

def get_target(words, idx, window_size=5):
    ''' Get a list of words in a window around an index. '''
    
    R = np.random.randint(1, window_size+1)
    start = idx - R if (idx - R) > 0 else 0
    stop = idx + R
    target_words = words[start:idx] + words[idx+1:stop+1]
    
    return list(target_words)
# test your code!

# run this cell multiple times to check for random window selection
int_text = [i for i in range(10)]
print('Input: ', int_text)
idx=5 # word index of interest

target = get_target(int_text, idx=idx, window_size=5)
print('Target: ', target)  # you should get some indices around the idx

Generate batch data

The following generator function uses the get_target function above to return batches of input and target data. It takes batch_size words at a time from the word list, and for each word in a batch it grabs the target context words within the window.

def get_batches(words, batch_size, window_size=5):
    ''' Create a generator of word batches as a tuple (inputs, targets) '''
    
    n_batches = len(words)//batch_size
    
    # only full batches
    words = words[:n_batches*batch_size]
    
    for idx in range(0, len(words), batch_size):
        x, y = [], []
        batch = words[idx:idx+batch_size]
        for ii in range(len(batch)):
            batch_x = batch[ii]
            batch_y = get_target(batch, ii, window_size)
            y.extend(batch_y)
            x.extend([batch_x]*len(batch_y))
        yield x, y
    
int_text = [i for i in range(20)]
x,y = next(get_batches(int_text, batch_size=4, window_size=5))

print('x\n', x)
print('y\n', y)

Validation

Next, create a function that lets us observe the model as it learns. We will pick a few common and a few uncommon words, then use cosine similarity to find the words closest to them. The embedding table turns each validation word into a vector, and we compute the similarity between that vector and every word vector in the embedding table. After computing the similarities, we print each validation word along with the semantically closest words from the embedding table. This lets us check whether the embedding table groups words with similar meanings.

def cosine_similarity(embedding, valid_size=16, valid_window=100, device='cpu'):
    """ Returns the cosine similarity of validation words with words in the embedding matrix.
        Here, embedding should be a PyTorch embedding module.
    """
    
    # Here we're calculating the cosine similarity between some random words and 
    # our embedding vectors. With the similarities, we can look at what words are
    # close to our random words.
    
    # sim = (a . b) / |a||b|
    
    embed_vectors = embedding.weight
    
    # magnitude of embedding vectors, |b|
    magnitudes = embed_vectors.pow(2).sum(dim=1).sqrt().unsqueeze(0)
    
    # pick N words from our ranges (0,window) and (1000,1000+window). lower id implies more frequent 
    valid_examples = np.array(random.sample(range(valid_window), valid_size//2))
    valid_examples = np.append(valid_examples,
                               random.sample(range(1000,1000+valid_window), valid_size//2))
    valid_examples = torch.LongTensor(valid_examples).to(device)
    
    valid_vectors = embedding(valid_examples)
    similarities = torch.mm(valid_vectors, embed_vectors.t())/magnitudes
        
    return valid_examples, similarities

Negative sampling

For every example we give the network, we train it using the output from the softmax layer. That means for each input we make tiny adjustments to millions of weights, even though we only have one true example. This makes training the network very inefficient. Instead, we can approximate the loss from the softmax layer by only updating a small subset of the weights at a time. We update the weights for the correct example, but only for a small number of incorrect (noise) examples. This process is called negative sampling.

There are two modifications to make. First, since we don't need the softmax output over all the words, we only care about one output word at a time. Just as an embedding table maps input words to the hidden layer, we can use another embedding table to map the hidden layer to output words. We will therefore have two embedding layers: one for input words and one for output words. Second, we modify the loss function, because we only care about the true example and a small number of noise examples.

This loss function is a little complicated:

-log σ(u_{w_O}ᵀ v_{w_I}) − Σ_{w_i ∼ P_n(w)} log σ(−u_{w_i}ᵀ v_{w_I})

Here u_{w_O} is the embedding vector of the "output" target word (transposed, hence the ᵀ), v_{w_I} is the embedding vector of the "input" word, and σ is the sigmoid function. The first term says we take the log-sigmoid of the inner product of the output word vector and the input word vector.

The second term is a sum over words w_i drawn from the noise distribution P_n(w). The noise distribution is essentially the vocabulary of words that are not in the context of the input word; in practice we can simply sample words at random from the vocabulary. P_n(w) is an arbitrary probability distribution, so we get to decide how to weight the sampled words. It could be a uniform distribution, where every word is equally likely to be drawn, or it could follow the frequency with which each word appears in the text corpus, the unigram distribution U(w). The authors found empirically that the best choice is the unigram distribution raised to the 3/4 power, U(w)^(3/4).

Finally, in the term log σ(−u_{w_i}ᵀ v_{w_I}), we take the log-sigmoid of the negated inner product of a noise vector with the input vector.

import torch
from torch import nn
import torch.optim as optim
class SkipGramNeg(nn.Module):
    def __init__(self, n_vocab, n_embed, noise_dist=None):
        super().__init__()
        
        self.n_vocab = n_vocab
        self.n_embed = n_embed
        self.noise_dist = noise_dist
        
        # define embedding layers for input and output words
        self.in_embed = nn.Embedding(n_vocab,n_embed)
        self.out_embed = nn.Embedding(n_vocab,n_embed)
        
        # Initialize both embedding tables with a uniform distribution
        # (a range of -1 to 1 is a common choice)
        self.in_embed.weight.data.uniform_(-1, 1)
        self.out_embed.weight.data.uniform_(-1, 1)
        
    def forward_input(self, input_words):
        # return input vector embeddings
        input_vectors = self.in_embed(input_words)
        return input_vectors
    
    def forward_output(self, output_words):
        # return output vector embeddings
        output_vectors = self.out_embed(output_words)
        return output_vectors
    
    def forward_noise(self, batch_size, n_samples):
        """ Generate noise vectors with shape (batch_size, n_samples, n_embed)"""
        if self.noise_dist is None:
            # Sample words uniformly
            noise_dist = torch.ones(self.n_vocab)
        else:
            noise_dist = self.noise_dist
            
        # Sample words from our noise distribution
        noise_words = torch.multinomial(noise_dist,
                                        batch_size * n_samples,
                                        replacement=True)
        
        device = "cuda" if model.out_embed.weight.is_cuda else "cpu"
        noise_words = noise_words.to(device)
        
        # get the noise embeddings and reshape them to (batch_size, n_samples, n_embed)
        noise_vectors = self.out_embed(noise_words).view(batch_size, n_samples, self.n_embed)
        
        return noise_vectors
class NegativeSamplingLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, input_vectors, output_vectors, noise_vectors):
        
        batch_size, embed_size = input_vectors.shape
        
        # Input vectors should be a batch of column vectors
        input_vectors = input_vectors.view(batch_size, embed_size, 1)
        
        # Output vectors should be a batch of row vectors
        output_vectors = output_vectors.view(batch_size, 1, embed_size)
        
        # bmm = batch matrix multiplication
        # correct log-sigmoid loss
        out_loss = torch.bmm(output_vectors, input_vectors).sigmoid().log()
        out_loss = out_loss.squeeze()
        
        # incorrect log-sigmoid loss
        noise_loss = torch.bmm(noise_vectors.neg(), input_vectors).sigmoid().log()
        noise_loss = noise_loss.squeeze().sum(1)  # sum the losses over the sample of noise vectors

        # negate and sum correct and noisy log-sigmoid losses
        # return average batch loss
        return -(out_loss + noise_loss).mean()

Training

The following is the training loop. If a GPU is available, it is recommended to train the model on the GPU.

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Get our noise distribution
# Using word frequencies calculated earlier in the notebook
word_freqs = np.array(sorted(freqs.values(), reverse=True))
unigram_dist = word_freqs/word_freqs.sum()
noise_dist = torch.from_numpy(unigram_dist**(0.75)/np.sum(unigram_dist**(0.75)))

# instantiating the model
embedding_dim = 300
model = SkipGramNeg(len(vocab_to_int), embedding_dim, noise_dist=noise_dist).to(device)

# using the loss that we defined
criterion = NegativeSamplingLoss() 
optimizer = optim.Adam(model.parameters(), lr=0.003)

print_every = 1500
steps = 0
epochs = 5

# train for some number of epochs
for e in range(epochs):
    
    # get our input, target batches
    for input_words, target_words in get_batches(train_words, 512):
        steps += 1
        inputs, targets = torch.LongTensor(input_words), torch.LongTensor(target_words)
        inputs, targets = inputs.to(device), targets.to(device)
        
        # input, output, and noise vectors
        input_vectors = model.forward_input(inputs)
        output_vectors = model.forward_output(targets)
        noise_vectors = model.forward_noise(inputs.shape[0], 5)

        # negative sampling loss
        loss = criterion(input_vectors, output_vectors, noise_vectors)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # loss stats
        if steps % print_every == 0:
            print("Epoch: {}/{}".format(e+1, epochs))
            print("Loss: ", loss.item()) # avg batch loss at this point in training
            valid_examples, valid_similarities = cosine_similarity(model.in_embed, device=device)
            _, closest_idxs = valid_similarities.topk(6)

            valid_examples, closest_idxs = valid_examples.to('cpu'), closest_idxs.to('cpu')
            for ii, valid_idx in enumerate(valid_examples):
                closest_words = [int_to_vocab[idx.item()] for idx in closest_idxs[ii]][1:]
                print(int_to_vocab[valid_idx.item()] + " | " + ', '.join(closest_words))
            print("...\n")

Visualizing the word vectors

Next, we will use t-SNE to visualize how the high-dimensional word vectors cluster. t-SNE projects these vectors into two-dimensional space while preserving local structure.

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
# getting embeddings from the embedding layer of our model, by name
embeddings = model.in_embed.weight.to('cpu').data.numpy()
viz_words = 380
tsne = TSNE()
embed_tsne = tsne.fit_transform(embeddings[:viz_words, :])
fig, ax = plt.subplots(figsize=(16, 16))
for idx in range(viz_words):
    plt.scatter(*embed_tsne[idx, :], color='steelblue')
    plt.annotate(int_to_vocab[idx], (embed_tsne[idx, 0], embed_tsne[idx, 1]), alpha=0.7)

 
