PyTorch self-study notes (1): using character-level features to enhance an LSTM part-of-speech tagger

Keywords: Python network


Recently I started to study PyTorch systematically, and I am going to write a series of blog posts (about 5 articles) to record my learning process. This first note is based on the Chinese version of PyTorch's official tutorial on sequence models and long short-term memory (LSTM) networks. The network is very simple, but it is the first model I have implemented with PyTorch, so it is worth a blog post to mark the occasion.

1. LSTM in PyTorch

The basic concepts and structure of LSTM are not covered in detail here (a very famous blog post explains them well). This section mainly lists a few points to watch when using the LSTM module in PyTorch:

  • The input of an LSTM in PyTorch is a 3D tensor, and each dimension has a specific meaning: the first dimension is the sequence itself, the second indexes the instances in the mini-batch, and the third indexes the elements of the input.
  • For a sentence, the output of the embedding layer is a 2D tensor (one row per word), so if the embedding layer feeds directly into the LSTM layer, we need tensor.view() to reshape its output explicitly: embeds.view(len(sentence), 1, -1)
  • The LSTM returns output, (h_n, c_n). output holds the hidden state h of the last layer at every time step. For a bidirectional LSTM, the h at each time step is [h_forward, h_backward] (the forward and backward h of the same time step concatenated). h_n stacks the final hidden state of every layer, and c_n stacks the final cell (memory) state of every layer.
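
To make the shapes in the list above concrete, here is a small sketch. The sizes (5 time steps, batch of 1, 2 layers and so on) are arbitrary values chosen for illustration, not values from the tutorial.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Arbitrary illustrative sizes (assumptions, not from the tutorial)
seq_len, batch_size, input_size, hidden_size, num_layers = 5, 1, 6, 4, 2

# Default layout (batch_first=False): (seq_len, batch, input_size)
inputs = torch.randn(seq_len, batch_size, input_size)

lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers, bidirectional=True)
output, (h_n, c_n) = lstm(inputs)

# output: last layer only, every time step, forward and backward h concatenated
print(output.shape)  # torch.Size([5, 1, 8])
# h_n / c_n: final state of every layer and direction
print(h_n.shape)     # torch.Size([4, 1, 4])
print(c_n.shape)     # torch.Size([4, 1, 4])

# Forward half of the last time step == final forward state of the last layer
assert torch.allclose(output[-1, :, :hidden_size], h_n[-2])
# Backward half of the first time step == final backward state of the last layer
assert torch.allclose(output[0, :, hidden_size:], h_n[-1])
```

The two assertions show why output and h_n overlap but are not the same thing: output covers all time steps of the last layer, while h_n covers only the final step of every layer.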

Note: this is only a brief overview. For a more in-depth treatment I recommend this article, whose diagrams are very clear. The input and output of LSTM in PyTorch are more complex than one might expect, including packed input sequences, which will be discussed later.

2. Part of speech tagging with LSTM

The tutorial gives the code for part-of-speech tagging with an LSTM network. The code has three parts: data preparation, model creation, and model training.

2.1 data preparation

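import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

def prepare_sequence(seq, to_ix):
    # Map each token (word or tag) to its integer index
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)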

training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]
word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

# In practice, larger dimensions such as 32 or 64 are usually used.
# Here we keep them small so that the weight changes during training are easy to inspect
# (6 and 6 are the values used in the official tutorial).
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

2.2 create model

class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as input and outputs hidden states of dimension hidden_dim
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # There is no hidden state at first, so we need to initialize one
        # See the PyTorch documentation for why the tensors have these dimensions
        # The meaning of each dimension is (num_layers, minibatch_size, hidden_dim)
        return (torch.zeros(1, 1, self.hidden_dim),
                torch.zeros(1, 1, self.hidden_dim))

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, self.hidden = self.lstm(
            embeds.view(len(sentence), 1, -1), self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

2.3 model training

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Look at the scores before training
# Note: element (i, j) of the output is the score of tag j for word i
# We are not training here, so no gradients are needed and we use torch.no_grad()
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)

for epoch in range(300):  # In practice you would not train for 300 epochs; this is just toy data
    for sentence, tags in training_data:
        # Step 1: remember that PyTorch accumulates gradients,
        # so we need to clear them before each instance
        model.zero_grad()

        # In addition, the hidden state of the LSTM needs to be reset
        # to detach it from the history of the previous instance
        model.hidden = model.init_hidden()

        # Step 2: prepare the network input, i.e. turn it into tensors of word indices
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3: forward pass
        tag_scores = model(sentence_in)

        # Step 4: compute the loss and gradients, then update the parameters by calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

# Look at the scores after training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    print(tag_scores)

    # The sentence is "the dog ate the apple"; element (i, j) is the score of tag j for word i
    # We take the tag with the highest score as the prediction. In the printed scores,
    # the predicted sequence is 0 1 2 0 1. Because indices start from 0, the first value 0
    # is the position of the maximum in the first row, the second value 1 is the position
    # of the maximum in the second row, and so on. So the final result is DET
    # NOUN VERB DET NOUN, the whole sequence is correct!
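
Reading the predicted tags off the score matrix can also be done in code rather than by eye. The sketch below uses made-up scores (not real model output) and a hypothetical reverse mapping ix_to_tag that does not appear in the tutorial.

```python
import torch

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}
# Hypothetical helper: reverse mapping from index back to tag name
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}

# Made-up log-probabilities for a 5-word sentence (rows: words, columns: tags)
tag_scores = torch.log(torch.tensor([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.10, 0.10, 0.80],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
]))

# argmax over dim=1 picks the highest-scoring tag for each word
predicted_ix = tag_scores.argmax(dim=1).tolist()
predicted_tags = [ix_to_tag[ix] for ix in predicted_ix]

print(predicted_ix)    # [0, 1, 2, 0, 1]
print(predicted_tags)  # ['DET', 'NN', 'V', 'DET', 'NN']
```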

3. Use character level features to enhance LSTM part of speech tagger

The model given in the tutorial uses only word vectors as the input of the sequence model, which means it considers only word-level features. Character-level information such as affixes has a strong influence on part of speech. For example, words with the suffix -ly are almost always tagged as adverbs. Therefore, building on the code above, we add character-level features of each word to enhance the word embedding.
The ideas given in the tutorial are:

  • The new model needs two LSTMs: one outputs the part-of-speech tag scores, and the other produces the character-level representation of each word;
  • To run a sequence model at the character level, the characters need to be embedded, and these character embeddings are the input to the character LSTM.
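
Before characters can be embedded they need integer indices, just like words. A minimal sketch of one way to build them is shown here. The names character_to_ix and words_characters match the ones used in the forward pass later in this post, but how they are constructed is my assumption.

```python
training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"]),
]

word_to_ix = {}
character_to_ix = {}
for sent, _tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
        for ch in word:
            if ch not in character_to_ix:
                character_to_ix[ch] = len(character_to_ix)

# Assumed layout: word index -> list of that word's characters,
# so the model can look up the characters from a word index
words_characters = {ix: list(word) for word, ix in word_to_ix.items()}

print(len(word_to_ix))      # 9
print(words_characters[1])  # ['d', 'o', 'g']
```

Note that "The" and "the" get different word indices here, which is exactly the kind of case where shared character-level features can help.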

Therefore, the model has two embedding layers, character_embeddings and word_embeddings:

 # Word embedding
 self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

 # Character embedding
 self.character_embeddings = nn.Embedding(character_size, character_embedding_dim)

And two LSTM layers, the character LSTM and the tag LSTM:

# lstm_character takes the character embeddings of a word as input and outputs the word's character-level features, as hidden states of dimension character_hidden_dim
self.lstm_character = nn.LSTM(character_embedding_dim, character_hidden_dim)

# tag_lstm takes the concatenation of the word embedding and the word's character-level features as input and outputs hidden states of dimension hidden_dim
self.tag_lstm = nn.LSTM(embedding_dim + character_hidden_dim, hidden_dim)

Here, character_embeddings embeds the characters of each word. These embeddings are fed into the character LSTM to obtain the character-level features of the word. The output of the character LSTM is then concatenated with the word vector produced by the word_embeddings layer, and the concatenated result is fed into the tag LSTM as the new word vector for sequence tagging.

# Assumed surrounding loop: build one combined vector per word in the sentence
embeds = []
for sentence_word in sentence:
    # Word embedding
    word_embed = self.word_embeddings(sentence_word)
    # Get the word's character-level features
    word_character = words_characters[sentence_word.item()]
    word_character_in = prepare_sequence(word_character, character_to_ix)
    character_embeds = self.character_embeddings(word_character_in)
    character_lstm_out, self.hidden_character = self.lstm_character(
        character_embeds.view(len(word_character_in), 1, -1), self.hidden_character)
    # Concatenate the word vector with the character-level features
    embed = torch.cat((word_embed, self.hidden_character[0].view(-1)))
    embeds.append(embed)
# Join the combined vectors of all words in the sentence and use the result as the input of tag_lstm
embeds = torch.cat(embeds).view(len(sentence), 1, -1)
lstm_out, self.hidden_tag = self.tag_lstm(embeds, self.hidden_tag)
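
For reference, here is how the pieces above might fit together in one runnable script. This is a sketch under my own assumptions, not the full code of this post: the class name CharLSTMTagger, the dimension values, and the construction of character_to_ix and words_characters are all illustrative. It also starts each LSTM call from the default zero state instead of storing the hidden state on the model, which is a simplification of the code above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)

training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"]),
]
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}
word_to_ix, character_to_ix = {}, {}
for sent, _ in training_data:
    for word in sent:
        word_to_ix.setdefault(word, len(word_to_ix))
        for ch in word:
            character_to_ix.setdefault(ch, len(character_to_ix))
# Assumed layout: word index -> list of characters
words_characters = {ix: list(word) for word, ix in word_to_ix.items()}


def prepare_sequence(seq, to_ix):
    return torch.tensor([to_ix[item] for item in seq], dtype=torch.long)


class CharLSTMTagger(nn.Module):  # hypothetical name
    def __init__(self, embedding_dim, hidden_dim, char_embedding_dim,
                 char_hidden_dim, vocab_size, char_size, tagset_size):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.character_embeddings = nn.Embedding(char_size, char_embedding_dim)
        self.lstm_character = nn.LSTM(char_embedding_dim, char_hidden_dim)
        self.tag_lstm = nn.LSTM(embedding_dim + char_hidden_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = []
        for word_ix in sentence:
            word_embed = self.word_embeddings(word_ix)  # shape: (embedding_dim,)
            chars = prepare_sequence(words_characters[word_ix.item()], character_to_ix)
            char_embeds = self.character_embeddings(chars).view(len(chars), 1, -1)
            # No initial state is passed, so the character LSTM starts from zeros for every word
            _, (h_n, _) = self.lstm_character(char_embeds)
            # Final hidden state of the character LSTM = character-level feature of the word
            embeds.append(torch.cat((word_embed, h_n.view(-1))))
        embeds = torch.stack(embeds).view(len(sentence), 1, -1)
        lstm_out, _ = self.tag_lstm(embeds)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        return F.log_softmax(tag_space, dim=1)


# Small illustrative dimensions (assumptions)
model = CharLSTMTagger(6, 6, 3, 3, len(word_to_ix), len(character_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(300):
    for sentence, tags in training_data:
        model.zero_grad()
        tag_scores = model(prepare_sequence(sentence, word_to_ix))
        loss = loss_function(tag_scores, prepare_sequence(tags, tag_to_ix))
        loss.backward()
        optimizer.step()

with torch.no_grad():
    scores = model(prepare_sequence(training_data[0][0], word_to_ix))
    print(scores.argmax(dim=1).tolist())  # predicted tag index for each word
```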

During training, the hidden states of both LSTMs must be reset before each instance to detach them from the history of the previous instance:

# Reset the hidden states of both LSTMs
# to detach them from the history of the previous instance
model.hidden_tag = model.init_hidden(HIDDEN_DIM)
model.hidden_character = model.init_hidden(CHARACTER_HIDDEN_DIM)
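
As a side note, when nn.LSTM is called without an initial state it starts from zeros, so resetting to an all-zero state by hand gives the same result as passing no state at all. The sketch below checks this on random input. The sizes are arbitrary, and init_hidden here is a standalone stand-in for the model method above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

HIDDEN_DIM = 6
lstm = nn.LSTM(input_size=4, hidden_size=HIDDEN_DIM)
inputs = torch.randn(5, 1, 4)  # (seq_len, batch, input_size)


def init_hidden(hidden_dim):
    # (num_layers, minibatch_size, hidden_dim) for both h_0 and c_0
    return (torch.zeros(1, 1, hidden_dim), torch.zeros(1, 1, hidden_dim))


out_explicit, _ = lstm(inputs, init_hidden(HIDDEN_DIM))
out_default, _ = lstm(inputs)  # h_0 and c_0 default to zeros

print(torch.allclose(out_explicit, out_default))  # True
```

Storing the state on the model only matters if you want to carry it over from one call to the next. If it is reset before every instance anyway, omitting it is an equivalent and slightly simpler option.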

That is about it for now; I will follow up with a new post. Comments and corrections are welcome.
Full code link -.-

Posted by ym_chaitu on Sat, 20 Jun 2020 20:20:56 -0700