English-to-Chinese machine translation with Keras (seq2seq + LSTM)

The model translates English into Chinese. To better show the model architecture, the figure below is borrowed from another author (note that embeddings are not used here):

The complete code for this article is on GitHub.

Contents

1. Processing text data

    1. Get the source and target sentences

    2. Create character-to-index and index-to-character dictionaries

    3. One-hot encode the Chinese and English sentences

2. Model building

3. Decoder predicts each character

4. Training the model

5. Demo

 

The whole system consists of an encoder and a decoder, each built around an LSTM. The encoder takes the original sentence as input; the decoder takes the translated sentence prefixed with a start symbol as input, and its output is the target sentence suffixed with an end symbol.

The specific steps are as follows:

1. The encoder encodes the input sequence into state vectors.

2. The decoder starts predicting from the first character.

3. Feed the decoder the state vectors (state_h, state_c) together with the one-hot code of the character predicted in the previous step (the first state vectors come from the encoder; after that, the state vectors returned by the decoder's predict call are used when predicting each subsequent character of the target sequence).

4. Use argmax to get the position of the next character, then look up the corresponding character in the dictionary.

5. Append the character from the previous step to the target sequence.

6. End the loop once the specified end character is predicted. A condensed sketch of this loop follows.
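Here is that sketch (one_hot() and max_length are placeholder names used only here; the full implementation is Code 3.1 below):

# Condensed sketch of the decoding loop (one_hot() and max_length are placeholders)
states = encoder_model.predict(input_seq)                          # 1. encode the source sentence
target_seq = one_hot('\t')                                         # 2. start from the start character
decoded = ''
while True:
    probs, h, c = decoder_model.predict([target_seq] + states)    # 3. states + last character in
    char = reverse_target_char_index[np.argmax(probs[0, -1, :])]  # 4. argmax, then dictionary lookup
    decoded += char                                                # 5. append to the target sequence
    if char == '\n' or len(decoded) > max_length:                  # 6. stop at the end character
        break
    target_seq = one_hot(char)                                     # feed the predicted character back in
    states = [h, c]                                                # reuse the decoder's new states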

1. Processing text data

This step splits the raw data into the sentences before and after translation (English source and Chinese target), builds the character dictionaries, and one-hot encodes both sets of sentences so the data are easy to work with.

1. Get the source and target sentences

First, let's look at what the raw data looks like:
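The file data/cmn.txt appears to be the tab-separated English-Chinese sentence-pair file from the Anki corpus on manythings.org; the exact contents may differ, but the first lines look roughly like this:

Hi.	嗨。
Hi.	你好。
Run.	你用跑的。
Wait!	等等！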

Then import the required libraries:

Code 1.1.1

import pandas as pd
import numpy as np
from keras.layers import Input, LSTM, Dense
from keras.models import Model, load_model
from keras.utils import plot_model

#Define the parameters of the neural network
NUM_SAMPLES = 3000   #Size of the training set
batch_size = 64      #Number of samples used in one training step
epochs = 100         #Number of training epochs
latent_dim = 256     #Number of units in the LSTM

Read the file with pandas; we only need the first two columns.

Code 1.1.2

data_path='data/cmn.txt'
df=pd.read_table(data_path,header=None).iloc[:NUM_SAMPLES,0:2]
#Add column names
df.columns=['inputs','targets']
#Add '\t' as the start mark and '\n' as the end mark to each Chinese sentence
df['targets']=df['targets'].apply(lambda x:'\t'+x+'\n')

Finally, the dataframe looks like this:
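A rough illustration (the exact rows depend on the data file; the start and end marks are written out here as \t and \n):

  inputs        targets
0    Hi.        \t嗨。\n
1    Hi.       \t你好。\n
2   Run.    \t你用跑的。\n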

 

Next, convert the English and Chinese columns into lists:

Code 1.1.3

#Get English and Chinese lists
input_texts=df.inputs.values.tolist()
target_texts=df.targets.values.tolist()

#Collect the distinct characters in English and Chinese. Summing df.unique() concatenates every unique sentence into one long string
input_characters = sorted(list(set(df.inputs.unique().sum())))
target_characters = sorted(list(set(df.targets.unique().sum())))

#Number of distinct English characters
num_encoder_tokens = len(input_characters)
#Number of distinct Chinese characters
num_decoder_tokens = len(target_characters)
#Maximum input length
INUPT_LENGTH = max([len(txt) for txt in input_texts])
#Maximum output length
OUTPUT_LENGTH = max([len(txt) for txt in target_texts])
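A quick sanity check (not in the original code; the printed values depend on NUM_SAMPLES and the data file) confirms the vocabulary sizes and maximum lengths:

#Print the derived dimensions (values are data-dependent)
print('num_encoder_tokens:', num_encoder_tokens)  #number of distinct English characters
print('num_decoder_tokens:', num_decoder_tokens)  #number of distinct Chinese characters
print('INUPT_LENGTH:', INUPT_LENGTH)              #longest English sentence, in characters
print('OUTPUT_LENGTH:', OUTPUT_LENGTH)            #longest Chinese sentence, including '\t' and '\n'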

2. Create character-to-index and index-to-character dictionaries

Code 1.2.1

input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])

reverse_input_char_index = dict([(i, char) for i, char in enumerate(input_characters)])
reverse_target_char_index = dict([(i, char) for i, char in enumerate(target_characters)])
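The two pairs of dictionaries are inverses of each other; a small round-trip check (illustrative only, assuming the character 'a' occurs in the English data):

idx = input_token_index['a']                 #character -> index
assert reverse_input_char_index[idx] == 'a'  #index -> character maps back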

3. One-hot encode the Chinese and English sentences

Code 1.3.1

#Each corpus must be converted into the three-dimensional input the LSTM expects: [n_samples, timesteps, one-hot features]
encoder_input_data =np.zeros((NUM_SAMPLES,INUPT_LENGTH,num_encoder_tokens))
decoder_input_data =np.zeros((NUM_SAMPLES,OUTPUT_LENGTH,num_decoder_tokens))
decoder_target_data  = np.zeros((NUM_SAMPLES,OUTPUT_LENGTH,num_decoder_tokens))

for i,(input_text,target_text) in enumerate(zip(input_texts,target_texts)):
    for t,char in enumerate(input_text):
        encoder_input_data[i,t,input_token_index[char]]=1.0

    for t, char in enumerate(target_text):
        decoder_input_data[i,t,target_token_index[char]]=1.0

        if t > 0:
            # decoder_target_data does not contain the start character and is one timestep ahead of decoder_input_data
            decoder_target_data[i, t-1, target_token_index[char]] = 1.0
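Since decoder_target_data is shifted one timestep ahead of decoder_input_data, the relationship can be verified directly (a check added here for illustration, not part of the original code):

#Dropping the last target step and the first input step gives identical tensors
assert np.array_equal(decoder_target_data[:, :-1, :], decoder_input_data[:, 1:, :])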

2. Model building

Code 2.1 (the whole builder lives in a create_model() function, which continues in Code 2.2 below)

def create_model():
    #Define the encoder input
    encoder_inputs=Input(shape=(None,num_encoder_tokens))
    #Define the LSTM layer; latent_dim is the number of units in the LSTM, and return_state=True makes it return the final states h and c
    encoder=LSTM(latent_dim,return_state=True)
    # Run the encoder to get its output (which we do not need) and the states state_h and state_c
    encoder_outputs,state_h,state_c=encoder(encoder_inputs)
    # Discard encoder_outputs; we only keep the encoder states
    encoder_state=[state_h,state_c]
    
    #Define the decoder input
    decoder_inputs=Input(shape=(None,num_decoder_tokens))
    decoder_lstm=LSTM(latent_dim,return_state=True,return_sequences=True)
    # Use the encoder's final states as the initial state of the decoder
    decoder_outputs,_,_=decoder_lstm(decoder_inputs,initial_state=encoder_state)
    #Add a fully connected output layer
    decoder_dense=Dense(num_decoder_tokens,activation='softmax')
    decoder_outputs=decoder_dense(decoder_outputs)
    
    #Define the entire model
    model=Model([encoder_inputs,decoder_inputs],decoder_outputs)
       

Model diagram of the complete model:



At each timestep the decoder has three inputs: the two state vectors state_h and state_c (initially from the encoder) and the one-hot encoded Chinese sequence.

Code 2.2

    #Define the inference encoder model; its output is encoder_state
    encoder_model=Model(encoder_inputs,encoder_state)
    
    decoder_state_input_h=Input(shape=(latent_dim,))
    decoder_state_input_c=Input(shape=(latent_dim,))
    decoder_state_inputs=[decoder_state_input_h,decoder_state_input_c]
    
    # Get the output and intermediate state of decoder
    decoder_outputs,state_h,state_c=decoder_lstm(decoder_inputs,initial_state=decoder_state_inputs)
    decoder_states=[state_h,state_c]
    decoder_outputs=decoder_dense(decoder_outputs)
    decoder_model=Model([decoder_inputs]+decoder_state_inputs,[decoder_outputs]+decoder_states)
    
    plot_model(model, to_file='model.png', show_shapes=True)
    plot_model(encoder_model, to_file='encoder_model.png', show_shapes=True)
    plot_model(decoder_model, to_file='decoder_model.png', show_shapes=True)
    return model,encoder_model,decoder_model
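
As a quick check (a sketch, assuming the token counts and other variables from section 1 are already defined), the builder can be called and the three models inspected:

#Build the training model and the two inference models, then print their layer shapes
model, encoder_model, decoder_model = create_model()
model.summary()          #training model: [encoder_inputs, decoder_inputs] -> decoder_outputs
encoder_model.summary()  #inference encoder: encoder_inputs -> [state_h, state_c]
decoder_model.summary()  #inference decoder: one decoding step with explicit state inputs and outputs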

Model diagram of encoder:



Model diagram of decoder:

3. Decoder predicts each character

First, the encoder generates the state vectors states_value from the input sequence. These are passed to the decoder's input layer together with the one-hot code of the start character "\t", and the position of the next character, sampled_token_index, is predicted. The newly predicted character is written into target_seq and one-hot encoded again, and the state vectors produced while predicting this character become the new state vectors.

This process keeps looping in a while loop until the end character "\n" is predicted, at which point the loop ends and the translated sentence is returned. The figure below shows that the decoder side generates the translated sequence step by step; note the blue line pointing to target_sequence, which is updated at every step.

Code 3.1

def decode_sequence(input_seq,encoder_model,decoder_model):
    # Encode the input sequence to get the state vectors
    states_value = encoder_model.predict(input_seq)

    # Generate an empty sequence with size=1
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Set the contents of this empty sequence to the start character
    target_seq[0, 0, target_token_index['\t']] = 1.

    # Recover the translated characters one at a time
    # For simplicity, assume batch_size = 1
    stop_condition = False
    decoded_sentence = ''

    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        # output_tokens holds the probability distribution over the position of the next character

        # Sample the next character: sampled_token_index is the dictionary position with the highest probability
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: the end character is generated, or the maximum sequence length is exceeded
        if sampled_char == '\n' or len(decoded_sentence) > INUPT_LENGTH:
            stop_condition = True

        # Update target_seq
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update intermediate status
        states_value = [h, c]

    return decoded_sentence
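A usage sketch (assuming encoder_input_data, input_texts and the inference models above are in scope): translate the first few training samples and compare them with their source sentences.

#Spot-check the model on a few training samples
for seq_index in range(5):
    input_seq = encoder_input_data[seq_index:seq_index + 1]   #keep the batch dimension
    decoded_sentence = decode_sequence(input_seq, encoder_model, decoder_model)
    print('Input :', input_texts[seq_index])
    print('Output:', decoded_sentence)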

4. Training the model

The training code is wrapped in a train() function, which the main block in section 5 calls:

def train():
    model,encoder_model,decoder_model=create_model()
    #Compile the model
    model.compile(optimizer='rmsprop',loss='categorical_crossentropy')
    #Train the model
    model.fit([encoder_input_data,decoder_input_data],decoder_target_data,
              batch_size=batch_size,
              epochs=epochs,
              validation_split=0.2)
    #Save the trained models for later testing
    model.save('s2s.h5')
    encoder_model.save('encoder_model.h5')
    decoder_model.save('decoder_model.h5')

5. Demo

if __name__ == '__main__':
    intro=input("select train model or test model:")
    if intro=="train":
        print("Training mode...........")
        train()
    else:
        print("Test mode.........")
        while True:
            test()
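
test() is called above but its body is not shown in this excerpt. A minimal sketch, assuming the saved inference models and the dictionaries from section 1 are available (file names follow the save calls in section 4), could look like this:

def test():
    #Sketch only: load the saved inference models and translate one typed sentence
    encoder_model = load_model('encoder_model.h5')
    decoder_model = load_model('decoder_model.h5')
    sentence = input("Enter an English sentence: ")
    #One-hot encode the sentence with the same character dictionary as in section 1
    input_seq = np.zeros((1, INUPT_LENGTH, num_encoder_tokens))
    for t, char in enumerate(sentence[:INUPT_LENGTH]):
        if char in input_token_index:
            input_seq[0, t, input_token_index[char]] = 1.0
    print(decode_sequence(input_seq, encoder_model, decoder_model))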

 

 

The training data consisted of 3000 sentence pairs, most of them short phrases or single words. The results are not great, but still better than my own poor English.

Reference:

https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

https://towardsdatascience.com/neural-machine-translation-using-seq2seq-with-keras-c23540453c74

 
