TensorFlow|Transformer-based Natural Language Inference (SNLI)

Keywords: encoding github

After reading the paper, its source code, and the BERT source code, I sorted out my ideas, implemented a Transformer, and built a small Transformer model for the SNLI task.

1. Transformer

I won't repeat the theory here; other blogs already explain it well.

For example: https://jalammar.github.io/illustrated-transformer/

And a translation of it: https://blog.csdn.net/qq_41664845/article/details/84969266

Let's go straight to the code.

1.1 Activation Function

The original Transformer used ReLU, but BERT and later work mostly use GELU (Gaussian Error Linear Unit), which reportedly works better (I am only going by the comparisons in the paper; I have not tested this myself).

I set the default activation function to GELU even though ReLU is what I normally use, on the principle of picking the better candidate regardless of personal attachment.

Original paper on Gelu: https://arxiv.org/abs/1606.08415

Gelu:

import math

import numpy as np
import tensorflow as tf


def gelu(inputs):
    """
    gelu: https://arxiv.org/abs/1606.08415
    :param inputs: [Tensor]
    :return: [Tensor] outputs after activation
    """
    # tanh approximation of the Gaussian CDF
    cdf = 0.5 * (1.0 + tf.tanh(tf.sqrt(2 / np.pi) * (inputs + 0.044715 * tf.pow(inputs, 3))))
    return inputs * cdf
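To get a feel for how close the tanh formula is to the exact definition, here is a small dependency-free sketch in plain Python (not TensorFlow). The function names gelu_tanh, gelu_exact and relu are my own and only serve this illustration.

```python
import math


def gelu_tanh(x):
    """Tanh approximation of GELU, with the same constants as gelu() above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_exact(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    return max(0.0, x)


# Print ReLU, exact GELU and the tanh approximation side by side.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, relu(x), round(gelu_exact(x), 4), round(gelu_tanh(x), 4))
```

Unlike ReLU, GELU lets small negative inputs through with a small negative output instead of cutting them to exactly zero.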

Getting the activation function by name (GELU is the default):

def get_activation(activation_name):
    """
    get activation function
    :param activation_name: [String] name of the activation function; None selects gelu
    :return: [Function] activation function
    """
    if activation_name is None:
        return gelu
    else:
        act = activation_name.lower()
        if act == "relu":
            return tf.nn.relu
        elif act == "gelu":
            return gelu
        elif act == "tanh":
            return tf.tanh
        else:
            raise ValueError("Unsupported activation: %s" % act)

1.2 Embedding

In addition to the word embedding, the Transformer applies a positional encoding so that each word carries position information. Without it, the model is essentially a more elaborate bag-of-words model that only learns a weight for each word.

Tasks such as SNLI need a final output of fixed shape, so I borrow BERT's idea: a [CLS] token is added at the start of each input, and the prediction is made from the final output at that token. A segment embedding is also added to help the model distinguish the two sentences (see BERT).

1.2.1 Word Embedding

The embedding matrix can be randomly initialized, or loaded from word vectors trained on other tasks (such as GloVe or fastText) by declaring it when restoring variables. The paper says the embeddings should be scaled (multiplied by the square root of the embedding size), which is done here.

def get_embedding(inputs, vocab_size, channels, scale=True, scope="embedding", reuse=None):
    """
    embedding
    :param inputs: [Tensor] Tensor with first dimension of "batch_size"
    :param vocab_size: [Int] Vocabulary size
    :param channels: [Int] Embedding size
    :param scale: [Boolean] If True, the output will be multiplied by sqrt(channels)
    :param scope: [String] name of "variable_scope"
    :param reuse: [Boolean] tf parameter reuse
    :return: [Tensor] outputs of embedding of sentence with shape of "batch_size * length * channels"
    """
    with tf.variable_scope(scope, reuse=reuse):
        lookup_table = tf.get_variable('lookup_table',
                                       dtype=tf.float32,
                                       shape=[vocab_size, channels],
                                       initializer=tf.contrib.layers.xavier_initializer())
        # Force row 0 to zeros so that the [PAD] token (id 0) always maps to a zero vector
        lookup_table = tf.concat((tf.zeros(shape=[1, channels], dtype=tf.float32),
                                  lookup_table[1:, :]), 0)

        outputs = tf.nn.embedding_lookup(lookup_table, inputs)

        if scale:
            outputs = outputs * math.sqrt(channels)

    return outputs
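The two ideas in get_embedding (a zero row for [PAD] and scaling by the square root of the embedding size) can be sketched without TensorFlow. This is only an illustration in plain Python: build_table and embed are hypothetical helpers, and the uniform random initialization is a stand-in for the Xavier initializer, not a reimplementation of it.

```python
import math
import random


def build_table(vocab_size, channels, seed=0):
    """Random lookup table whose row 0 ([PAD]) is forced to zeros."""
    rng = random.Random(seed)
    table = [[rng.uniform(-0.1, 0.1) for _ in range(channels)]
             for _ in range(vocab_size)]
    table[0] = [0.0] * channels
    return table


def embed(token_ids, table, scale=True):
    """Look up each id and optionally multiply by sqrt(channels)."""
    channels = len(table[0])
    factor = math.sqrt(channels) if scale else 1.0
    return [[v * factor for v in table[i]] for i in token_ids]


table = build_table(vocab_size=10, channels=4)
vectors = embed([3, 7, 0, 0], table)  # the last two ids are [PAD]
```

Scaling a zero vector leaves it zero, so padded positions stay recognizable after this step.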

1.2.2 Position Embedding

This function returns an embedding with the same shape as the output of the word embedding. It takes the token ids as input rather than the word embeddings, which makes the later masking easier.

def get_positional_encoding(inputs, channels, scale=False, scope="positional_embedding", reuse=None):
    """
    positional encoding
    :param inputs: [Tensor] with dimension of "batch_size * max_length"
    :param channels: [Int] Embedding size
    :param scale: [Boolean] If True, the output will be multiplied by sqrt(channels)
    :param scope: [String] name of "variable_scope"
    :param reuse: [Boolean] tf parameter reuse
    :return: [Tensor] outputs after positional encoding
    """
    batch_size = tf.shape(inputs)[0]
    max_length = tf.shape(inputs)[1]
    with tf.variable_scope(scope, reuse=reuse):
        position_ind = tf.tile(tf.expand_dims(tf.range(tf.to_int32(1), tf.add(max_length, 1)), 0), [batch_size, 1])

        # Convert to a tensor
        lookup_table = tf.convert_to_tensor(get_timing_signal_1d(max_length, channels))

        lookup_table = tf.concat((tf.zeros(shape=[1, channels]),
                                  lookup_table[:, :]), 0)
        position_inputs = tf.where(tf.equal(inputs, 0), tf.zeros_like(inputs), position_ind)

        outputs = tf.nn.embedding_lookup(lookup_table, position_inputs)

        if scale:
            outputs = outputs * math.sqrt(channels)

    return tf.cast(outputs, tf.float32)
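The position lookup above boils down to this: real tokens get 1-based positions, and [PAD] tokens (id 0) get position 0, which points at the all-zero row of the table. A minimal sketch in plain Python, where position_ids is a hypothetical helper and the token ids are made up:

```python
def position_ids(token_ids):
    """1-based positions for real tokens, 0 for [PAD] (token id 0)."""
    return [pos if tok != 0 else 0
            for pos, tok in enumerate(token_ids, start=1)]


print(position_ids([12, 5, 9, 0, 0]))  # [1, 2, 3, 0, 0]
```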

The [sentence length * embedding dimension] matrix is obtained from get_timing_signal_1d():

def get_timing_signal_1d(length, channels, min_timescale=1.0, max_timescale=1.0e4, start_index=0):
    """
    positional encoding Method
    :param length: [Int] max_length size
    :param channels: [Int] Embedding size
    :param min_timescale: [Float]
    :param max_timescale: [Float]
    :param start_index: [Int] index of first position
    :return: [Tensor] positional encoding of shape "length * channels"
    """
    position = tf.to_float(tf.range(start_index, length))
    num_timescales = channels // 2
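    # log(max/min) is positive, so inv_timescales decreases geometrically across channels
    # (from 1 down to 1e-4 with the defaults), matching the sinusoid frequencies in the paper
    log_timescale_increment = (math.log(float(max_timescale) / float(min_timescale)) /
                               (tf.to_float(num_timescales) - 1))
    inv_timescales = min_timescale * tf.exp(tf.to_float(tf.range(num_timescales)) * -log_timescale_increment)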

    scaled_time = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0)
    signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)
    signal = tf.pad(signal, [[0, 0], [0, tf.mod(channels, 2)]])
    return signal
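For readers who want to inspect the numbers, here is a plain-Python sketch of the sinusoidal signal, in the formulation where the frequencies decrease geometrically across channels. It keeps the layout of get_timing_signal_1d() (all sines first, then all cosines, then zero padding for an odd channel count). The name timing_signal is my own, and guarding against a single timescale with max(..., 1) is an assumption of this sketch.

```python
import math


def timing_signal(length, channels, min_timescale=1.0, max_timescale=1.0e4):
    """Return a length x channels list of sinusoidal position encodings."""
    num_timescales = channels // 2
    # Step between neighbouring timescales on a log scale.
    log_increment = (math.log(max_timescale / min_timescale)
                     / max(num_timescales - 1, 1))
    inv_timescales = [min_timescale * math.exp(-i * log_increment)
                      for i in range(num_timescales)]
    signal = []
    for pos in range(length):
        scaled = [pos * inv for inv in inv_timescales]
        row = [math.sin(s) for s in scaled] + [math.cos(s) for s in scaled]
        row += [0.0] * (channels % 2)  # one zero column when channels is odd
        signal.append(row)
    return signal


sig = timing_signal(length=5, channels=6)
```

Position 0 always yields zeros in the sine half and ones in the cosine half, which is a quick sanity check for any implementation.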

1.2.3 Segment Embedding

This embedding only serves to help the model distinguish the two input sentences. In practice, the [SEP] token alone did not give the model enough ability to tell the sentences apart: without the segment embedding, the model did not perform well.

For the [PAD] token, all embeddings (including segment and position) are set to zero vectors, so that masks can be derived from them in the attention layers later.

def get_seg_embedding(inputs, channels, order=1, scale=True, scope="seg_embedding", reuse=None):
    """
    segment embedding
    :param inputs: [Tensor] with first dimension of "batch_size"
    :param channels: [Int] Embedding size
    :param order: [Int] The position of the sentence in all sentences
    :param scale: [Boolean] If True, the output will be multiplied by sqrt(channels)
    :param scope: [String] name of "variable_scope"
    :param reuse: [Boolean] tf parameter reuse
    :return: [Tensor] outputs of embedding of sentence with shape of "batch_size * length * channels"
    """
    with tf.variable_scope(scope, reuse=reuse):
        lookup_table = tf.get_variable('lookup_table',
                                       dtype=tf.float32,
                                       shape=[3, channels],
                                       initializer=tf.contrib.layers.xavier_initializer())
        lookup_table = tf.concat((tf.zeros(shape=[1, channels], dtype=tf.float32),
                                  lookup_table[1:, :]), 0)
        seg_inputs = tf.where(tf.equal(inputs, 0), tf.zeros_like(inputs), tf.ones_like(inputs)*order)
        outputs = tf.nn.embedding_lookup(lookup_table, seg_inputs)
        if scale:
            outputs = outputs * math.sqrt(channels)

    return outputs
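The tf.where line in get_seg_embedding maps every real token of a sentence to that sentence's order and every [PAD] to 0. A plain-Python sketch of just that step, with a hypothetical helper segment_ids and made-up token ids:

```python
def segment_ids(token_ids, order):
    """`order` for real tokens, 0 for [PAD] (token id 0)."""
    return [order if tok != 0 else 0 for tok in token_ids]


premise = [101, 7, 8, 102, 0, 0]   # made-up ids, padded with 0
hypothesis = [9, 10, 102, 0]       # made-up ids, padded with 0

print(segment_ids(premise, order=1))     # [1, 1, 1, 1, 0, 0]
print(segment_ids(hypothesis, order=2))  # [2, 2, 2, 0]
```

This is why the lookup table has three rows: row 0 for [PAD], row 1 for the first sentence and row 2 for the second.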

1.3 Self-Attention and Encoder-Decoder Attention

With the inputs processed, we come to the key part: the attention mechanism.

The two input tensors are hard to describe in a one-line English docstring, so I explain them here. from_tensor is the layer's input in both kinds of attention. For self-attention, to_tensor is the same tensor as from_tensor; for encoder-decoder attention, to_tensor is the final output of the encoder, so that the attention captures the relationship between decoder and encoder.

Because all embeddings of a [PAD] token are zero vectors, a position whose values sum to zero along the channel dimension (checked via the absolute value of the sum) is a [PAD] token, so there is no need to pass additional mask ids as input.
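The mask derivation can be sketched in plain Python as follows. pad_mask is a hypothetical helper that mimics sign(abs(reduce_sum(x, axis=-1))) for a single sequence; the input vectors are made up.

```python
def pad_mask(embedded):
    """1.0 for positions with a non-zero channel sum, 0.0 otherwise."""
    mask = []
    for vec in embedded:
        total = abs(sum(vec))
        mask.append(1.0 if total > 0 else 0.0)
    return mask


sequence = [[0.5, -0.2], [1.0, 1.0], [0.0, 0.0]]  # last position is [PAD]
print(pad_mask(sequence))  # [1.0, 1.0, 0.0]
```

One caveat of this trick: a real token whose components happen to cancel exactly (for example [1.0, -1.0]) would also be treated as padding. With learned floating-point embeddings that should be rare, but it is worth keeping in mind.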

The implementation follows the description in the paper:

def multi_head_attention(from_tensor: tf.Tensor,  to_tensor: tf.Tensor, channels=None, num_units=None, num_heads=8,
                         dropout_rate=0, is_training=True, attention_mask_flag=False, scope="multihead_attention",
                         activation=None, reuse=None):
    """
    multihead attention
    :param from_tensor: [Tensor]
    :param to_tensor: [Tensor] 
    :param channels: [Int] channel of last dimension of output
    :param num_units: [Int] channel size of matrix Q, K, V
    :param num_heads: [Int] head number of attention
    :param dropout_rate: [Float] dropout rate when 0 means no dropout
    :param is_training: [Boolean] whether it is training, If true, use dropout
    :param attention_mask_flag: [Boolean] If true, units that reference the future are masked
    :param scope: [String] name of "variable_scope"
    :param activation: [String] name of activate function
    :param reuse: [Boolean] tf parameter reuse
    :return: [Tensor] outputs after multihead attention with shape of "batch_size * max_length * channels"
    """
    with tf.variable_scope(scope, reuse=reuse):
        if channels is None:
            channels = from_tensor.get_shape().as_list()[-1]
        if num_units is None:
            num_units = channels//num_heads
        activation_fn = get_activation(activation)
        # shape [batch_size, max_length, num_units*num_heads]
        query_layer = tf.layers.dense(from_tensor, num_units * num_heads, activation=activation_fn)
        key_layer = tf.layers.dense(to_tensor, num_units * num_heads, activation=activation_fn)
        value_layer = tf.layers.dense(to_tensor, num_units * num_heads, activation=activation_fn)

        # shape [batch_size*num_heads, max_length, num_units]
        query_layer_ = tf.concat(tf.split(query_layer, num_heads, axis=2), axis=0)
        key_layer_ = tf.concat(tf.split(key_layer, num_heads, axis=2), axis=0)
        value_layer_ = tf.concat(tf.split(value_layer, num_heads, axis=2), axis=0)

        # shape = [batch_size*num_heads, max_length, max_length]
        attention_scores = tf.matmul(query_layer_, tf.transpose(key_layer_, [0, 2, 1]))
        # Scale by sqrt(d_k), the per-head key dimension (num_units), as in the paper
        attention_scores = tf.multiply(attention_scores, 1.0 / math.sqrt(float(num_units)))
        # attention masks
        attention_masks = tf.sign(tf.abs(tf.reduce_sum(to_tensor, axis=-1)))
        attention_masks = tf.tile(attention_masks, [num_heads, 1])
        attention_masks = tf.tile(tf.expand_dims(attention_masks, axis=1), [1, tf.shape(from_tensor)[1], 1])
        neg_inf_matrix = tf.multiply(tf.ones_like(attention_scores), (-math.pow(2, 32) + 1))
        attention_scores = tf.where(tf.equal(attention_masks, 0), neg_inf_matrix, attention_scores)

        if attention_mask_flag:
            diag_vals = tf.ones_like(attention_scores[0, :, :])
            tril = tf.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense()

            masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(attention_scores)[0], 1, 1])
            neg_inf_matrix = tf.multiply(tf.ones_like(masks), (-math.pow(2, 32) + 1))
            attention_scores = tf.where(tf.equal(masks, 0), neg_inf_matrix, attention_scores)

        # attention probability
        attention_probs = tf.nn.softmax(attention_scores)

        # query mask
        query_masks = tf.sign(tf.abs(tf.reduce_sum(from_tensor, axis=-1)))
        query_masks = tf.tile(query_masks, [num_heads, 1])
        query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(to_tensor)[1]])

        attention_probs *= query_masks

        # dropout
        attention_probs = tf.layers.dropout(attention_probs, rate=dropout_rate,
                                            training=tf.convert_to_tensor(is_training))
        outputs = tf.matmul(attention_probs, value_layer_)
        # shape [batch_size, max_length, num_units*num_heads]
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2)

        # project to `channels` (must equal the last dimension of from_tensor for the residual)
        outputs = tf.layers.dense(outputs, channels, activation=activation_fn)
        # Residual connection
        outputs += from_tensor
        # layer normalization (implemented by group_norm below)
        outputs = group_norm(outputs)
    return outputs
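Stripped of the multi-head bookkeeping, the core of the function is scaled dot-product attention with a key padding mask and an optional causal mask. The sketch below handles a single head and a single sequence in plain Python. The names softmax and attention, and the toy vectors, are my own; it scales by the square root of the key size as in the paper, and it leaves out the projections, dropout, query mask, residual and normalization.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, key_mask, causal=False):
    """Single-head scaled dot-product attention for one sequence."""
    d_k = len(keys[0])
    neg_inf = -2.0 ** 32 + 1  # same large negative constant as above
    outputs, all_probs = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            s = sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
            # Hide padded keys, and future keys when causal masking is on.
            if key_mask[j] == 0 or (causal and j > i):
                s = neg_inf
            scores.append(s)
        probs = softmax(scores)
        out = [sum(p * v[c] for p, v in zip(probs, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_probs.append(probs)
    return outputs, all_probs


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # third key is [PAD]
v = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
out, probs = attention(q, k, v, key_mask=[1, 1, 0])
causal_out, causal_probs = attention(q, k, v, key_mask=[1, 1, 0], causal=True)
```

Adding a very large negative number before the softmax drives the masked probabilities to zero while the remaining ones still sum to one.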

1.4 Feed Forward

This is the position-wise feed-forward network from the paper. In the paper, the second layer has no (i.e. a linear) activation; setting the second layer's activation_fn to None reproduces the original paper. I did not do that here for experimental reasons.

def feed_forward(inputs, channels, hidden_dims=None, scope="feed_forward", activation=None, reuse=None):
    """
    :param inputs: [Tensor] with first dimension of "batch_size"
    :param channels: [Int] Embedding size
    :param hidden_dims: [Int] size of the hidden layer, defaults to 2*channels
    :param scope: [String] name of "variable_scope"
    :param activation: [String] name of activate function
    :param reuse: [Boolean] tf parameter reuse
    :return: [Tensor] outputs after feed forward with shape of "batch_size * max_length * channels"
    """
    if hidden_dims is None:
        hidden_dims = 2*channels
    with tf.variable_scope(scope, reuse=reuse):
        activation_fn = get_activation(activation)

        params = {"inputs": inputs, "num_outputs": hidden_dims, "activation_fn": activation_fn}
        outputs = tf.contrib.layers.fully_connected(**params)

        params = {"inputs": outputs, "num_outputs": channels, "activation_fn": activation_fn}  # activation_fn can be changed to None (as in the paper)
        outputs = tf.contrib.layers.fully_connected(**params)
        outputs += inputs
        outputs = group_norm(outputs)
    return outputs
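"Position-wise" means the same two dense layers are applied to each position's vector independently. Here is a plain-Python sketch for one position, using the paper's variant (ReLU in the first layer, linear second layer) plus the residual connection. The helpers dense and feed_forward_position and the tiny hand-picked weights are assumptions of this illustration; layer normalization is left out.

```python
def relu(v):
    return max(0.0, v)


def dense(x, weights, bias, activation=None):
    """weights holds one weight vector per output unit."""
    out = [sum(xi * w for xi, w in zip(x, unit)) + b
           for unit, b in zip(weights, bias)]
    return [activation(o) for o in out] if activation else out


def feed_forward_position(x, w1, b1, w2, b2):
    """Two-layer feed-forward network with a residual connection."""
    hidden = dense(x, w1, b1, activation=relu)
    out = dense(hidden, w2, b2, activation=None)  # linear second layer
    return [o + xi for o, xi in zip(out, x)]      # residual connection


w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 2 -> 3 hidden units
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # 3 -> 2 outputs
b2 = [0.5, 0.5]

result = feed_forward_position([1.0, -2.0], w1, b1, w2, b2)
print(result)  # [2.5, -1.5]
```

The output size must equal the input size, otherwise the residual addition is not defined.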

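1.5 Layer Normalization

Finally, there is layer normalization. Despite its name, group_norm below normalizes over the last (channel) dimension, which is layer normalization.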

def group_norm(inputs: tf.Tensor, epsilon=1e-8, scope="layer_normalization", reuse=None):
    """
    layer normalization
    :param inputs: [Tensor] with first dimension of "batch_size"
    :param epsilon: [Float] a number for preventing ZeroDivision
    :param scope: [String] name of "variable_scope"
    :param reuse: [Boolean] tf parameter reuse
    :return: [Tensor] outputs after normalized
    """
    with tf.variable_scope(scope, reuse=reuse):
        inputs_shape = inputs.get_shape()
        params_shape = inputs_shape[-1:]
        mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
        beta = tf.Variable(tf.zeros(params_shape))
        gamma = tf.Variable(tf.ones(params_shape))
        normalized = (inputs - mean) * tf.rsqrt(variance + epsilon)
        outputs = gamma * normalized + beta
    return outputs
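To check what the normalization does to a single vector, here is the same computation in plain Python. layer_norm is a hypothetical stand-alone helper; gamma and beta default to ones and zeros, which matches the initial values of the variables above.

```python
import math


def layer_norm(vec, gamma=None, beta=None, epsilon=1e-8):
    """Normalize one vector to zero mean and unit variance, then scale and shift."""
    n = len(vec)
    mean = sum(vec) / n
    var = sum((v - mean) ** 2 for v in vec) / n
    gamma = gamma or [1.0] * n
    beta = beta or [0.0] * n
    return [g * (v - mean) / math.sqrt(var + epsilon) + b
            for v, g, b in zip(vec, gamma, beta)]


normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
```

With the default gamma and beta, the result has a mean of (almost exactly) zero and a variance of (almost exactly) one; epsilon only matters when all components are equal.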

2. Not finished yet.

Posted by zimick on Mon, 06 May 2019 20:40:39 -0700
