Transformer Principles and Code Interpretation (2)
Brief introduction
In the previous article we introduced the principles of the Transformer. In this article, we walk through the implementation of each step in detail, using Google's open-source Transformer code.
Code repository
We mainly use the V2 version of the Transformer implementation.
Attention module
The Transformer is mainly divided into an encoder and a decoder. Both the encoder and the decoder contain a multi-head attention module, and the decoder additionally has a masked self-attention module.
First, let's look at the code for the attention module.
```python
class Attention(tf.keras.layers.Layer):
  """Multi-headed attention layer."""

  def __init__(self, hidden_size, num_heads, attention_dropout):
    """Initialize Attention.

    Args:
      hidden_size: int, output dim of hidden layer.
      num_heads: int, number of heads to repeat the same attention structure.
      attention_dropout: float, dropout rate inside attention for training.
    """
    if hidden_size % num_heads:
      raise ValueError(
          "Hidden size ({}) must be divisible by the number of heads ({})."
          .format(hidden_size, num_heads))

    super(Attention, self).__init__()
    self.hidden_size = hidden_size
    self.num_heads = num_heads
    self.attention_dropout = attention_dropout

  def build(self, input_shape):
    """Builds the layer."""
    # Layers for linearly projecting the queries, keys, and values.
    self.q_dense_layer = tf.keras.layers.Dense(
        self.hidden_size, use_bias=False, name="q")
    self.k_dense_layer = tf.keras.layers.Dense(
        self.hidden_size, use_bias=False, name="k")
    self.v_dense_layer = tf.keras.layers.Dense(
        self.hidden_size, use_bias=False, name="v")
    self.output_dense_layer = tf.keras.layers.Dense(
        self.hidden_size, use_bias=False, name="output_transform")
    super(Attention, self).build(input_shape)

  def get_config(self):
    return {
        "hidden_size": self.hidden_size,
        "num_heads": self.num_heads,
        "attention_dropout": self.attention_dropout,
    }

  def split_heads(self, x):
    """Split x into different heads, and transpose the resulting value.

    The tensor is transposed to insure the inner dimensions hold the correct
    values during the matrix multiplication.

    Args:
      x: A tensor with shape [batch_size, length, hidden_size]

    Returns:
      A tensor with shape [batch_size, num_heads, length, hidden_size/num_heads]
    """
    with tf.name_scope("split_heads"):
      batch_size = tf.shape(x)[0]
      length = tf.shape(x)[1]

      # Calculate depth of last dimension after it has been split.
      depth = (self.hidden_size // self.num_heads)

      # Split the last dimension
      x = tf.reshape(x, [batch_size, length, self.num_heads, depth])

      # Transpose the result
      return tf.transpose(x, [0, 2, 1, 3])

  def combine_heads(self, x):
    """Combine tensor that has been split.

    Args:
      x: A tensor [batch_size, num_heads, length, hidden_size/num_heads]

    Returns:
      A tensor with shape [batch_size, length, hidden_size]
    """
    with tf.name_scope("combine_heads"):
      batch_size = tf.shape(x)[0]
      length = tf.shape(x)[2]
      x = tf.transpose(x, [0, 2, 1, 3])  # --> [batch, length, num_heads, depth]
      return tf.reshape(x, [batch_size, length, self.hidden_size])

  def call(self, x, y, bias, training, cache=None):
    """Apply attention mechanism to x and y.

    Args:
      x: a tensor with shape [batch_size, length_x, hidden_size]
      y: a tensor with shape [batch_size, length_y, hidden_size]
      bias: attention bias that will be added to the result of the dot product.
      training: boolean, whether in training mode or not.
      cache: (Used during prediction) dictionary with tensors containing results
        of previous attentions. The dictionary must have the items:
            {"k": tensor with shape [batch_size, i, key_channels],
             "v": tensor with shape [batch_size, i, value_channels]}
        where i is the current decoded length.

    Returns:
      Attention layer output with shape [batch_size, length_x, hidden_size]
    """
    # Linearly project the query (q), key (k) and value (v) using different
    # learned projections. This is in preparation of splitting them into
    # multiple heads. Multi-head attention uses multiple queries, keys, and
    # values rather than regular attention (which uses a single q, k, v).
    q = self.q_dense_layer(x)
    k = self.k_dense_layer(y)
    v = self.v_dense_layer(y)

    if cache is not None:
      # Combine cached keys and values with new keys and values.
      k = tf.concat([tf.cast(cache["k"], k.dtype), k], axis=1)
      v = tf.concat([tf.cast(cache["v"], k.dtype), v], axis=1)

      # Update cache
      cache["k"] = k
      cache["v"] = v

    # Split q, k, v into heads.
    q = self.split_heads(q)
    k = self.split_heads(k)
    v = self.split_heads(v)

    # Scale q to prevent the dot product between q and k from growing too large.
    depth = (self.hidden_size // self.num_heads)
    q *= depth ** -0.5

    # Calculate dot product attention
    logits = tf.matmul(q, k, transpose_b=True)
    logits += bias
    # Note that softmax internally performs math operations using float32
    # for numeric stability. When training with float16, we keep the input
    # and output in float16 for better performance.
    weights = tf.nn.softmax(logits, name="attention_weights")
    if training:
      weights = tf.nn.dropout(weights, rate=self.attention_dropout)
    attention_output = tf.matmul(weights, v)

    # Recombine heads --> [batch_size, length, hidden_size]
    attention_output = self.combine_heads(attention_output)

    # Run the combined outputs through another linear projection layer.
    attention_output = self.output_dense_layer(attention_output)
    return attention_output
```
In the Transformer code, the Attention class is the base attention class, on top of which the various attention mechanisms are built. This class inherits from the Keras Layer in TensorFlow. The layer is defined by three parameters:
- hidden_size
- num_heads (the number of attention heads)
- attention_dropout (the dropout rate applied inside attention during training)
The layer accepts two inputs x and y, with shapes [batch_size, length_x, hidden_size] and [batch_size, length_y, hidden_size] respectively.
- First, the queries are computed from x and the keys and values from y. Note that in order to compute the attention of all heads with one matrix multiplication, the $W^q$, $W^k$, $W^v$ matrices of all heads are merged into q_dense_layer, k_dense_layer and v_dense_layer.
- Then split_heads is applied to obtain the $q$, $k$, $v$ of each head.
- $q$ is divided by $\sqrt{d}$, where $d$ is the per-head depth (hidden_size / num_heads).
- The attention score matrix is computed for each head separately.
- The bias is added to each attention score matrix before the softmax; its main purpose is to mask out certain positions.
- Each head's attention weights are multiplied with its values to obtain the per-head output.
- The outputs of all heads are concatenated by combine_heads.
- The concatenated output is projected back to hidden_size through a final linear layer (output_dense_layer). A minimal shape walkthrough of this whole flow is sketched below.
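To make the shapes concrete, here is a minimal self-contained sketch of the same flow (the sizes and the single merged Dense projection below are illustrative assumptions, not the library's own code):

```python
import tensorflow as tf

# Hypothetical sizes for illustration only.
batch_size, length, hidden_size, num_heads = 2, 5, 16, 4
depth = hidden_size // num_heads                      # 4 per head

x = tf.random.normal([batch_size, length, hidden_size])

# One merged projection per role stands in for the per-head W^q / W^k / W^v.
wq = tf.keras.layers.Dense(hidden_size, use_bias=False)
q = wq(x)                                             # [2, 5, 16]

# split_heads: [batch, length, hidden] -> [batch, heads, length, depth]
q = tf.transpose(tf.reshape(q, [batch_size, length, num_heads, depth]),
                 [0, 2, 1, 3])                        # [2, 4, 5, 4]

k = v = q                                             # self-attention: same source
logits = tf.matmul(q * depth ** -0.5, k, transpose_b=True)  # [2, 4, 5, 5]
weights = tf.nn.softmax(logits)                       # one score matrix per head
out = tf.matmul(weights, v)                           # [2, 4, 5, 4]

# combine_heads: back to [batch, length, hidden_size]
out = tf.reshape(tf.transpose(out, [0, 2, 1, 3]),
                 [batch_size, length, hidden_size])   # [2, 5, 16]
```

The point to notice is that a single [hidden_size, hidden_size] projection followed by a reshape is equivalent to applying a separate small projection per head, which is why the code can compute all heads at once.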
Self-Attention
The implementation of self-attention is in the same file mentioned above; the only difference is that there is no longer a distinction between the two inputs x and y, since both are the same tensor.
```python
class SelfAttention(Attention):
  """Multiheaded self-attention layer."""

  def call(self, x, bias, training, cache=None):
    return super(SelfAttention, self).call(x, x, bias, training, cache)
```
FFN
The code for the FFN network shows clearly that it consists of two consecutive linear transformations, with a ReLU activation after the first one.
```python
class FeedForwardNetwork(tf.keras.layers.Layer):
  """Fully connected feedforward network."""

  def __init__(self, hidden_size, filter_size, relu_dropout):
    """Initialize FeedForwardNetwork.

    Args:
      hidden_size: int, output dim of hidden layer.
      filter_size: int, filter size for the inner (first) dense layer.
      relu_dropout: float, dropout rate for training.
    """
    super(FeedForwardNetwork, self).__init__()
    self.hidden_size = hidden_size
    self.filter_size = filter_size
    self.relu_dropout = relu_dropout

  def build(self, input_shape):
    self.filter_dense_layer = tf.keras.layers.Dense(
        self.filter_size,
        use_bias=True,
        activation=tf.nn.relu,
        name="filter_layer")
    self.output_dense_layer = tf.keras.layers.Dense(
        self.hidden_size, use_bias=True, name="output_layer")
    super(FeedForwardNetwork, self).build(input_shape)

  def get_config(self):
    return {
        "hidden_size": self.hidden_size,
        "filter_size": self.filter_size,
        "relu_dropout": self.relu_dropout,
    }

  def call(self, x, training):
    """Return outputs of the feedforward network.

    Args:
      x: tensor with shape [batch_size, length, hidden_size]
      training: boolean, whether in training mode or not.

    Returns:
      Output of the feedforward network.
      tensor with shape [batch_size, length, hidden_size]
    """
    # Retrieve dynamically known shapes
    batch_size = tf.shape(x)[0]
    length = tf.shape(x)[1]

    output = self.filter_dense_layer(x)
    if training:
      output = tf.nn.dropout(output, rate=self.relu_dropout)
    output = self.output_dense_layer(output)

    return output
```
Add & Norm
Because every attention and FFN sublayer is accompanied by an Add & Norm step, this part of the code is implemented as a wrapper layer. Note that in this implementation the wrapper first applies layer normalization to the input, then passes the normalized input through the wrapped layer, applies dropout to the layer output during training, and finally adds the original input back as a residual connection.
```python
class PrePostProcessingWrapper(tf.keras.layers.Layer):
  """Wrapper class that applies layer pre-processing and post-processing."""

  def __init__(self, layer, params):
    super(PrePostProcessingWrapper, self).__init__()
    self.layer = layer
    self.params = params
    self.postprocess_dropout = params["layer_postprocess_dropout"]

  def build(self, input_shape):
    # Create normalization layer
    self.layer_norm = LayerNormalization(self.params["hidden_size"])
    super(PrePostProcessingWrapper, self).build(input_shape)

  def get_config(self):
    return {
        "params": self.params,
    }

  def call(self, x, *args, **kwargs):
    """Calls wrapped layer with same parameters."""
    # Preprocessing: apply layer normalization
    training = kwargs["training"]

    y = self.layer_norm(x)

    # Get layer output
    y = self.layer(y, *args, **kwargs)

    # Postprocessing: apply dropout and residual connection
    if training:
      y = tf.nn.dropout(y, rate=self.postprocess_dropout)
    return x + y
```
mask bias in attention
We mentioned earlier that we can mask a particular position by adding negative infinity to that position of the attention score matrix before the softmax.
padding mask for input and output
Because samples within a batch have different lengths, we zero-pad them to the same length; the corresponding attention bias then sets the padded positions to negative infinity. The specific code is as follows:
```python
def get_padding(x, padding_value=0, dtype=tf.float32):
  """Return float tensor representing the padding values in x.

  Args:
    x: int tensor with any shape
    padding_value: int value that
    dtype: The dtype of the return value.

  Returns:
    float tensor with same shape as x containing values 0 or 1.
      0 -> non-padding, 1 -> padding
  """
  with tf.name_scope("padding"):
    return tf.cast(tf.equal(x, padding_value), dtype)


def get_padding_bias(x):
  """Calculate bias tensor from padding values in tensor.

  Bias tensor that is added to the pre-softmax multi-headed attention logits,
  which has shape [batch_size, num_heads, length, length]. The tensor is zero
  at non-padding locations, and -1e9 (negative infinity) at padding locations.

  Args:
    x: int tensor with shape [batch_size, length]

  Returns:
    Attention bias tensor of shape [batch_size, 1, 1, length].
  """
  with tf.name_scope("attention_bias"):
    padding = get_padding(x)
    attention_bias = padding * _NEG_INF_FP32
    attention_bias = tf.expand_dims(
        tf.expand_dims(attention_bias, axis=1), axis=1)
  return attention_bias
```
- Since all samples in each batch are zero-padded, get_padding first checks whether each position equals 0. If it does, the position is marked with 1 (padding); otherwise it is marked with 0 (not padding).
- All positions are then multiplied by negative infinity, so padded positions become negative infinity and non-padded positions stay zero, which has no effect on the softmax probabilities.
- Note that the original input x has shape [batch_size, length]. After the two expand_dims operations, the bias has shape [batch_size, 1, 1, length], while the attention score matrix has shape [batch_size, num_heads, length, length]. When they are added, the bias is broadcast: within the same sample, every head's attention score matrix has the same length-sized vector added to each of its rows (a one-dimensional vector broadcast over a two-dimensional matrix), as the small demo below shows.
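A toy run of the same logic (the token ids below are made up; 0 is assumed to be the padding id, as in the library):

```python
import tensorflow as tf

_NEG_INF_FP32 = -1e9

# Toy batch: two sequences of length 4, 0 is the padding id.
x = tf.constant([[7, 3, 0, 0],
                 [5, 0, 0, 0]])

padding = tf.cast(tf.equal(x, 0), tf.float32)        # 1 at padded positions
bias = padding * _NEG_INF_FP32                       # [2, 4]
bias = tf.expand_dims(tf.expand_dims(bias, 1), 1)    # [2, 1, 1, 4]

# Fake logits for 2 heads: [batch, num_heads, length, length]
logits = tf.zeros([2, 2, 4, 4])
masked = logits + bias                               # bias broadcasts over heads and rows
print(tf.nn.softmax(masked)[0, 0, 0])                # padded keys get ~0 probability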
bias in masked self-attention
In the decoder, prediction is done from left to right: the output at the current step is determined by the previous outputs and the current input. So each query should only attend to the values at or before the current position, never to later ones. The code for this part likewise sets the corresponding positions to negative infinity.
```python
def get_decoder_self_attention_bias(length, dtype=tf.float32):
  """Calculate bias for decoder that maintains model's autoregressive property.

  Creates a tensor that masks out locations that correspond to illegal
  connections, so prediction at position i cannot draw information from future
  positions.

  Args:
    length: int length of sequences in batch.
    dtype: The dtype of the return value.

  Returns:
    float tensor of shape [1, 1, length, length]
  """
  neg_inf = _NEG_INF_FP16 if dtype == tf.float16 else _NEG_INF_FP32
  with tf.name_scope("decoder_self_attention_bias"):
    valid_locs = tf.linalg.band_part(tf.ones([length, length], dtype=dtype),
                                     -1, 0)
    valid_locs = tf.reshape(valid_locs, [1, 1, length, length])
    decoder_bias = neg_inf * (1.0 - valid_locs)
  return decoder_bias
```
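For intuition, this is what the mask looks like for a toy length of 3 (a standalone sketch; -1e9 stands in for negative infinity, as in the library constants, and the [1, 1, length, length] reshape is omitted):

```python
import tensorflow as tf

length = 3
# Lower-triangular matrix: position i may attend to positions 0..i.
valid_locs = tf.linalg.band_part(tf.ones([length, length]), -1, 0)
bias = -1e9 * (1.0 - valid_locs)
print(bias.numpy())
# bias is 0 on and below the diagonal and -1e9 above it:
# [[ 0, -1e9, -1e9],
#  [ 0,    0, -1e9],
#  [ 0,    0,    0]]
```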
EncoderStack
Let's take a look at the code for the multi-layer encoder.
```python
class EncoderStack(tf.keras.layers.Layer):
  """Transformer encoder stack.

  The encoder stack is made up of N identical layers. Each layer is composed
  of the sublayers:
    1. Self-attention layer
    2. Feedforward network (which is 2 fully-connected layers)
  """

  def __init__(self, params):
    super(EncoderStack, self).__init__()
    self.params = params
    self.layers = []

  def build(self, input_shape):
    """Builds the encoder stack."""
    params = self.params
    for _ in range(params["num_hidden_layers"]):
      # Create sublayers for each layer.
      self_attention_layer = attention_layer.SelfAttention(
          params["hidden_size"], params["num_heads"],
          params["attention_dropout"])
      feed_forward_network = ffn_layer.FeedForwardNetwork(
          params["hidden_size"], params["filter_size"], params["relu_dropout"])

      self.layers.append([
          PrePostProcessingWrapper(self_attention_layer, params),
          PrePostProcessingWrapper(feed_forward_network, params)
      ])

    # Create final layer normalization layer.
    self.output_normalization = LayerNormalization(params["hidden_size"])
    super(EncoderStack, self).build(input_shape)

  def get_config(self):
    return {
        "params": self.params,
    }

  def call(self, encoder_inputs, attention_bias, inputs_padding, training):
    """Return the output of the encoder layer stacks.

    Args:
      encoder_inputs: tensor with shape [batch_size, input_length, hidden_size]
      attention_bias: bias for the encoder self-attention layer.
        [batch_size, 1, 1, input_length]
      inputs_padding: tensor with shape [batch_size, input_length], inputs with
        zero paddings.
      training: boolean, whether in training mode or not.

    Returns:
      Output of encoder layer stack.
      float32 tensor with shape [batch_size, input_length, hidden_size]
    """
    for n, layer in enumerate(self.layers):
      # Run inputs through the sublayers.
      self_attention_layer = layer[0]
      feed_forward_network = layer[1]

      with tf.name_scope("layer_%d" % n):
        with tf.name_scope("self_attention"):
          encoder_inputs = self_attention_layer(
              encoder_inputs, attention_bias, training=training)
        with tf.name_scope("ffn"):
          encoder_inputs = feed_forward_network(
              encoder_inputs, training=training)

    return self.output_normalization(encoder_inputs)
```
Each layer is very simple: self-attention followed by an FFN layer. Note that the stack also applies a final output_normalization after the last layer. Both the encoder and decoder stacks are configured through a params dictionary; a hypothetical example is sketched below.
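The sketch below shows a hypothetical params dict using the standard base-model sizes; the real code assembles this dictionary from its own default model parameters, so treat the concrete values here as an assumption for illustration:

```python
# Hypothetical params dict, using the standard "base" Transformer sizes.
params = {
    "num_hidden_layers": 6,          # N identical layers in the stack
    "hidden_size": 512,              # model dimension
    "num_heads": 8,                  # attention heads
    "filter_size": 2048,             # inner FFN dimension
    "attention_dropout": 0.1,
    "relu_dropout": 0.1,
    "layer_postprocess_dropout": 0.1,
}

encoder = EncoderStack(params)       # sublayers are created lazily in build()
```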
DecoderStack
The decoder stack code is also very simple. In each layer, the target sequence first goes through masked self-attention, then multi-head attention over the encoder outputs, and finally an FFN layer.
```python
class DecoderStack(tf.keras.layers.Layer):
  """Transformer decoder stack.

  Like the encoder stack, the decoder stack is made up of N identical layers.
  Each layer is composed of the sublayers:
    1. Self-attention layer
    2. Multi-headed attention layer combining encoder outputs with results from
       the previous self-attention layer.
    3. Feedforward network (2 fully-connected layers)
  """

  def __init__(self, params):
    super(DecoderStack, self).__init__()
    self.params = params
    self.layers = []

  def build(self, input_shape):
    """Builds the decoder stack."""
    params = self.params
    for _ in range(params["num_hidden_layers"]):
      self_attention_layer = attention_layer.SelfAttention(
          params["hidden_size"], params["num_heads"],
          params["attention_dropout"])
      enc_dec_attention_layer = attention_layer.Attention(
          params["hidden_size"], params["num_heads"],
          params["attention_dropout"])
      feed_forward_network = ffn_layer.FeedForwardNetwork(
          params["hidden_size"], params["filter_size"], params["relu_dropout"])

      self.layers.append([
          PrePostProcessingWrapper(self_attention_layer, params),
          PrePostProcessingWrapper(enc_dec_attention_layer, params),
          PrePostProcessingWrapper(feed_forward_network, params)
      ])
    self.output_normalization = LayerNormalization(params["hidden_size"])
    super(DecoderStack, self).build(input_shape)

  def get_config(self):
    return {
        "params": self.params,
    }

  def call(self,
           decoder_inputs,
           encoder_outputs,
           decoder_self_attention_bias,
           attention_bias,
           training,
           cache=None):
    """Return the output of the decoder layer stacks.

    Args:
      decoder_inputs: tensor with shape [batch_size, target_length, hidden_size]
      encoder_outputs: tensor with shape [batch_size, input_length, hidden_size]
      decoder_self_attention_bias: bias for decoder self-attention layer.
        [1, 1, target_len, target_length]
      attention_bias: bias for encoder-decoder attention layer.
        [batch_size, 1, 1, input_length]
      training: boolean, whether in training mode or not.
      cache: (Used for fast decoding) A nested dictionary storing previous
        decoder self-attention values. The items are:
          {layer_n: {"k": tensor with shape [batch_size, i, key_channels],
                     "v": tensor with shape [batch_size, i, value_channels]},
                       ...}

    Returns:
      Output of decoder layer stack.
      float32 tensor with shape [batch_size, target_length, hidden_size]
    """
    for n, layer in enumerate(self.layers):
      self_attention_layer = layer[0]
      enc_dec_attention_layer = layer[1]
      feed_forward_network = layer[2]

      # Run inputs through the sublayers.
      layer_name = "layer_%d" % n
      layer_cache = cache[layer_name] if cache is not None else None
      with tf.name_scope(layer_name):
        with tf.name_scope("self_attention"):
          decoder_inputs = self_attention_layer(
              decoder_inputs,
              decoder_self_attention_bias,
              training=training,
              cache=layer_cache)
        with tf.name_scope("encdec_attention"):
          decoder_inputs = enc_dec_attention_layer(
              decoder_inputs,
              encoder_outputs,
              attention_bias,
              training=training)
        with tf.name_scope("ffn"):
          decoder_inputs = feed_forward_network(
              decoder_inputs, training=training)

    return self.output_normalization(decoder_inputs)
```
Encode
The code for the whole encode step maps the original inputs to input embeddings, adds positional encodings, and passes the result through the EncoderStack.
```python
def encode(self, inputs, attention_bias, training):
  """Generate continuous representation for inputs.

  Args:
    inputs: int tensor with shape [batch_size, input_length].
    attention_bias: float tensor with shape [batch_size, 1, 1, input_length].
    training: boolean, whether in training mode or not.

  Returns:
    float tensor with shape [batch_size, input_length, hidden_size]
  """
  with tf.name_scope("encode"):
    # Prepare inputs to the layer stack by adding positional encodings and
    # applying dropout.
    embedded_inputs = self.embedding_softmax_layer(inputs)
    embedded_inputs = tf.cast(embedded_inputs, self.params["dtype"])
    inputs_padding = model_utils.get_padding(inputs)
    attention_bias = tf.cast(attention_bias, self.params["dtype"])

    with tf.name_scope("add_pos_encoding"):
      length = tf.shape(embedded_inputs)[1]
      pos_encoding = model_utils.get_position_encoding(
          length, self.params["hidden_size"])
      pos_encoding = tf.cast(pos_encoding, self.params["dtype"])
      encoder_inputs = embedded_inputs + pos_encoding

    if training:
      encoder_inputs = tf.nn.dropout(
          encoder_inputs, rate=self.params["layer_postprocess_dropout"])

    return self.encoder_stack(
        encoder_inputs, attention_bias, inputs_padding, training=training)
```
The encode step accepts token ids of shape [batch_size, length], and the input embeddings are obtained through embedding_softmax_layer, which can be understood as a mapping matrix of shape [vocab_size, hidden_size]. It is defined as:
```python
self.embedding_softmax_layer = embedding_layer.EmbeddingSharedWeights(
    params["vocab_size"], params["hidden_size"], dtype=params["dtype"])
```
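Conceptually, the shared-weight embedding layer works in two modes: "embedding" (ids to vectors) and "linear" (hidden states to vocabulary logits, reusing the same matrix). The following is a simplified sketch of that idea, not the library class itself:

```python
import tensorflow as tf

class SharedEmbeddingSketch(tf.keras.layers.Layer):
  """Simplified sketch: one weight matrix used both as the input embedding
  and (transposed) as the pre-softmax output projection."""

  def __init__(self, vocab_size, hidden_size):
    super().__init__()
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size

  def build(self, input_shape):
    # Single [vocab_size, hidden_size] matrix shared by both modes.
    self.shared_weights = self.add_weight(
        "weights", shape=[self.vocab_size, self.hidden_size])
    super().build(input_shape)

  def call(self, inputs, mode="embedding"):
    if mode == "embedding":
      # Token ids [batch, length] -> embeddings [batch, length, hidden_size].
      return tf.gather(self.shared_weights, inputs)
    if mode == "linear":
      # Hidden states [batch, length, hidden_size] -> logits
      # [batch, length, vocab_size] via the transposed embedding matrix.
      batch_size = tf.shape(inputs)[0]
      length = tf.shape(inputs)[1]
      x = tf.reshape(inputs, [-1, self.hidden_size])
      logits = tf.matmul(x, self.shared_weights, transpose_b=True)
      return tf.reshape(logits, [batch_size, length, self.vocab_size])
    raise ValueError("mode must be 'embedding' or 'linear'")
```

Sharing one matrix between the input embedding and the pre-softmax projection is what allows decode below to call the same layer again with mode="linear".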
Decode
The decode code is similar to the encode part: the targets are first mapped to output embeddings, but the decode step shifts the output embeddings one position to the right, then passes them through the DecoderStack, and finally uses the same embedding_softmax_layer to obtain the logits over the vocabulary.
Note that the output of the DecoderStack has shape [batch_size, target_length, hidden_size]. After multiplying it by the [vocab_size, hidden_size] matrix inside embedding_softmax_layer (in "linear" mode), the output has shape [batch_size, target_length, vocab_size], and prediction can then proceed from the per-position distribution over the vocabulary.
```python
def decode(self, targets, encoder_outputs, attention_bias, training):
  """Generate logits for each value in the target sequence.

  Args:
    targets: target values for the output sequence. int tensor with shape
      [batch_size, target_length]
    encoder_outputs: continuous representation of input sequence. float tensor
      with shape [batch_size, input_length, hidden_size]
    attention_bias: float tensor with shape [batch_size, 1, 1, input_length]
    training: boolean, whether in training mode or not.

  Returns:
    float32 tensor with shape [batch_size, target_length, vocab_size]
  """
  with tf.name_scope("decode"):
    # Prepare inputs to decoder layers by shifting targets, adding positional
    # encoding and applying dropout.
    decoder_inputs = self.embedding_softmax_layer(targets)
    decoder_inputs = tf.cast(decoder_inputs, self.params['dtype'])
    attention_bias = tf.cast(attention_bias, self.params["dtype"])
    with tf.name_scope("shift_targets"):
      # Shift targets to the right, and remove the last element
      decoder_inputs = tf.pad(decoder_inputs,
                              [[0, 0], [1, 0], [0, 0]])[:, :-1, :]
    with tf.name_scope("add_pos_encoding"):
      length = tf.shape(decoder_inputs)[1]
      pos_encoding = model_utils.get_position_encoding(
          length, self.params["hidden_size"])
      pos_encoding = tf.cast(pos_encoding, self.params["dtype"])
      decoder_inputs += pos_encoding
    if training:
      decoder_inputs = tf.nn.dropout(
          decoder_inputs, rate=self.params["layer_postprocess_dropout"])

    # Run values
    decoder_self_attention_bias = model_utils.get_decoder_self_attention_bias(
        length, dtype=self.params['dtype'])
    outputs = self.decoder_stack(
        decoder_inputs,
        encoder_outputs,
        decoder_self_attention_bias,
        attention_bias,
        training=training)
    logits = self.embedding_softmax_layer(outputs, mode="linear")
    logits = tf.cast(logits, tf.float32)
    return logits
```
Why shift right by one position?
If we fed the output embeddings directly into the network, the decoder would simply learn to copy its input: the decoder's i-th target token would be identical to its i-th input token, whereas the decoder should predict the i-th target token based only on the input tokens before position i. Shifting the decoder's input tokens one position to the right achieves exactly this, as the small demo below shows.
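As a tiny illustration of the tf.pad trick used in shift_targets above (the concrete numbers are made up):

```python
import tensorflow as tf

# Toy embedded targets: batch of 1, 4 positions, hidden size 2
# (values are made up just to show which row ends up where).
targets = tf.constant([[[1., 1.], [2., 2.], [3., 3.], [4., 4.]]])

# Pad one all-zero position at the front of the time axis,
# then drop the last position: [t1, t2, t3, t4] -> [0, t1, t2, t3].
shifted = tf.pad(targets, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]
print(shifted.numpy()[0])
# [[0. 0.]
#  [1. 1.]
#  [2. 2.]
#  [3. 3.]]
# Position i of the decoder input now holds target i-1, so the model must
# predict target i from strictly earlier targets plus the encoder output.
```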