Transformers – Attention is all you need

Transformers are becoming more and more important, not just in NLP but across other areas of deep learning beyond language. Google has rolled out BERT and other transformer-based models to power Google Search, and they call it one of the biggest leaps forward in the history of search.

In this blog, we’ll focus on one paper that started it all, Attention is all you need!

Below is the architecture of the model as mentioned in the paper.

At a high level, like any RNN-based or CNN-based sequence-to-sequence architecture, the Transformer is composed of an encoder and a decoder. The encoder converts the original input sequence into a latent representation in the form of hidden state vectors. The decoder then predicts the output sequence using this latent representation.

The input (source) and output (target) sequence embeddings are added to positional encodings before being fed into the encoder and the decoder. The encoding component is a stack of identical encoder layers and the decoding component is a stack of the same number of identical decoder layers; the paper used 6 layers, but we will use only 3 encoder and decoder layers here.

Each encoder consists of two sub-layers:

  • a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word.
  • a feed-forward neural network – the exact same feed-forward network is applied independently to the word at each position through its own path in the encoder, hence it is called a position-wise feed-forward neural network.

The decoder also has self-attention and feed-forward layers, but between them is another attention layer that helps the decoder focus on relevant parts of the input sentence.

Now that we’ve seen the high-level components of the model, we will dive deeper and understand each of these individual components in detail.

Data Processing

The first step in any machine learning pipeline is input data processing; the input in our case is a German sentence. However, computers don’t understand text – the only language they understand is that of matrices and numbers – so we have to transform the input text into numbers.

So to do that, we first take all the words present in the training data and create a vocabulary out of them. If our training data is as big as Wikipedia, our vocabulary could contain most of the words in the English language. Next, we assign a numeric index to each word, and then we look up only the words that occur in the current input text. Therefore, what gets fed into the transformer is not the German/English words but their corresponding indices.
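To make this concrete, here is a minimal sketch of building a vocabulary and numericalizing a sentence. The toy sentences and the special tokens (<unk>, <pad>, <sos>, <eos>) are illustrative and not tied to any particular library.

```python
from collections import Counter

# toy tokenized German training sentences
train_sentences = [
    ["ich", "liebe", "maschinelles", "lernen"],
    ["ich", "lerne", "gerne"],
]

# count every token in the training data and assign each one an index
counter = Counter(tok for sent in train_sentences for tok in sent)
specials = ["<unk>", "<pad>", "<sos>", "<eos>"]
itos = specials + sorted(counter)                  # index -> token
stoi = {tok: idx for idx, tok in enumerate(itos)}  # token -> index

def numericalize(tokens):
    # unknown words fall back to the <unk> index
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

print(numericalize(["ich", "liebe", "lernen"]))    # [5, 8, 7]
```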

Encoder

Let’s now have a closer look at the Encoder structure.

Once the data is prepared, these indices are passed on to the next layer, the embedding layer. The embedding layer also has an entry for every index in the vocabulary, and against each of those indices a vector is attached; initially these vectors are filled with random numbers. Later, during the training phase, the model updates them with values that better help it with the assigned task.

The transformer paper went with an embedding size of 512, and we will use the same here.

So what are word embeddings?

Well, these are just vector representations of a given word. Each dimension of a word embedding tries to capture some concept or linguistic feature of that word; these could be things like whether the word is a verb, a preposition, an entity, or something else. But in practice, since the model decides these features itself during training, it can be difficult to find out exactly what information each of these dimensions represents.

Graphically, the values of these dimensions represent the coordinates of the given word in some hyperspace. If two words share similar linguistic features and appear in similar contexts their embedding values are updated to become closer and closer during the training process.

For example, consider the two words ‘King’ and ‘Queen’. Initially their embeddings are randomly initialized, but over the course of training they may become more and more similar, since these two words often appear in similar contexts. Compare this with the word ‘School’, which usually appears in a completely different context. The embedding layer therefore takes the input indices, converts them into word embeddings, and passes these on to the next layer, the positional embeddings.
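In PyTorch, such an embedding layer is just a lookup table of vectors, one per vocabulary index. Here is a minimal sketch (the vocabulary size of 100 is a placeholder):

```python
import torch
import torch.nn as nn

# one 512-dimensional vector per vocabulary entry
tok_embedding = nn.Embedding(num_embeddings=100, embedding_dim=512)

src = torch.tensor([[5, 8, 7]])   # [batch size, src len] of token indices
embedded = tok_embedding(src)     # [batch size, src len, 512]
print(embedded.shape)             # torch.Size([1, 3, 512])
```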

Positional embeddings, why do we need them?

If an RNN were to take up these embeddings, it would do so sequentially, one embedding at a time, which is why RNNs are so slow. There is a positive side to this, however: since RNNs consume the embeddings sequentially in the designated order, they know which word came first, which word came second, and so on. Transformers, on the other hand, take up all embeddings at once. Even though this is a huge plus and makes transformers much faster, the downside is that they lose the critical information related to word ordering. In simple words, they are not aware of which word came first in the sequence and which came last. Here’s why positional information matters:

Even though she did not win the award, she was satisfied.

Even though she did win the award, she was not satisfied.

Notice how the position of the single word ‘not’ changed not only the sentiment but also the meaning of the sentence. So what do we do to bring the order information back to the transformer without making it recurrent like an RNN? How about we introduce a new set of vectors containing the position information? Let us call them position embeddings.

We can start by simply adding the word embeddings to their corresponding position embeddings, creating new, order-aware word embeddings. But what values should our position embeddings contain? We start by literally filling in the word position numbers, so the first position embedding has all zeros, the next has all ones, and so on.

Before being summed with the position embeddings, the token embeddings are multiplied by a scaling factor, which is the square root of the hidden dimension. This helps reduce variance in the embeddings. Dropout is then applied to the combined embeddings.
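Here is a sketch of how the scaling, the (learned) position embeddings and the dropout could be combined; hid_dim, the dropout rate, the maximum length and the vocabulary size are illustrative values:

```python
import torch
import torch.nn as nn

hid_dim, max_length, vocab_size = 512, 100, 10_000   # illustrative sizes
tok_embedding = nn.Embedding(vocab_size, hid_dim)
pos_embedding = nn.Embedding(max_length, hid_dim)     # learned position embeddings
dropout = nn.Dropout(0.1)
scale = torch.sqrt(torch.FloatTensor([hid_dim]))      # scaling factor sqrt(hid_dim)

src = torch.randint(0, vocab_size, (2, 7))            # [batch size, src len]
batch_size, src_len = src.shape

# positions 0 .. src_len-1, repeated for every example in the batch
pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_size, 1)

# scale the token embeddings, add the position embeddings, then apply dropout
embedded = dropout(tok_embedding(src) * scale + pos_embedding(pos))
print(embedded.shape)                                 # torch.Size([2, 7, 512])
```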

The combined embeddings are then passed on to the encoder layers along with the src_mask. The src_mask (source mask) has the same shape as the source sentence, with a value of 1 when the token in the source sentence is not a <pad> token and 0 when it is a <pad> token. It is used in the encoder layers to mask the multi-head attention mechanisms, which calculate and apply attention over the source sentence, so that the model does not pay attention to <pad> tokens, as they do not contain any useful information.
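A sketch of how such a source mask could be built, assuming the <pad> token has index 1; the extra dimensions make the mask broadcastable over the attention heads:

```python
import torch

def make_src_mask(src, pad_idx=1):
    # src: [batch size, src len]
    # True (1) where the token is NOT <pad>, False (0) where it is <pad>
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    return src_mask   # [batch size, 1, 1, src len]

src = torch.tensor([[5, 8, 7, 1, 1]])   # the last two positions are padding
print(make_src_mask(src))
# tensor([[[[ True,  True,  True, False, False]]]])
```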

Encoder Layer

At a high level, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.

Here

  • we pass the source sentence and its mask into the multi-head attention layer
  • the output of the embedding layer travels along a residual (skip) connection to the next layer (the “Add and Norm” block); the output of the multi-head attention layer also goes into the “Add and Norm” block
  • within the “Add and Norm” block, the add part performs a simple addition of the two inputs fed to it, giving a summed output; the norm part, layer normalization, standardizes the neurons’ activations along the feature axis
  • the output of add and norm is then passed through a position-wise feed-forward layer
  • the output of this is again passed to another “Add and Norm” block (dropout, a residual connection and a normalization layer)

The “Add and Norm” block plays a key role within the transformer block: it is used to connect the inputs and outputs of the other layers smoothly.

We add a layer that contains a residual connection and layer normalization after both the multi-head attention layer and the position-wise FFN.
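Putting the pieces together, a single encoder layer could be wired up roughly as below. This is only a sketch: it assumes MultiHeadAttentionLayer and PositionwiseFeedforwardLayer modules like the ones sketched later in this post.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    # self-attention and a position-wise FFN, each wrapped in
    # dropout, a residual connection and layer normalization
    def __init__(self, hid_dim, n_heads, pf_dim, dropout):
        super().__init__()
        self.self_attn_norm = nn.LayerNorm(hid_dim)
        self.ff_norm = nn.LayerNorm(hid_dim)
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout)                # sketched below
        self.positionwise_feedforward = PositionwiseFeedforwardLayer(hid_dim, pf_dim, dropout)  # sketched below
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_mask):
        # src: [batch size, src len, hid dim]
        _src, _ = self.self_attention(src, src, src, src_mask)
        src = self.self_attn_norm(src + self.dropout(_src))    # "Add & Norm"
        _src = self.positionwise_feedforward(src)
        src = self.ff_norm(src + self.dropout(_src))            # "Add & Norm"
        return src
```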

Layer normalization can be thought of as similar to batch normalization. Basically, we take each neuron’s activation, subtract the mean from it, divide by the standard deviation, and add a small value to the denominator just to make sure it never ends up being zero. One difference is that the mean and variance for layer normalization are calculated along the last dimension (axis=-1) instead of the first, batch dimension (axis=0). PyTorch provides an inbuilt function, nn.LayerNorm, for this.

Layer normalization prevents the range of values in the layers from changing too much, which allows faster training and better generalization ability.
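A quick sketch of what layer normalization does, first by hand along the feature dimension (following the description above) and then with PyTorch’s built-in nn.LayerNorm, which additionally has learnable scale and shift parameters:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 7, 512)             # [batch size, seq len, hid dim]

# manual version: normalize each position's features along the last dimension
mean = x.mean(dim=-1, keepdim=True)
std = x.std(dim=-1, keepdim=True, unbiased=False)
eps = 1e-5                              # small value so we never divide by zero
manual = (x - mean) / (std + eps)

# PyTorch's built-in version
layer_norm = nn.LayerNorm(512)
built_in = layer_norm(x)

print(manual.shape, built_in.shape)     # both torch.Size([2, 7, 512])
```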

Next, let’s understand the Multi-Head Attention layer.

Why do we need attention?

The attention mechanism helps the model focus on the important words in a given input sentence. Transformers did not use simple attention; they used something called self-attention. Consider this sentence: “He went to the bank to deposit some money, after which he went to a river bank for a walk.” Note how the same word, bank, means two different things: the first occurrence of bank refers to a financial institution, while the second refers to the side of a river. So how can a model know which bank refers to what? We humans judge the meaning of a word by paying attention to the context in which it appears. For instance, deposit and money indicate that the first occurrence of the word bank refers to a financial institution, while the word river indicates the second occurrence means a river bank. Likewise, the meaning of every word can be regarded as the sum of the words it pays the most attention to.

Now, the difference between simple and self-attention is that simple attention selectively focuses on words with respect to some external query: the more important a word is in determining the answer to that query, the more focus it is given. Self-attention, on the other hand, also takes the relationships among the words within the same sentence into account, and this is the layer where the attention computation happens.

So let us dive further and understand the components of the Multi-Head Attention layer.

The first components in this block are three linear layers. A linear layer is simply composed of a bunch of fully connected neurons without an activation function. They serve two main purposes:

  1. mapping inputs onto the outputs
  2. changing the (matrix/vector) dimension of the inputs themselves.

We take the embeddings of size 512, pass them to a linear layer, and shrink the size to 256. One of the many reasons why we might want to change or shrink the dimensions of an embedding vector is to save on computation cost: the larger the vector, the more operations it requires. Each node in the linear layer is connected to the input using its own set of weights. So what are these weights? Well, these are just scalar numbers that the model updates during backpropagation as it gets better and better at the downstream task, which in our case is machine translation. It is also important to point out that these weights are stored in the model as a matrix. That covers the functionality of a single linear layer.
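As a toy example of such a projection, here is a linear layer shrinking a 512-dimensional embedding down to 256 dimensions (the sizes are just for illustration):

```python
import torch
import torch.nn as nn

linear = nn.Linear(in_features=512, out_features=256)

embeddings = torch.randn(1, 10, 512)   # [batch size, sentence len, 512]
projected = linear(embeddings)         # [batch size, sentence len, 256]
print(linear.weight.shape)             # torch.Size([256, 512]), the weight matrix
print(projected.shape)                 # torch.Size([1, 10, 256])
```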

But the transformer has three separate linear layers. Why is that? It turns out that each one of these layers has a special function: we call them the Query, the Key and the Value linear layers.

This can be partially motivated by the way retrieval systems work. When we type a search request on YouTube, let us call that search request a Query. Now let us assume that the YouTube search algorithm is quite simplistic: what it does is go through all the video titles in its database; these titles can be termed the Keys. To find the best matches, it has to compute some sort of similarity between our Query and the corresponding Keys. Once the most similar Key has been found, it returns the video affiliated with that Key; we will call the contents of a video its Value. Notice how similarity can be thought of as a proxy for attention, because the model returns the best video only by paying attention to the video title most similar to the search query.

Great, but how do we compute the similarity between a Query and a Key? A good way to compute the similarity between two vectors is the cosine similarity, which can be obtained by taking the dot product of the two vectors and then dividing by their magnitudes for scaling purposes. Now, if we want to compute the similarity between matrices instead of vectors, we have to transpose the second matrix so that the dimensions line up for matrix multiplication.
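A small sketch of both ideas with random vectors: the dot product gives raw similarity scores, and dividing by the magnitudes gives the cosine similarity:

```python
import torch

q = torch.randn(4, 64)   # 4 query vectors of size 64
k = torch.randn(6, 64)   # 6 key vectors of size 64

# dot-product similarity between every query and every key;
# the key matrix is transposed so the dimensions line up for matmul
scores = torch.matmul(q, k.transpose(0, 1))   # [4, 6]

# cosine similarity additionally divides by the vectors' magnitudes
cosine = scores / (q.norm(dim=-1, keepdim=True) * k.norm(dim=-1).unsqueeze(0))
print(scores.shape, cosine.shape)             # torch.Size([4, 6]) torch.Size([4, 6])
```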

How does this tie back to our attention layer, and what exactly should we feed to the Q, K and V linear layers?

To the query layer, we feed the position-aware embeddings; we then make two more copies of the embeddings and feed the same to the key and the value layers. I know that makes no sense at first, because in the YouTube example the queries, keys and values meant different things and had very different contents. So why are we using the same content as input to the Query, Key and Value layers here? Well, that is where the self-attention part comes into play.

We take the three embedding copies and pass each through its own linear layer; all that means is that we multiply the embeddings by the weights of that linear layer. Note that each linear layer has its own set of weights. Since matrix multiplication requires specific dimensions, we may have to transpose the embedding matrix accordingly. After multiplication, each linear layer outputs a matrix, and these are called the query, key and value matrices.

  • First we take a simple dot product of the Query matrix and the transpose of the Key matrix; the output of this dot product can be called an attention filter.
  • Since this is a very important output, let’s understand its contents. At the start of the training process the contents of the attention filter are more or less random numbers, but once training is done they take on more meaningful values. The scores inside this matrix are the attention scores.
  • We then scale our attention scores. The authors of the paper divided the scores by the square root of the dimension of the key vectors.
  • Finally we squash our attention scores between 0 and 1 using a softmax function, and we get our final attention filter.
  • We also have the input that was passed to the value linear layer to generate the value matrix.

So we now have the original value matrix, which pretty much represents the original embedding information, because we did not alter it much except by passing it through a single linear layer. On the other hand, we have the attention filter computed using the dot product of the Q and K matrices.

When we multiply the attention filter with the value matrix, we get a filtered value matrix that assigns high focus to the features that are more important, and this filtered value matrix is the output of a single attention head.
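The whole computation for one head can be expressed in a few lines. Below is a sketch of scaled dot-product attention; the mask argument is optional and works with the source mask built earlier:

```python
import torch

def scaled_dot_product_attention(query, key, value, mask=None):
    # query/key/value: [batch size, n heads, seq len, head dim]
    head_dim = query.shape[-1]

    # attention filter: similarity of every query with every key, scaled by sqrt(head_dim)
    energy = torch.matmul(query, key.transpose(-2, -1)) / head_dim ** 0.5

    if mask is not None:
        # masked positions (e.g. <pad> tokens) get a very large negative score
        energy = energy.masked_fill(mask == 0, -1e10)

    attention = torch.softmax(energy, dim=-1)           # squash the scores to 0..1
    return torch.matmul(attention, value), attention    # filtered value matrix + filter
```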

Transformers don’t learn one attention filter; they learn multiple, each focusing on a different linguistic feature. Each attention head therefore outputs its own attention filter, which in turn yields its own filtered value matrix, each zooming in on a different combination of linguistic features. In the paper, the authors used a total of 8 attention heads, and we will use the same.

What do we do next?

We simply go ahead and concatenate them together. Since we don’t want this matrix to grow longer and longer with each head used, we pass it through a linear layer to shrink its size back to [sentence length, embedding size (512)]. This is the final output of the multi-head attention layer.
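Here is a sketch of a complete multi-head attention module along these lines: it splits hid_dim into n_heads smaller heads, computes the attention filter per head, applies it to the values, concatenates the heads back together and projects the result with a final linear layer.

```python
import torch
import torch.nn as nn

class MultiHeadAttentionLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, dropout):
        super().__init__()
        assert hid_dim % n_heads == 0
        self.hid_dim, self.n_heads = hid_dim, n_heads
        self.head_dim = hid_dim // n_heads
        self.fc_q = nn.Linear(hid_dim, hid_dim)   # query linear layer
        self.fc_k = nn.Linear(hid_dim, hid_dim)   # key linear layer
        self.fc_v = nn.Linear(hid_dim, hid_dim)   # value linear layer
        self.fc_o = nn.Linear(hid_dim, hid_dim)   # projects the concatenated heads back
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]
        # project, then split hid_dim into n_heads chunks of head_dim
        Q = self.fc_q(query).view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        K = self.fc_k(key).view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        V = self.fc_v(value).view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)

        energy = torch.matmul(Q, K.transpose(-2, -1)) / self.head_dim ** 0.5
        if mask is not None:
            energy = energy.masked_fill(mask == 0, -1e10)
        attention = torch.softmax(energy, dim=-1)        # one attention filter per head

        x = torch.matmul(self.dropout(attention), V)     # filtered values per head
        x = x.permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.hid_dim)  # concatenate heads
        return self.fc_o(x), attention
```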

Position-wise Feedforward Layer

Another key component in the Transformer block is the position-wise feed-forward network (FFN); this is relatively simple compared to the multi-head attention layer. It accepts a 3-dimensional input with shape [batch size, sequence length, hid dim].

The position-wise FFN consists of two dense layers. Since the same two dense layers are used for each position in the sequence, we refer to it as position-wise. It is equivalent to applying two 1×1 convolution layers.

The input is transformed from hid_dim to pf_dim, where pf_dim is usually a lot larger than hid_dim. The ReLU activation function and dropout are applied before it is transformed back into a hid_dim representation.
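A sketch of the position-wise FFN as a small PyTorch module, assuming the hid_dim / pf_dim naming used above:

```python
import torch
import torch.nn as nn

class PositionwiseFeedforwardLayer(nn.Module):
    def __init__(self, hid_dim, pf_dim, dropout):
        super().__init__()
        self.fc_1 = nn.Linear(hid_dim, pf_dim)   # expand hid_dim -> pf_dim
        self.fc_2 = nn.Linear(pf_dim, hid_dim)   # project back pf_dim -> hid_dim
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: [batch size, seq len, hid dim]
        x = self.dropout(torch.relu(self.fc_1(x)))   # ReLU + dropout on the expanded representation
        return self.fc_2(x)
```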

Decoder

The Transformer decoder block is quite similar to the encoder block.

Besides the two sub-layers (the multi-head attention layer and the position-wise feed-forward network), the decoder Transformer block contains a third sub-layer, which applies multi-head attention over the output of the encoder stack.

Similar to the Transformer encoder block, the Transformer decoder block employs ‘Add and norm’, i.e., the residual connections and the layer normalization to connect each of the sub-layers.

In general, the encoder takes in the input text and converts it into some vectorized representation. The decoder then takes this representation and converts it into new text. One major difference between the encoder and the decoder is that while the encoder takes just one input, i.e. the source text, the decoder takes two: the first being the output of the encoder and the second being the output text that has been generated so far.

  • After we get the output of the encoder, we use it twice, as the key (enc_src) and as the value (enc_src).
  • We usually assume that the first word the decoder generates is the special token indicating the start of the generated sentence. We feed this token as the first input to the decoder. From here it travels into an embedding layer, which converts it into a vector; we then add positional information to it and pass it on to the (masked) multi-head attention layer.
  • The output of the attention layer then goes into an ‘Add & Norm’ block; this gives us a matrix, and we send this matrix to the decoder’s second multi-head attention module as the query. This second multi-head attention module therefore takes the key and value matrices that come from the encoder and the query matrix that comes from the previously generated text sequence (a sketch of the full decoder layer follows this list).
  • The output of this multi-head attention flows forward into a position-wise feed-forward layer and another ‘Add & Norm’ block.
  • The output of this is passed to the final linear layer, whose size depends on the number of classes to predict. In the case of machine translation, the size of the final layer is the target vocabulary size.
  • The output of the linear layer is then passed to a softmax layer, which squashes the values between 0 and 1. Finally, we can pick the word with the highest softmax value. Note – In PyTorch, the softmax operation is contained within our loss function, so we do not explicitly need to use a softmax layer here.
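Mirroring the steps above, a single decoder layer could be sketched as follows, reusing the MultiHeadAttentionLayer and PositionwiseFeedforwardLayer modules sketched earlier:

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    # masked self-attention over the target, then encoder-decoder attention
    # (queries from the target, keys/values from the encoder output),
    # then the position-wise FFN, each followed by "Add & Norm"
    def __init__(self, hid_dim, n_heads, pf_dim, dropout):
        super().__init__()
        self.self_attn_norm = nn.LayerNorm(hid_dim)
        self.enc_attn_norm = nn.LayerNorm(hid_dim)
        self.ff_norm = nn.LayerNorm(hid_dim)
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout)
        self.encoder_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout)
        self.positionwise_feedforward = PositionwiseFeedforwardLayer(hid_dim, pf_dim, dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, trg, enc_src, trg_mask, src_mask):
        _trg, _ = self.self_attention(trg, trg, trg, trg_mask)
        trg = self.self_attn_norm(trg + self.dropout(_trg))
        # query from the decoder, key and value from the encoder output
        _trg, attention = self.encoder_attention(trg, enc_src, enc_src, src_mask)
        trg = self.enc_attn_norm(trg + self.dropout(_trg))
        _trg = self.positionwise_feedforward(trg)
        trg = self.ff_norm(trg + self.dropout(_trg))
        return trg, attention
```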

Masking

The main use of the masking module comes during the training phase, unlike inference, where we do not know the answers beforehand. During training the model is provided with both the source sentence and the translated sentence, which allows it to learn from its mistakes. The model is fed the source sequence as well as the correct target sequence it should translate to; however, we mask the target sequence. Why? Well, a teacher does not show you all the answers during a practice exam; you are first required to use your own mind and come up with your own answers. Only then does the teacher tell you how well you did and provide the correct answers. This way you can better learn from your mistakes. Similarly, after the input text is passed into the transformer, the decoder generates its first word; we then unmask and show the actual value it should have generated instead. The mask is a filter matrix in which all the future words get a score of 0.
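A sketch of how such a target mask could be built: a padding mask just like the source mask, combined with a lower-triangular “subsequent” mask so every position can only attend to itself and earlier positions (pad_idx=1 is an assumption):

```python
import torch

def make_trg_mask(trg, pad_idx=1):
    # trg: [batch size, trg len]
    trg_pad_mask = (trg != pad_idx).unsqueeze(1).unsqueeze(2)        # hide <pad> tokens
    trg_len = trg.shape[1]
    # lower-triangular matrix: position i may only attend to positions <= i,
    # so every "future" word gets a score of 0
    trg_sub_mask = torch.tril(torch.ones((trg_len, trg_len))).bool()
    return trg_pad_mask & trg_sub_mask                               # [batch, 1, trg len, trg len]

print(make_trg_mask(torch.tensor([[2, 5, 8, 3]]))[0, 0].int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]], dtype=torch.int32)
```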

Inference

During inference we

  • tokenize the source sentence if it has not been tokenized already (i.e. it is still a string)
  • append the <sos> and <eos> tokens
  • numericalize the source sentence
  • pass this to the encoder to generate the vectorized representation
  • the decoder first consumes this encoder output; we also pass in the first generated word, which is the special token indicating the start of the generated text
  • this token is passed to the output embedding layer, and then the position information is added to it
  • finally, we pass it through the remaining layers and have the next word generated. We follow the same steps for the newly generated word, except now the decoder consumes both the first and the second word. Using these we get the third word in the sequence. This process continues until the decoder generates the special end token (a sketch of this loop is shown below).
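Put together, greedy decoding could look roughly like the sketch below. The names here (model.encoder, model.decoder, model.make_src_mask, model.make_trg_mask, and the stoi/itos vocabulary lookups) are assumptions about how the surrounding code is organised, not a fixed API:

```python
import torch

def translate_sentence(tokens, src_stoi, trg_stoi, trg_itos, model, device, max_len=50):
    # tokens: a tokenized German sentence; all names here are illustrative
    model.eval()
    tokens = ["<sos>"] + tokens + ["<eos>"]
    src_tensor = torch.LongTensor([src_stoi[tok] for tok in tokens]).unsqueeze(0).to(device)
    src_mask = model.make_src_mask(src_tensor)

    with torch.no_grad():
        enc_src = model.encoder(src_tensor, src_mask)   # vectorized representation

    trg_indexes = [trg_stoi["<sos>"]]                   # start with the <sos> token
    for _ in range(max_len):
        trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(0).to(device)
        trg_mask = model.make_trg_mask(trg_tensor)
        with torch.no_grad():
            output, attention = model.decoder(trg_tensor, enc_src, trg_mask, src_mask)
        pred_token = output.argmax(2)[:, -1].item()     # pick the most likely next word
        trg_indexes.append(pred_token)
        if pred_token == trg_stoi["<eos>"]:             # stop at the end token
            break

    return [trg_itos[i] for i in trg_indexes[1:]]
```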

The code for this can be found here.

Thanks for reading!
