Convolutional Sequence to Sequence Learning

Traditionally, recurrent neural networks (RNNs) with LSTM or GRU units have been the most prevalent tools for NLP researchers, providing state-of-the-art results on many NLP tasks, including language modeling (LM), neural machine translation (NMT), sentiment analysis, and so on. However, a major drawback of RNNs is that each word in the input sequence is processed sequentially, which makes them slow to train.

Lately, convolutional neural networks, traditionally used to solve computer vision problems, have also found prevalence in NLP tasks such as sentence classification, text classification, sentiment analysis, text summarization, machine translation and answer relations.

Back in 2017, a team of researchers from Facebook AI Research released an interesting paper on sequence to sequence learning with convolutional neural networks (CNNs), in which they applied CNNs to problems in natural language processing.

In this post, I’ll try to summarize how the paper uses CNNs for machine translation.

What are Convolutional Neural Networks, and why are they effective for NLP?

Convolutional Neural Networks (CNNs) were originally designed to perform deep learning for computer vision tasks, and have proven highly effective. They use the concept of “convolution”, a sliding window or “filter” that passes over the image, identifying important features and analyzing them one at a time, then reducing them down to their essential characteristics, and repeating the process.

Now, let’s see how the convolution process can be applied to NLP.

Neural networks can only learn to find patterns in numerical data, so before we feed text into a neural network as input, we have to convert each word into a numerical value. We start with an input sentence broken up into words and transformed into word embeddings (low-dimensional representations generated by models like word2vec or GloVe, or by a custom embedding layer). The text is organized into a matrix, with each row representing the word embedding of one word. The CNN’s convolutional layer “scans” the text like it would an image and breaks it down into features.
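To make this concrete, here is a minimal sketch (assuming PyTorch and a toy vocabulary and embedding size, not the paper’s setup) of how a tokenized sentence becomes the matrix of word embeddings that the convolutional layers operate on:

```python
# A minimal sketch: turning a tokenized sentence into the word-embedding
# matrix that the convolutional layers operate on.
import torch
import torch.nn as nn

sentence = ["the", "cat", "sat", "on", "the", "mat"]
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence)))}

emb_dim = 8                                    # illustrative embedding size
embedding = nn.Embedding(len(vocab), emb_dim)  # randomly initialised here;
                                               # could be word2vec/GloVe weights

token_ids = torch.tensor([vocab[w] for w in sentence])  # shape: [seq_len]
embedded = embedding(token_ids)                         # shape: [seq_len, emb_dim]
print(embedded.shape)  # torch.Size([6, 8]) -- one row per word
```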

The following image illustrates how the convolutional “filter” slides over a sentence, three words at a time. This is called a 1D convolution because the kernel moves in only one dimension. It computes an element-wise product between the embeddings of the words in the window and the weights of the convolutional filter. The resulting output is a feature vector with roughly as many values as there were input embeddings, so the input sequence length does matter.

A convolutional neural network will include many of these kernels (filters), and, as the network trains, the kernel weights are learned. Each kernel is designed to look at a word and its surrounding word(s) in a sequential window, and to output a value that captures something (some context) about that phrase. In this way, the convolution operation can be viewed as window-based feature extraction.
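Here is a small, hedged sketch of that window-based feature extraction: a 1D convolution with a kernel of width 3 sliding over a sentence of word embeddings. The sizes (sequence length, embedding dimension, number of filters) are made up for illustration.

```python
# A 1D convolution over word embeddings: a width-3 kernel slides over the
# sentence, producing one feature value per window position and per filter.
import torch
import torch.nn as nn

seq_len, emb_dim, n_filters = 6, 8, 16
embedded = torch.randn(1, seq_len, emb_dim)        # [batch, seq_len, emb_dim]

conv = nn.Conv1d(in_channels=emb_dim, out_channels=n_filters,
                 kernel_size=3, padding=1)         # padding keeps seq_len fixed

# Conv1d expects [batch, channels, seq_len], so the embedding dim becomes channels.
features = conv(embedded.permute(0, 2, 1))         # [1, n_filters, seq_len]
print(features.shape)                              # torch.Size([1, 16, 6])
```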

Before we delve into the model architecture described in the paper, let’s first look at the general RNN-based encoder-decoder model.

We use our encoder (green) over the embedded source sequence (yellow) to create a context vector (red). We then use that context vector with the decoder (blue) and a linear layer (purple) to generate the target sentence.

How does the convolutional sequence to sequence model work?

The architecture the authors propose for sequence to sequence modeling is entirely convolutional. The diagram below outlines the structure of the convolutional sequence to sequence model.

Like any RNN-based sequence to sequence structure, the CNN-based model uses an encoder-decoder architecture; however, here both the encoder and the decoder are composed of stacked convolutional layers with a special type of activation function called the Gated Linear Unit (GLU). In between sits an attention function. The encoder extracts features from the source sequence, while the decoder learns to estimate the function that maps the encoder’s hidden states and its previously generated words to the next word. The attention tells the decoder which hidden states of the encoder to focus on.

This model also introduces the concept of positional embeddings. What do we mean by positional embedding?

Because a CNN processes all the words in a sequence simultaneously, it cannot capture sequence order the way an RNN (a timeseries-based model) does. In order to use the order of the words in the sentence, the absolute position of each token needs to be injected into the model explicitly. This works just like a regular word embedding, but instead of mapping words, it maps the absolute position of a word to a dense vector. The positional embedding output is then added to the word embedding. With this additional information, the model knows which part of the context it is handling.

The word goes through its embedding layer, the position of the word goes through its own embedding layer, and the two are added up.
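A minimal sketch of that step, assuming PyTorch and illustrative sizes: the token indices and the position indices go through separate embedding layers, and the results are summed element-wise.

```python
# Combining word and positional embeddings (layer sizes are illustrative).
import torch
import torch.nn as nn

vocab_size, max_len, emb_dim = 10_000, 100, 256
tok_embedding = nn.Embedding(vocab_size, emb_dim)
pos_embedding = nn.Embedding(max_len, emb_dim)

tokens = torch.randint(0, vocab_size, (1, 6))            # [batch, seq_len]
positions = torch.arange(tokens.shape[1]).unsqueeze(0)   # [[0, 1, 2, 3, 4, 5]]

combined = tok_embedding(tokens) + pos_embedding(positions)  # element-wise sum
print(combined.shape)  # torch.Size([1, 6, 256])
```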


The paper also applies residual connections (skip connections) between the blocks in both the encoder and the decoder, which allows for deeper convolutional networks.

Why residual connections?

Models with many layers often rely on shortcut or residual connections. When we stack up convolutional layers to form a deeper model, it becomes harder and harder to optimize: the model has a lot of parameters, which can lead to poor performance, and the gradients can vanish or explode and become very difficult to handle. This is addressed by adding residual blocks (skip connections), i.e. adding a block’s input directly onto its output. This technique makes learning easier and faster, lets the model go deeper, and also helps improve accuracy.

In simple terms, each layer can produce values in very different ranges; one could be in the tens while another is in the thousands. During backpropagation, skip connections (residual connections) give the gradients a direct path back to earlier layers, so the optimizer can update each layer in proportion to its signal, rather than the updates shrinking to very small amounts as they flow from layer to layer.
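As a rough illustration (not the paper’s exact block), a residual connection simply adds a block’s input back onto its output:

```python
# A residual (skip) connection around a convolutional block: the block's
# input is added back to its output before moving on.
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    def __init__(self, hid_dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(hid_dim, hid_dim, kernel_size,
                              padding=(kernel_size - 1) // 2)

    def forward(self, x):              # x: [batch, hid_dim, seq_len]
        return self.conv(x) + x        # skip connection: output + input

block = ResidualConvBlock(hid_dim=64)
x = torch.randn(1, 64, 10)
print(block(x).shape)                  # torch.Size([1, 64, 10])
```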

Encoder

Let’s now have a closer look at the Encoder structure.


Let’s say we are building a machine translation model from German to English.

  • We take the German sentence, add padding at the start and end of the sentence, and split it into tokens. We pad because the CNN layer would otherwise reduce the length of the sentence; the padding keeps the sentence length the same.
  • We first send the tokens to an embedding layer to get the word embeddings. We also need to encode the position of each word, so we send the positions (the index positions of the words) to another, similar embedding layer to get the positional embeddings.
  • We then do an element-wise sum of the word embedding and the positional embedding, which results in a combined embedding vector (this representation knows the word and also encodes its location in the sentence).
  • This vector goes into a fully connected layer, because we need to convert it into a particular dimension and also to increase capacity and extract information; basically, to turn these simple numbers into something more complex (like a rearranging of features).
  • The outputs of this FC layer (one per token) are sent simultaneously to the convolutional blocks.
  • For each token representation going into the convolutional blocks, we get an individual output.
  • This output is sent to another fully connected layer, because the output of the convolution needs to be converted back into the embedding dimension of the encoder.
  • The final vector will have an embedding size equal to the number of dimensions we want.
  • We also add a skip connection: the output of the final FC layer is added to the element-wise sum of the word and positional embeddings, i.e. we send the whole word along with its position to the decoder, since the convolutional layers could lose the positional information.

Finally, the encoder block sends two outputs to the decoder: one is the conved output and the other is the combined vector (a combination of the transformed vector and the embedding vector). So if we have 6 tokens, we get 12 context vectors, 2 context vectors per token, one from conved and one from combined. A simplified sketch of the whole encoder is shown below.
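The sketch below is a hedged simplification of the encoder described above: the convolutional stack is collapsed into a single placeholder layer and all dimensions are illustrative. The point is only to show the two FC layers around the convolution, the residual sum with the embeddings, and the two outputs (conved and combined) per token.

```python
# A simplified encoder sketch: embeddings -> FC -> conv block(s) -> FC,
# with the embeddings added back to form the "combined" output.
import torch
import torch.nn as nn

class ConvEncoderSketch(nn.Module):
    def __init__(self, vocab_size=10_000, max_len=100, emb_dim=256, hid_dim=512):
        super().__init__()
        self.tok_embedding = nn.Embedding(vocab_size, emb_dim)
        self.pos_embedding = nn.Embedding(max_len, emb_dim)
        self.emb2hid = nn.Linear(emb_dim, hid_dim)   # FC layer into the conv blocks
        self.conv_blocks = nn.Conv1d(hid_dim, hid_dim, 3, padding=1)  # stand-in
        self.hid2emb = nn.Linear(hid_dim, emb_dim)   # FC layer back to emb_dim

    def forward(self, src):                          # src: [batch, src_len]
        pos = torch.arange(src.shape[1]).unsqueeze(0)
        embedded = self.tok_embedding(src) + self.pos_embedding(pos)
        hidden = self.emb2hid(embedded)                     # [batch, src_len, hid_dim]
        conved = self.conv_blocks(hidden.permute(0, 2, 1))  # [batch, hid_dim, src_len]
        conved = self.hid2emb(conved.permute(0, 2, 1))      # [batch, src_len, emb_dim]
        combined = conved + embedded                        # residual with embeddings
        return conved, combined                             # two vectors per token

encoder = ConvEncoderSketch()
conved, combined = encoder(torch.randint(0, 10_000, (1, 6)))
print(conved.shape, combined.shape)   # both torch.Size([1, 6, 256])
```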

Encoder Convolutional Block

Let’s now look at the convolutional block within the encoder architecture.

  • The main difference here is that, unlike with images, all our convolutions in NLP are 1-D rather than the 2-D convolutions CNNs are famous for in vision.
  • As mentioned earlier, we pass the padded input sentence to the CNN block. This is because the CNN layer is going to reduce the length of the sentence, and we need to ensure that the length of the sentence going into the convolutional block is equal to the length of the sentence coming out of it.
  • We then convolve the padded input using a kernel size of 3 (an odd size).
  • The output of this is sent to a special kind of activation, the GLU (Gated Linear Unit).

How does the GLU activation work?

  1. The output of the convolutional layer, i.e. the input to the GLU, is split into two halves, A and B. One half (A) goes through a sigmoid, and we then take the element-wise product of the two halves.
  2. The sigmoid acts as a gate, determining which parts of B are relevant to the current context: the larger the value of an entry in A, the more important the corresponding entry in B is. This gating mechanism lets the model select the effective parts of the input features when predicting the next word. Since the GLU halves the size of its input, the convolutional layer outputs twice the hidden dimension to make up for it. (A small sketch of this block follows the list below.)
  • To the output of the GLU we add a residual connection, i.e. the input to the convolutional layer is added as a skip connection to the output of the GLU. This is done to avoid issues associated with deep convolutional stacks; the skip connection ensures a smooth flow of gradients.
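Here is a hedged sketch of one such block, assuming PyTorch: the convolution outputs twice the hidden dimension, the GLU (torch.nn.functional.glu) halves it again by gating one half with the sigmoid of the other, and the block’s input is added back as the residual connection. Sizes are illustrative.

```python
# One encoder convolutional block: conv doubles the channels, GLU halves
# them again, and the block input is added back as a residual connection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderConvBlock(nn.Module):
    def __init__(self, hid_dim=512, kernel_size=3):
        super().__init__()
        # output channels = 2 * hid_dim, because the GLU will halve them again
        self.conv = nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size,
                              padding=(kernel_size - 1) // 2)

    def forward(self, x):                 # x: [batch, hid_dim, seq_len]
        conved = self.conv(x)             # [batch, 2 * hid_dim, seq_len]
        gated = F.glu(conved, dim=1)      # split in two, gate one half -> hid_dim
        return gated + x                  # residual connection around the block

block = EncoderConvBlock()
x = torch.randn(1, 512, 6)
print(block(x).shape)                     # torch.Size([1, 512, 6])
```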

This concludes a single convolutional block. Subsequent blocks take the output of the previous block and perform the same steps. Each block has its own parameters; they are not shared between blocks. The output of the last block goes back to the main encoder, where it is fed through a linear layer to get the conved output and then element-wise summed with the token embeddings to get the combined output.

Decoder

The decoder is very similar to the encoder, but with a few changes.

  • In the decoder, we pass the whole target sentence (not word by word) for prediction. As in the encoder, we first pass the tokens to an embedding layer to get the word and positional embeddings.
  • We add the word and positional embeddings with an element-wise sum and pass the result to a fully connected layer, which then feeds the convolutional layers.
  • The convolutional layer accepts two additional inputs, the encoder conved and encoder combined outputs (this is how encoder information is fed into the decoder). We also pass the embedding vector as a residual connection to the convolutional layer. Unlike in the encoder, this residual (skip) connection goes only into the convolutional block; it does not go to the output of the convolutional block, because we have to use that information to predict the output.
  • This then goes to a two-layer linear network (FC layers) to make the final prediction.

Decoder Convolution Blocks

Let’s now look at the decoder convolutional blocks. They are similar to the ones in the encoder, but with a few changes.


For the encoder, the input sequence is padded so that the input and output lengths are the same, and we pad the target sentence in the decoder as well, for the same reason. However, in the decoder we only pad at the beginning of the sentence; this padding makes sure the decoder’s target is shifted by one word from its input. Since we process the whole target sequence simultaneously, we need a way not only to let the filter carry the current token to the next stage, but also to make sure the model does not learn to output the next word in the sequence by simply copying it, without actually learning how to translate.

If we don’t pad at the beginning, the model can see the next word while convolving and would literally copy it to the output, without learning to translate.
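A minimal sketch of this decoder-side padding, assuming PyTorch and illustrative sizes: the target representation is padded on the left by kernel_size - 1 positions, so a width-3 kernel at position t only ever sees tokens up to t, never the word it is trying to predict.

```python
# Decoder-side padding: pad only at the beginning so the kernel cannot
# peek at future tokens, while the output length stays the same.
import torch
import torch.nn as nn
import torch.nn.functional as F

hid_dim, kernel_size = 512, 3
conv = nn.Conv1d(hid_dim, hid_dim, kernel_size)    # no built-in padding here

trg = torch.randn(1, hid_dim, 6)                   # [batch, hid_dim, trg_len]
padded = F.pad(trg, (kernel_size - 1, 0))          # pad left only: [1, 512, 8]
out = conv(padded)                                 # [1, 512, 6] -- length preserved,
print(out.shape)                                   # and no access to future tokens
```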


Attention

The model also adds attention in every decoder layer, and the authors demonstrate that each attention layer adds only a negligible amount of overhead. The model uses both the encoder conved and encoder combined outputs to figure out where the encoder wants the model to focus while making the prediction.

  • Firstly, we take the conved output of a word from the decoder and do an element-wise sum with the decoder input embedding to generate a combined embedding.
  • Next, we calculate the attention between this combined embedding and the encoder conved output, to find how well it matches the encoder conved.
  • Then, these attention weights are used to calculate a weighted sum over the encoder combined output, which applies the attention.
  • This is then projected back up to the hidden dimension size, and a residual connection to the initial input of the attention layer is applied.

This can be seen as attention with multiple ’hops’ compared to single step attention.
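The following is a hedged sketch of a single attention step as described above, using plain tensor operations and illustrative sizes, and omitting the projection layers and residual connection for brevity: the decoder’s combined embedding is scored against the encoder conved output, and the resulting weights are applied to the encoder combined output.

```python
# One decoder attention step: score against encoder conved, then take a
# weighted sum over encoder combined.
import torch
import torch.nn.functional as F

batch, trg_len, src_len, emb_dim = 1, 5, 6, 256
decoder_combined = torch.randn(batch, trg_len, emb_dim)   # decoder conved + embedding
encoder_conved   = torch.randn(batch, src_len, emb_dim)
encoder_combined = torch.randn(batch, src_len, emb_dim)

# how well each decoder position matches each encoder position
scores = torch.matmul(decoder_combined, encoder_conved.permute(0, 2, 1))
attention = F.softmax(scores, dim=-1)                     # [batch, trg_len, src_len]

# weighted sum over the encoder's combined vectors
attended = torch.matmul(attention, encoder_combined)      # [batch, trg_len, emb_dim]
print(attended.shape)                                     # torch.Size([1, 5, 256])
```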

Compared to RNN models, convolutional models have two advantages:

  • First, they run faster, because convolutions can be performed in parallel; by contrast, an RNN needs to wait for the previous timesteps to be computed.
  • Second, they capture dependencies of different lengths between words easily. In a stack of CNN layers, the bottom layers capture closer dependencies while the top layers extract longer (more complex) dependencies between words.

Having said that, when comparing RNNs and CNNs, both are commonplace in deep learning. Each architecture has advantages and disadvantages that depend on the type of data being modeled.

In the abstract of the paper, the authors claim to outperform the accuracy of deep LSTMs on WMT’14 English-German and WMT’14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.

References:

Convolutional Sequence to Sequence Learning by Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin
(Facebook AI Research)

Thanks for reading!
