Sequential Data Processing in NLP

We humans have an amazing ability to rapidly interpret words and put them into context while we exchange our thoughts and interact with others, and the credit goes to the best computer we have ever known: the human brain.

Over the years, scientists have carried out extensive research and found that this involves a huge and complex set of computations in the brain. Recent advances in deep learning have produced state-of-the-art results and made it possible for machines to perform similarly to humans in various natural language processing tasks such as text generation, language translation, sentiment analysis, cloze tests and speech recognition.

There are various state-of-the-art techniques in NLP which have evolved over the years, from vanilla recurrent neural networks (RNNs) to Long Short-Term Memory networks (LSTMs) and Gated Recurrent Unit networks (GRUs), leading up to the "attention" model architectures and their several variants, and some revolutionary new models like GPT-2/GPT-3.

In this article, I would like to explain the architecture of a few of these techniques: LSTM, GRU and the attention mechanism. So let's start…

To illustrate the core ideas, let me first explain recurrent neural networks (RNNs) before moving on to their variants, LSTM and GRU.

Recurrent Neural Network

Recurrent Neural Networks (RNNs) are a class of neural networks which are often used with sequential data.

What is sequential data? Consider the word "Apple" on its own: it does not tell us what exactly we are talking about. It could be Apple the product, the company or the fruit; unless we get some sequence of text or characters, we will not be able to comprehend what exactly is being said.

Traditional fully connected (FC) neural networks have proved successful in many tasks; however, they are not great at capturing sequences, as they don't have any mechanism to store memory. Let's say there is a deeper network with one input layer, two hidden layers and one output layer. Each hidden layer has its own set of weights and biases. This means that these layers are independent of each other, i.e. they do not memorize previous outputs.

This is something RNNs try to solve, to some extent, by adding a looping mechanism that allows information to flow from one step to the next. An RNN converts independent activations into dependent activations by sharing the same weights and biases across all time steps, memorizing each previous output by feeding it as input to the next hidden step. Since the weights and biases of all these hidden layers are the same, they can be rolled together into a single recurrent layer.

In the above diagram, the input comes in and is passed to the hidden layer, which is actually a fully connected layer; the hidden layer has neurons that multiply the input by weights and generate an output, and that output is then fed back into the same layer. In short, an RNN is a fully connected layer which feeds its output back to itself.

Hidden State in RNN Cell

First, the input and the previous hidden state are combined to form a vector. That vector now has information on the current input and the previous inputs. The vector goes through the tanh activation (the tanh function squashes values to always be between -1 and 1; this helps regulate the values flowing through the network by not allowing them to explode, since many weights are multiplied and added together across the layers, so tanh acts as a kind of normalization), and the output is the new hidden state, or the memory of the network.
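
As a rough sketch of this update (the names, shapes and the use of NumPy here are purely illustrative, not taken from any particular library), one RNN time step might look like:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One RNN time step: combine the current input with the previous hidden
    state, then squash with tanh to produce the new hidden state (the memory)."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)
```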

Let's see how an RNN processes text. Consider a sentence for which we have to predict the sentiment (positive/negative): "The movie is great!". First we break up this sentence into individual words. Since RNNs work sequentially, we pass one word at a time. Since a neural network takes only numerical data, the words are converted into embedding vectors of a specific dimension (say 100, 200, etc.); however, to keep things simple for now, we will assume the words are passed as is.

The first step is to feed "The" into the RNN. The RNN encodes "The" and produces an output, but this output is not what we are interested in; we are only interested in the last output. This output is fed to the RNN at the next step along with the second word "movie". The RNN now has information on both the word "The" and "movie". We repeat this process until all the text has been passed, and the final output is then passed to another fully connected layer, outside of the RNN, to predict the sentiment.
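
Continuing the sketch above (the dimensions and random weights are placeholders, and in a real model each word would be an embedding vector), the loop over the sentence and the classification layer outside the RNN could look like this:

```python
input_dim, hidden_dim = 100, 64
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
W_out = rng.normal(scale=0.1, size=(hidden_dim, 1))   # sentiment head outside the RNN

# Placeholder "embeddings" standing in for the words of "The movie is great!"
sentence = [rng.normal(size=input_dim) for _ in ["The", "movie", "is", "great!"]]

h = np.zeros(hidden_dim)              # empty memory before the first word
for x_t in sentence:                  # one word at a time, left to right
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)

sentiment_logit = h @ W_out           # only the final hidden state is used
```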

Recurrent neural networks perform better than fully connected (FC) networks on sequences. However, they suffer from short-term memory. If a sequence is long enough, they have a hard time carrying information from earlier time steps to later ones. Suppose you are trying to process a paragraph of text to make a prediction, such as: "The movie is a horrible experience of unbearable length, utterly worthless. I walked out of the film after two hours of its 170-minute length. It is the worst film I have ever seen, never ever watch this movie".

RNNs may leave out important information from the beginning and just consider the last few words, "ever watch this movie", to make the prediction. We get a much bigger context of what is happening right now, but we lose the context of the initial information that came in. This is called short-term memory, and it is caused by the vanishing gradient problem during backpropagation. Gradients are the values used to update a neural network's weights. The vanishing gradient problem is when the gradient shrinks as it is backpropagated through time. If a gradient value becomes extremely small, it doesn't contribute much learning for the initial elements of the sequence.
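
A toy calculation illustrates the effect (the 0.9 factor is just an assumed per-step gradient magnitude, not derived from any real network):

```python
# Backpropagating through many time steps multiplies many small factors together.
grad = 1.0
per_step_factor = 0.9      # assumed magnitude of the recurrent gradient per step
for _ in range(50):        # 50 time steps back in time
    grad *= per_step_factor
print(grad)                # ~0.005: almost no learning signal reaches the earliest words
```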

Long Short-Term Memory (LSTM)/GRU

Long short-term memory and GRU networks are an extension of recurrent neural networks that combats the problem of short-term memory: they enable RNNs to remember inputs over a long period of time by essentially extending the memory. This memory stores contextual information and can be seen as a gated cell that regulates the flow of information, deciding whether to store or delete information based on the importance it assigns to it. The assigning of importance happens through weights, which are also learned by the algorithm; this simply means the network learns over time what information is important and what is not. In addition, a parallel stream of information (the cell state) is maintained, through which gradients can backpropagate without vanishing. By doing that, the network can pass relevant information down a long chain of sequences to make predictions.

Let's now see how memory is added to an LSTM cell. It has three gates: the forget gate, the input gate and the output gate.

Forget Gate — Decides what information should be thrown away or kept. Information from the previous hidden state and information from the current input are passed through the sigmoid activation. The sigmoid activation is similar to tanh, but instead of squishing values between -1 and 1, it squishes them between 0 and 1. If we want a piece of information to go forward ("to be remembered"), we multiply it by a value close to 1; if we don't want it to go forward ("to be forgotten"), we multiply it by a value close to 0. Basically, the sigmoid acts as a router which controls what passes through and what doesn't.

Input Gate — To update the cell state, we have the input gate. First, we pass the previous hidden state and the current input into a sigmoid function. That decides which values will be updated by transforming them to be between 0 and 1; 0 means not important, and 1 means important. We also pass the hidden state and the current input into a tanh function to squish values between -1 and 1 to help regulate the network. Then we multiply the tanh output with the sigmoid output: the sigmoid output decides which information from the tanh output is important to keep.

Cell State — This is nothing but the context vector. First, the cell state gets elementwise multiplied by the forget vector, dropping information from the context vector. Then we take the output from the input gate and do an elementwise addition, which adds the new information that the network finds relevant. That gives us the new cell state/context vector.

Output Gate — The output gate decides what the next hidden state should be. Remember that the hidden state contains information on previous inputs, and it is also used for predictions. First, we pass the previous hidden state and the current input into a sigmoid function. Then we pass the newly modified cell state through the tanh function. We multiply the tanh output with the sigmoid output to decide what information the hidden state should carry. The output is the hidden state. The new cell state and the new hidden state are then carried over to the next time step.

LSTM Cell

  • First, concatenate the last output (hidden state) vector and the input vector into one single concatenated vector, and pass it to the forget gate (a sigmoid), which forgets information from the memory.
  • Next, the same concatenated vector is passed to the input gate to add information to the context vector. However, we would like to add only the relevant information: from the input vector and the last output vector (the previous hidden state), we do not want to add everything, so we filter out the information which is not relevant right now (in the current time step).
  • The context vector (cell state) holds all the information plus the memory, and we maintain it end to end; gradients backpropagate through it from the output all the way back to the first input.
  • The cell state could in principle be used directly as the output, but we cannot do that since it carries the entire memory. Before we pass data to the output, we remove the parts that we don't want to go out or that are not relevant for the current time step: we first pass the context vector through a tanh, since the output of the cell should stay between -1 and +1 (the context vector is allowed to maintain its amplitude, but the input and output are not), and then a sigmoid gate, looking at the input and the previous hidden state, removes the memory we don't want to pass into the output vector.
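
Putting the gates together, here is a minimal sketch of one LSTM time step (stacking the four gate weight matrices into a single W and splitting the result is just an implementation convenience I am assuming, not a requirement):

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W maps the concatenated [h_prev, x_t] vector to the
    four gate pre-activations stacked together; b is the matching bias."""
    z = np.concatenate([h_prev, x_t]) @ W + b
    f, i, o, g = np.split(z, 4)
    f = sigmoid(f)                  # forget gate: what to drop from the memory
    i = sigmoid(i)                  # input gate: which new values to let in
    o = sigmoid(o)                  # output gate: what to expose as the hidden state
    g = np.tanh(g)                  # candidate values to add to the cell state
    c_t = f * c_prev + i * g        # new cell state / context vector
    h_t = o * np.tanh(c_t)          # new hidden state, passed to the next time step
    return h_t, c_t
```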

GRU

A slightly more dramatic variant of the LSTM is the Gated Recurrent Unit, or GRU.

  • It combines the forget and input gates into a single “update gate”. It decides what information to throw away and what new information to add.
  • It also merges the cell state and the hidden state.
  • It touches the memory twice: first the reset gate (using the old state and the new input) decides how much past information to forget when forming the candidate state, and then the update gate blends that candidate into the final output.
  • The old hidden state (together with the input) is used both for its own update and for deciding what to output, as sketched below.
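
Reusing the sigmoid helper from the LSTM sketch, one GRU time step could be sketched like this (the weight names and the split into three matrices are my own assumptions for readability):

```python
def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU time step: the reset gate r controls how much of the past feeds
    the candidate state, the update gate z blends old state and candidate."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(hx @ W_z + b_z)                    # update gate
    r = sigmoid(hx @ W_r + b_r)                    # reset gate
    h_cand = np.tanh(np.concatenate([r * h_prev, x_t]) @ W_h + b_h)
    return (1 - z) * h_prev + z * h_cand           # merged hidden state = output
```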

GRUs have been growing increasingly popular and are a default alternative to LSTMs. GRUs are used slightly more often due to training speed, as there are fewer computations. The GRU is simpler than the LSTM; however, GRUs have not been proven to be better than LSTMs, nor have LSTMs been proven better than GRUs. There isn't a clear winner. Researchers and engineers usually try both to determine which one works better for their use case.

Encoder-Decoder Architecture in RNNs/LSTMs

In most of the LSTM or RNN setups we have seen above, we produce an output word by word at each time step, but sometimes the translation or the context does not depend only on the immediate input; it depends on all the outputs or inputs which came in the past. For instance, in language translation, suppose we have to translate "How are you?" to Hindi: the translation depends on the words being used and their positions in the sentence rather than a direct word-by-word translation.

In such cases, the right approach could be to pass the entire input sequence to an LSTM/GRU layer (the encoder), convert the output of the last encoder hidden state into a context vector, and then feed that to another LSTM/GRU layer (the decoder), which can look at the complete context that came in and produce the translation. This is called the encoder-decoder architecture.
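
As a minimal sketch of that flow (encoder_step and decoder_step are assumed to be recurrent steps like the ones sketched earlier; the function and its arguments are illustrative, not a real API):

```python
def encode_decode(source_vectors, target_len, encoder_step, decoder_step,
                  start_vector, hidden_dim=64):
    """The encoder compresses the whole input into a single context vector c,
    which then initializes the decoder that generates the output sequence."""
    h = np.zeros(hidden_dim)
    for x_t in source_vectors:        # the encoder reads the entire input first
        h = encoder_step(x_t, h)
    c = h                             # the single context vector "c"
    s, y, outputs = c, start_vector, []
    for _ in range(target_len):       # the decoder emits one output per step
        y, s = decoder_step(y, s)     # every later step sees only its own state s
        outputs.append(y)
    return outputs
```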

The problem with this architecture is that if there is a long input sentence, the single context vector "c" needs to capture everything, which can cause information loss when the decoder tries to decipher the information packed into this single vector. To solve this problem, the attention mechanism came next.

RNN/LSTM with an Attention Mechanism

In the attention mechanism, instead of passing only a single context vector from the last encoder step to the decoder, we pass a vector representation from every encoder time step. This helps the network focus on the right words of the input sequence so that it can make appropriate translations.

All the encoded states from the encoder, along with the current decoder state, go to the attention block. This attention block assigns each encoded state a score, often called alpha, which indicates how much of that encoded state should go forward. These alphas are passed through a softmax and then multiplied with the encoded hidden states, which are then aggregated (summed) to get the context vector. Attention thus places different focus on different words by assigning each word an alpha score.
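
A minimal sketch of that attention step, assuming a simple dot-product score between an encoder state and the decoder state (the attention block could equally be a small learned network):

```python
def softmax(v):
    e = np.exp(v - v.max())            # subtract the max for numerical stability
    return e / e.sum()

def attention_context(encoder_states, s_prev, score):
    """Score every encoder state against the previous decoder state, softmax the
    scores into alphas, and return the alpha-weighted sum as the context vector."""
    alphas = softmax(np.array([score(h, s_prev) for h in encoder_states]))
    context = sum(a * h for a, h in zip(alphas, encoder_states))
    return context, alphas

def dot_score(h, s):
    return h @ s                       # assumed scoring function for this sketch
```

With encoder_states = [h1, h2, h3] and s_prev = s0, the returned context plays the role of c1 in the walkthrough below.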

Let us consider the same example as above, "How are you?". First, the LSTM/GRU encoder layer produces the encoder hidden states h1, h2, h3, which are passed to the attention layer along with the first decoder hidden state s0 to produce the scores a1, a2, a3. These alphas are passed through a softmax and then multiplied with the corresponding encoder hidden states h1, h2, h3; these vectors are summed up, and this becomes our first context vector c1. This context vector c1 is then passed on to the decoder. The decoder takes the first context vector and the initial (zero) state s0, predicts the first word y1, and also outputs a decoder hidden state s1.

Next, the encoder hidden states h1, h2, h3 are again combined with the second decoder hidden state s1 to produce another set of alpha values a1, a2, a3. These new alphas are passed through a softmax, multiplied with the corresponding encoder hidden states h1, h2, h3, and summed up to get the second context vector c2. This context vector c2 is passed on to the decoder along with s1 to predict y2.

We repeat the same process, and in the end all three encoder vectors have been consumed; with attention focused on specific vectors at each step, the prediction is made step by step so that we get well-formed translations.

With the introduction of revolutionary models like GPT-3, RNNs/LSTMs and GRUs are no longer the de facto algorithms being used; however, they are still relevant!

Hope this was helpful. Thanks for reading!
