Before reading this blog article, if I ask you what a Neural Network is, will you be able to answer? Learning about Deep Learning algorithms is a good thing, but it is more important to have your basics clear. Please go through the Neural Network tutorial (blog) if you have not done so already.
Once you have read the Neural Network Tutorial, let’s dive into the Recurrent Neural Network.
What Is A Recurrent Neural Network?
Simply put, a Recurrent Neural Network (RNN) is a class of Artificial Neural Network designed to work with sequential data.
What Differentiates A Recurrent Neural Network From A Traditional Neural Network?
In a traditional Neural Network, all inputs (and outputs) are assumed to be independent of each other. This is not the case with a Recurrent Neural Network. In a Recurrent Neural Network, the output at a given step depends on the inputs seen at earlier steps as well as on the current input.
Why Do We Need Input Or Output To Be Dependent?
Consider an example where you want to predict the next word in a sentence:
“Ram lives in India. He speaks fluent ……”
A good prediction requires knowing that “He” refers to Ram and that the country he lives in is India. Given this context, a suitable word, in my view, is “Hindi”. If you don’t know the first sentence (Ram lives in India), it would be difficult to predict the word “Hindi”, wouldn’t it?
Why Is It Called Recurrent Network?
RNNs are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous computations. We can assume RNNs to be Neural Networks that have a “memory” which captures information about what has been calculated so far.
Back Propagation vs. Back Propagation Through Time:
To be honest, I do not see any fundamental difference between Back Propagation and Back Propagation Through Time, since both use the same underlying algorithm: the chain rule applied to the computation graph (that is, the neural architecture) to calculate gradients of a loss function with respect to parts of the graph, especially the parameters.
The reason it is called “Back-Propagation Through Time” is to signify that this algorithm is being applied to a temporal neural model (Recurrent Neural Network or RNN) and nothing else.
In an RNN, we unfold the network over the time steps, or elements, of a sequence (sharing the same parameters at every step) to create one very deep (in time) Neural Network. You can think of it this way: we unfold it over a variable number of time steps, according to the number of elements that come before the target to be predicted. This unfolding, followed by ordinary back-propagation through the unfolded network, is essentially what “Back Propagation Through Time” refers to. In other words, Back-Propagation Through Time effectively applies classical Back-Propagation of Errors to RNNs.
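To make the idea concrete, here is a minimal sketch of Back Propagation Through Time on a toy RNN. It assumes a scalar hidden state with a tanh activation and a squared-error loss on the final state only; the names `forward`, `loss` and `bptt_grad_w` and all the numbers are made up for illustration. The gradient for the shared weight `w` is accumulated while walking backwards over the unrolled steps, and then compared with a finite-difference estimate.

```python
import math

def forward(w, u, xs):
    """Unroll the toy RNN s_t = tanh(w * s_{t-1} + u * x_t), starting from s_0 = 0."""
    states = [0.0]
    for x in xs:
        states.append(math.tanh(w * states[-1] + u * x))
    return states

def loss(w, u, xs, y):
    """Squared error between the final state and the target y."""
    return 0.5 * (forward(w, u, xs)[-1] - y) ** 2

def bptt_grad_w(w, u, xs, y):
    """Gradient of the loss with respect to the shared weight w."""
    states = forward(w, u, xs)
    grad_w = 0.0
    d_state = states[-1] - y                      # dL/ds_T
    for t in range(len(xs), 0, -1):               # walk back through time
        d_pre = d_state * (1.0 - states[t] ** 2)  # back through tanh
        grad_w += d_pre * states[t - 1]           # w is reused at every step
        d_state = d_pre * w                       # pass the error to s_{t-1}
    return grad_w

# Illustrative values only.
w, u = 0.5, 0.8
xs = [1.0, -0.5, 0.3, 0.9, -0.2]
y = 0.4

analytic = bptt_grad_w(w, u, xs, y)
eps = 1e-6
numeric = (loss(w + eps, u, xs, y) - loss(w - eps, u, xs, y)) / (2 * eps)
print(analytic, numeric)  # the two values should agree closely
```

The backward loop is plain chain rule. The only thing specific to the RNN is that the contributions to `grad_w` from every time step are summed, because the same `w` is used at each step.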
For example, if the sequence we are talking about is a sentence of 5 words, the network would be unrolled into a 5-layer Neural Network, one layer for each word. The quantities involved in the computation happening in an RNN are as follows:
- Xt is the input at time step t.
- St is the hidden state at time step t. It’s the “memory” of the network. It is calculated based on the previous hidden state and the input at the current step.
- Ot is the output at step t. For example, if we wanted to predict the next word in a sentence it would be a vector of probabilities across our vocabulary.
In short, the main feature of an RNN is its hidden state, which captures some information about a sequence and uses it accordingly whenever needed.
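The quantities above can be put together in a short forward pass. This is only a sketch under common assumptions that the text does not spell out: the hidden state is computed as St = tanh(U·Xt + W·St−1) and the output as Ot = softmax(V·St). The weight matrices `U`, `W`, `V`, the sizes, and the word indices are all hypothetical.

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, U, W, V):
    """Run a plain RNN over a sequence of input vectors.

    Returns the hidden state and the output distribution at every step.
    """
    s = np.zeros(W.shape[0])            # initial hidden state ("empty memory")
    states, outputs = [], []
    for x in xs:
        s = np.tanh(U @ x + W @ s)      # new state from input and previous state
        o = softmax(V @ s)              # probabilities over the vocabulary
        states.append(s)
        outputs.append(o)
    return states, outputs

# Toy setup: a vocabulary of 8 words, a hidden state of size 4.
rng = np.random.default_rng(0)
vocab_size, hidden_size = 8, 4
U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

# A "sentence" of 5 words, each encoded as a one-hot vector.
sentence = [np.eye(vocab_size)[i] for i in [3, 1, 4, 1, 5]]
states, outputs = rnn_forward(sentence, U, W, V)
print(len(states), outputs[-1].shape)
```

Note that the same `U`, `W` and `V` are used at every step of the loop. That is the parameter sharing mentioned earlier.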
Vanishing/Exploding Gradient Problem:
In theory, an RNN can handle long sequences very effectively. In practice, however, this is not the case. In a network with many layers (and an RNN unrolled over many time steps is exactly that), a squashing activation such as sigmoid or tanh makes the first layer map a large input region to a smaller output region, which is mapped to an even smaller region by the second layer, and so on. As a result, even a large change in the parameters of the first layer doesn’t change the output much. If a change in a parameter’s value causes only a very small change in the network’s output, the gradient with respect to that parameter is tiny and the network can’t learn the parameter effectively. This is the Vanishing Gradient Problem.
You can think of it in this way:
As we have discussed earlier, a Recurrent Neural Network applies a transformation to its state at each time step. Since the network uses the same weight matrix at every step, the transformation is the same each time. During back-propagation, the gradient of the loss travels backwards through these steps and is therefore multiplied by the same factor again and again. The steps are coupled: if that factor scales the gradient down, it is scaled down at every step, and if it scales the gradient up, it is scaled up at every step. Over many time steps this repeated scaling makes it much more likely for the gradients to vanish or explode.
One way to deal with this problem is to encourage the transformation that is applied to the states to roughly preserve the scale. That’s why Long Short-Term Memory networks (LSTMs) compute the next cell state by multiplying the previous cell state element-wise with the forget gate, a vector of values between 0 and 1 that can stay close to 1 wherever information should be kept, and then adding the new information.
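The effect of repeated scaling is easy to see in a deliberately simplified case. The sketch below assumes a linear, scalar recurrence s_t = w · s_{t-1} with no activation function, so the gradient of the last state with respect to the first is just w multiplied by itself once per step. The function name `gradient_scale` and the chosen values are illustrative.

```python
def gradient_scale(w, steps):
    """Gradient of s_T with respect to s_0 for the recurrence s_t = w * s_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w          # the same factor is applied at every time step
    return grad

for w in (0.9, 1.0, 1.1):
    print(w, gradient_scale(w, steps=50))
```

With a factor slightly below 1 the gradient shrinks towards zero over 50 steps, with a factor slightly above 1 it blows up, and only a factor of about 1 preserves its scale. A forget gate close to 1 aims for that last case.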
Long Short-Term Memory And Gated Recurrent Unit
To solve the problem of Vanishing Gradient, we use modified versions of RNNs – Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM).
The LSTM can remove or add information to the cell state, carefully regulated by structures called gates.
The GRU, on the other hand, controls the flow of information like the LSTM unit, but without a separate memory cell. It exposes its full hidden state at every step, with no output gate controlling how much of it is seen.
Let’s talk about LSTM and GRU in detail.
Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”
An LSTM has three of these gates. An “input” gate controls the extent to which a new value flows into the memory; a “forget” gate controls the extent to which a value remains in memory; and an “output” gate controls the extent to which the value in memory is used to compute the output activation of the block. Together, these gates protect and control the cell state, along which information flows.
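A gate by itself is a very small piece of computation. The sketch below shows only the mechanism. In a real LSTM the numbers fed to the sigmoid come from a learned layer, whereas here they are passed in directly as `gate_logits`, a hypothetical name, to keep the example short.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def apply_gate(gate_logits, values):
    """Let each component of `values` through in proportion to its gate."""
    gate = sigmoid(gate_logits)     # one number between 0 and 1 per component
    return gate * values            # pointwise multiplication

values = np.array([2.0, -1.0, 0.5])
gate_logits = np.array([10.0, 0.0, -10.0])   # open, half open, closed
gated = apply_gate(gate_logits, values)
print(gated)
```

The first component passes almost unchanged, the second is halved, and the third is almost completely blocked.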
Step By Step Walkthrough Of LSTM:
The first step in LSTM is to decide what information you are going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It gives a value between 0 and 1, where a 1 represents “keep this as it is” while a 0 represents “get rid of this.”
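Next, we must decide what new information we’re going to store in the cell state. This step has two parts: first, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values that could be added to the state. In the next step, these two are combined to create an update to the state.
It is now time to update the old cell state, Ct−1, into the new cell state Ct. The previous steps have already decided what to do; we only need to do it. We multiply the old state by the output of the forget gate, dropping the things we decided to forget, and then add the candidate values scaled by the output of the input gate.
Finally, we need to decide what we’re going to output. The output is a filtered version of the cell state: a sigmoid layer (the output gate) decides which parts of the cell state to output, then the cell state is passed through tanh and multiplied by the output of that gate.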
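The four steps of the walkthrough can be written as one function. This is a minimal sketch of a standard LSTM step, not a tuned implementation. It assumes each layer sees the previous hidden state concatenated with the current input, and the parameter names (`Wf`, `Wi`, `Wc`, `Wo` and the matching biases) and sizes are chosen only for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step. Returns the new hidden state and cell state."""
    z = np.concatenate([h_prev, x])

    # Step 1: forget gate, what to throw away from the old cell state.
    f = sigmoid(params["Wf"] @ z + params["bf"])

    # Step 2: input gate and candidate values, what new information to store.
    i = sigmoid(params["Wi"] @ z + params["bi"])
    c_candidate = np.tanh(params["Wc"] @ z + params["bc"])

    # Step 3: update the old cell state into the new one.
    c = f * c_prev + i * c_candidate

    # Step 4: output gate, a filtered view of the cell state.
    o = sigmoid(params["Wo"] @ z + params["bo"])
    h = o * np.tanh(c)
    return h, c

# Toy setup with random weights.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
params = {}
for name in ("f", "i", "c", "o"):
    params["W" + name] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params["b" + name] = np.zeros(hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):   # a sequence of 5 inputs
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)
```

The cell state `c` is only ever scaled by the forget gate and added to, which is what lets information survive across many steps.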
This is it as far as LSTM is concerned. Today many people use LSTMs instead of the basic RNN, and they work tremendously well on a diverse set of problems. Most of the remarkable results attributed to RNNs are achieved with LSTMs, to the point that when someone talks about using an RNN, they often actually mean an LSTM.
There are a large number of variations of LSTM that are used today. One such reasonable variation of the LSTM is the Gated Recurrent Unit, or GRU. It combines the forget and input gates into a single “update gate”. It also merges the cell state and hidden state, and makes some other changes in the way the output is given. The resulting model is simpler than standard LSTM models, and has been quite well received in the Data Science community.
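For comparison with the LSTM, here is a minimal sketch of one GRU step. It follows one common formulation, which is an assumption here because conventions differ between papers and libraries: an update gate `u` blends the old state with a candidate state, and a reset gate `r` (one of the “other changes”, not described above) controls how much of the old state feeds the candidate. All parameter names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, params):
    """One GRU time step. There is no separate cell state, only h."""
    z = np.concatenate([h_prev, x])
    u = sigmoid(params["Wu"] @ z + params["bu"])    # update gate
    r = sigmoid(params["Wr"] @ z + params["br"])    # reset gate

    # Candidate state, computed from the input and a reset-scaled old state.
    z_reset = np.concatenate([r * h_prev, x])
    h_candidate = np.tanh(params["Wh"] @ z_reset + params["bh"])

    # A single gate decides both how much to forget and how much to add.
    return (1.0 - u) * h_prev + u * h_candidate

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
params = {}
for name in ("u", "r", "h"):
    params["W" + name] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params["b" + name] = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):
    h = gru_step(x, h, params)
print(h.shape)
```

With three weight matrices instead of four and a single state vector, the GRU has fewer parameters than an LSTM of the same hidden size.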
It has been observed that LSTM tends to work better on larger datasets, while GRU, which has fewer parameters, can work better on smaller ones. However, there is no hard and fast rule as such.
Limitations of RNN:
We have already observed that a simple RNN struggles with the problem of Vanishing Gradient. This is the reason why LSTM was introduced. Now, are there any limitations to LSTM?
The answer is YES.
Apart from the fact that it is quite complex to understand at first, LSTM is slower to train than simpler models. With careful initialization and training, a simple RNN can perform on par with LSTM, with less computational complexity. When recent information is more important than old information, there is no doubt that the LSTM model is a better choice. However, you will find that there are problems where you want to look into the distant past; in such cases a newer mechanism called the “attention mechanism” is becoming popular. A slightly modified version of this model is called the “Recurrent Weighted Average Network”. We will discuss the Recurrent Weighted Average Network in another article.
The Future Of Recurrent Neural Network
One more shortcoming of conventional LSTMs is that they are only able to make use of previous context. A variation that is becoming quite popular is the Bidirectional RNN (BRNN). BRNNs process data in both directions using two separate hidden layers, one reading the sequence forwards and one reading it backwards. Combining these two layers gives you complete information about the context around each element. BRNNs have been successfully used in speech recognition models.
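The bidirectional idea can be sketched with two plain RNN layers. This is an illustrative simplification that uses a simple tanh RNN rather than an LSTM for each direction and combines the two by concatenation, which is one common choice. The names `run_rnn` and `bidirectional_states` and all sizes are hypothetical.

```python
import numpy as np

def run_rnn(xs, U, W):
    """Hidden states of a plain tanh RNN over the sequence xs."""
    s = np.zeros(W.shape[0])
    states = []
    for x in xs:
        s = np.tanh(U @ x + W @ s)
        states.append(s)
    return states

def bidirectional_states(xs, fwd_params, bwd_params):
    """Combine a left-to-right pass and a right-to-left pass, step by step."""
    forward = run_rnn(xs, *fwd_params)
    # Read the sequence backwards, then flip the states back into input order.
    backward = run_rnn(xs[::-1], *bwd_params)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

def make_params():
    """Random weights for one direction (each direction has its own)."""
    return (rng.normal(scale=0.1, size=(hidden_size, input_size)),
            rng.normal(scale=0.1, size=(hidden_size, hidden_size)))

xs = [rng.normal(size=input_size) for _ in range(5)]
combined = bidirectional_states(xs, make_params(), make_params())
print(len(combined), combined[0].shape)
```

Each combined state now summarizes everything before that position (from the forward layer) and everything after it (from the backward layer).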
Sequence-to-sequence LSTMs, also called Encoder-Decoder LSTMs (a combination of two LSTMs), are an application of LSTMs that is receiving a lot of attention given their impressive capability in Question-Answer models (chatbots).
Time series prediction and anomaly detection are other areas where RNNs (LSTMs) seem quite promising. Given this wide range of problems where RNNs can be applied quite effectively, the future of RNNs seems quite bright.
If you want to work in the field of Natural Language Processing, it becomes almost imperative to learn Recurrent Neural Networks.
Look for the implementation of RNN in another blog article.