In our previous article we spoke about one of the most commonly used Neural Layers - the Convolutional Layer - and understood its inner workings.
In today's article we will dive deeper into another one of the most prominent layers - the LSTM (Long Short-Term Memory) Layer. So before we set right off with it, let's try and understand: Why do we need it?
Why do we need it?
As we have been mentioning time and time again, every Layer has its own unique functionality, or in other terms -
Every Layer has an Affinity towards a Unique problem statement or Data Type.
Now as we noticed in our previous chapter, Convolutional Layers are primarily used for Feature Extraction from Images and Audio files.
In the same manner, LSTMs are primarily used for Time Series problems, or with Data Types that have dependencies flowing through them over time - basically, Sequential Models.
After reading the use case, the next obvious question in your mind would be: What do you mean by Time Series or Sequential Data?
Well simply put -
If Data can be broken down into Frames on the basis of Time, and each Frame has a dependency on its neighbouring Frames, then the data is called Time Series or Sequential Data.
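To make that definition concrete, here is a minimal sketch in plain Python of breaking a series into Frames. The window size of 3 and the temperature readings are made-up choices purely for illustration:

```python
def make_frames(series, window=3):
    """Split a sequence into (input_frame, next_value) pairs."""
    frames = []
    for i in range(len(series) - window):
        # each frame of `window` readings depends on the value
        # that immediately follows it in time
        frames.append((series[i:i + window], series[i + window]))
    return frames

# hypothetical daily temperature readings, purely for illustration
temperatures = [21, 22, 24, 23, 25, 26]
pairs = make_frames(temperatures)
# first pair: ([21, 22, 24], 23) - a frame and its dependent neighbour
```

Each frame carries information needed to predict its neighbour, which is exactly the dependency structure sequential models exploit.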
Now that we understand Why and Where we need it, we should definitely take a look at What they are.
What are they?
Before we answer that question, we need to walk down memory lane a bit. Yes, I know it's ironic, but still.
So originally, before what came to be known as the LSTM, these layers were an aggregation of neurons that people used to call Recurrent Neural Networks (RNNs).
So What are RNNs (Recurrent Neural Network)?
Recurrent networks … have an internal state that can represent context information. … [they] keep information about past inputs for an amount of time that is not fixed a priori, but rather depends on its weights and on the input data.
A recurrent network whose inputs are not fixed but rather constitute an input sequence can be used to transform an input sequence into an output sequence while taking into account contextual information in a flexible way
-Yoshua Bengio, et al., Learning Long-Term Dependencies with Gradient Descent is Difficult, 1994.
So what we are basically trying to say is that RNNs are a group of neurons looped together to persist information and pass it along to the next unit.
As you can clearly see from the above illustration, every Neuron (Blue) processes information within its Unit and passes it into the next.
Now as every piece of information is processed and transformed at every stage by multiple neurons, it naturally tends to lose a lot of its Original context by the time it reaches the end of the chain.
Theoretically speaking, RNNs should be able to retain all the information as they pass it along in sequence, but in practice it doesn't quite work out that way.
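To see what that recurrence actually looks like, here is a minimal sketch of a single RNN cell in NumPy. The sizes and random weights are arbitrary stand-ins for illustration; a real network would learn them during training:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inputs = 4, 3                            # arbitrary sizes for illustration
W_h = rng.normal(size=(hidden, hidden)) * 0.1    # state-to-state weights (stand-in)
W_x = rng.normal(size=(hidden, inputs)) * 0.1    # input-to-state weights (stand-in)
b = np.zeros(hidden)

def rnn_step(h_prev, x):
    """One recurrent step: the new state mixes the previous state with the input."""
    return np.tanh(W_h @ h_prev + W_x @ x + b)

h = np.zeros(hidden)                     # initial state
for x in rng.normal(size=(5, inputs)):   # the loop is the recurrence
    h = rnn_step(h, x)
# h now summarises the whole sequence, but the first input's influence
# has been squashed through tanh five times - context fades.
```

Because every step re-transforms the entire state, the early inputs' contribution shrinks at each hop, which is precisely the retention problem discussed next.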
Limitations of RNN
As we are aware, RNNs maintain a State where information persists, and the Original data is processed and carried forward.
We all know, Context is key
But because there is no selective filtration during Feature Transformation, it often happens that relevant information gets lost in the process.
Let's take an example to understand this better.
Suppose I have a sentence that says - "The fish swims in water". Now, to predict the word "WATER", there is enough context within that statement.
But what if I have a sentence like - "My name is Mark. I am from Italy..... I love Italian Food". In this case, in order to predict "ITALIAN FOOD", one would need more context than just what's in the current statement, and it often happens that RNNs fail to retain such information and pass it forward.
So how do we solve it?
LSTM saved my Information
LSTMs theoretically leverage the same concept of having recurrent modules. But the catch with LSTMs is that their information is not transformed by a single simple activation function such as Tanh; rather, it passes through various stages.
No need to worry, we will go through each one of them in detail. So what are these stages?
Stages of LSTM
The LSTM layer processes information by selectively filtering and transforming it across different stages. These stages are -
- Forget Gate Layer
- Input Gate Layer
- Output Gate Layer
- Cell State
Cell State - This is the Conveyor Belt of the LSTM. It carries information almost linearly across the repeated modules, with some Linear Alterations caused by the various gates.
Forget Gate Layer - Every time a module receives the state of the previous cell, it uses the Sigmoid Function to decide How much of the Previous State or information should be retained in the cell state.
In this layer the Sigmoid function returns a number between 0 and 1, which acts as the coefficient deciding the Amount of Information to be carried forward in the cell state.
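As a rough sketch, the forget gate can be written in NumPy as below. The weights here are random stand-ins (a trained layer learns them), and the sizes are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
hidden, inputs = 4, 3                                    # arbitrary sizes
W_f = rng.normal(size=(hidden, hidden + inputs)) * 0.1   # stand-in weights
b_f = np.zeros(hidden)

def forget_gate(h_prev, x):
    """Coefficients in (0, 1): 1 keeps that cell-state entry, 0 forgets it."""
    return sigmoid(W_f @ np.concatenate([h_prev, x]) + b_f)

f = forget_gate(np.zeros(hidden), rng.normal(size=inputs))
c_prev = np.ones(hidden)       # dummy previous cell state
c_kept = f * c_prev            # element-wise scaling of the old state
```

The element-wise product is what makes the filtering selective: each entry of the cell state gets its own keep-or-forget coefficient.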
Now that you have decided which information to retain from the previous state, it is only logical to introduce some Novelty into your features based on your Current input.
Input Gate Layer - Based on the received state, we use another Sigmoid layer to decide which values we want to update, using the same logic as before.
Then we use the Tanh function to propose some novel candidate features based on the current input.
Both of these outputs are then multiplied element-wise, so that only the selected values get updated, and the result is added to the Conveyor Belt - basically, the cell state.
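The input-gate stage above can be sketched the same way - a Sigmoid choosing which entries to update, a Tanh proposing candidate values, and an element-wise product of the two. Again, the weights and sizes are made-up stand-ins:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
hidden, inputs = 4, 3
concat = hidden + inputs
W_i = rng.normal(size=(hidden, concat)) * 0.1   # input gate (stand-in)
W_c = rng.normal(size=(hidden, concat)) * 0.1   # candidate values (stand-in)
b_i, b_c = np.zeros(hidden), np.zeros(hidden)

def input_update(h_prev, x):
    z = np.concatenate([h_prev, x])
    i = sigmoid(W_i @ z + b_i)        # which entries to update, in (0, 1)
    c_tilde = np.tanh(W_c @ z + b_c)  # novel candidate features, in (-1, 1)
    return i * c_tilde                # only the selected candidates survive

delta = input_update(np.zeros(hidden), rng.normal(size=inputs))
c_new = 0.5 * np.ones(hidden) + delta  # added onto the conveyor belt
```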
Finally, we need to decide what information gets passed on to the next module.
Output Gate Layer - The output gate mirrors the Input Gate Layer. It uses a Sigmoid function to decide which values to pass forward, then uses a Tanh function to normalise the features of the cell state, multiplies the two together, and passes the result forward.
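Putting all four stages together, one full step of an LSTM module might be sketched as follows. The weights are random stand-ins, and a single zero bias is shared across gates for brevity - a real layer learns separate ones:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
hidden, inputs = 4, 3
concat = hidden + inputs
# one stand-in weight matrix per stage; real layers learn these
W_f, W_i, W_c, W_o = (rng.normal(size=(hidden, concat)) * 0.1
                      for _ in range(4))
b = np.zeros(hidden)   # single zero bias shared for brevity

def lstm_step(h_prev, c_prev, x):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z + b)          # forget gate: what to keep
    i = sigmoid(W_i @ z + b)          # input gate: what to update
    c_tilde = np.tanh(W_c @ z + b)    # candidate values
    c = f * c_prev + i * c_tilde      # update the conveyor belt
    o = sigmoid(W_o @ z + b)          # output gate: what to emit
    h = o * np.tanh(c)                # normalised, filtered output
    return h, c

h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(5, inputs)):  # repeated across the sequence
    h, c = lstm_step(h, c, x)
```

Note how the cell state `c` is only ever scaled and added to - never fully re-transformed - which is what lets information travel far along the conveyor belt.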
This same process repeats over and over across the modules, and each iteration during the training of the Architecture updates the weights and biases of the Functions in the different stages of the module.
LSTMs are very good at retaining Long Term information, as that is their default nature, but thanks to their selective filtering they can also retain relevant Short Term information from close neighbours.
Now if you have been able to follow the article through, then Congratulations - you have successfully demystified and understood one of the most complex architectures in Neural Networks.
This article may not do full justice to LSTMs or Neural Networks, as it only explains the concepts at a high level. But to all the enthusiasts reading this post,
STAY TUNED for more content around the same. 😁