Long Short-Term Memory

[Figure: LSTM model diagram (dl_model_lstm.png). Source: Wikipedia.org]

Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) architecture used in the field of deep learning. Unlike standard feedforward neural networks, an LSTM has feedback connections, which in principle make it a "general-purpose computer" (it can compute anything that a Turing machine can). It can process not only single data points (such as images), but also entire sequences of data (such as speech or video).

What is LSTM?

LSTMs are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber in 1997, and were refined and popularized in much subsequent work. They work tremendously well on a large variety of problems, and are now widely used.

Furthermore, they are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer.
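As a rough illustration (not part of the original text), one step of such a standard RNN module can be sketched as a single tanh layer over the previous hidden state and the current input; the names W_h and b_h below are illustrative placeholders for the module's weights and bias.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_h, b_h):
    """One step of a vanilla RNN: a single tanh layer over [h_{t-1}, x_t]."""
    concat = np.concatenate([h_prev, x_t])   # stack previous hidden state and current input
    return np.tanh(W_h @ concat + b_h)       # new hidden state h_t
```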

The Core Idea Behind LSTMs

The key to LSTMs is the cell state, the horizontal line running through the top of the standard LSTM diagram.

The cell state is like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s easy for information to just flow along it unchanged.

The LSTM has the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation.

An LSTM has three of these gates, to protect and control the cell state.
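To make the gating idea concrete, here is a minimal sketch (illustrative, not from the original text): the sigmoid layer outputs values between 0 and 1, and the pointwise multiplication uses them to decide how much of each component is let through.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(x, W, b, signal):
    """A gate: a sigmoid layer followed by a pointwise multiplication."""
    g = sigmoid(W @ x + b)   # values near 0 mean "let nothing through", near 1 mean "let everything through"
    return g * signal        # scale each component of the gated signal
```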

Step-by-Step LSTM Walk Through

  1. The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.”
f_t = sigmoid(W_f.concat(h_{t-1}, x_t) + b_f)
  2. The next step is to decide what new information we’re going to store in the cell state. A sigmoid “input gate layer” decides which values to update, and a tanh layer creates a vector of new candidate values, C_t_hat.
i_t = sigmoid(W_i.concat(h_{t-1}, x_t) + b_i)
C_t_hat = tanh(W_C.concat(h_{t-1}, x_t) + b_C)
  3. Subsequently, we update the old cell state C_{t-1} into the new cell state C_t.
C_t = f_t * C_{t-1} + i_t * C_t_hat
  4. Finally, we decide what we're going to output. This output will be based on our cell state, but will be a filtered version.
o_t = sigmoid(W_o.concat(h_{t-1}, x_t) + b_o)
h_t = o_t * tanh(C_t)
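Putting the four steps together, here is a minimal sketch of a single LSTM cell step in NumPy. It follows the equations above directly; the weight matrices (W_f, W_i, W_C, W_o) and biases are assumed to be already initialized, and each acts on the concatenation of h_{t-1} and x_t.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM step following the four equations above.

    Each W_* has shape (hidden, hidden + input); each b_* has shape (hidden,).
    """
    concat = np.concatenate([h_prev, x_t])   # concat(h_{t-1}, x_t)

    f_t   = sigmoid(W_f @ concat + b_f)      # forget gate: what to discard from C_{t-1}
    i_t   = sigmoid(W_i @ concat + b_i)      # input gate: which values to update
    C_hat = np.tanh(W_C @ concat + b_C)      # candidate values for the cell state

    C_t = f_t * C_prev + i_t * C_hat         # new cell state
    o_t = sigmoid(W_o @ concat + b_o)        # output gate: which parts of the cell state to expose
    h_t = o_t * np.tanh(C_t)                 # new hidden state (a filtered version of the cell state)

    return h_t, C_t
```

Stepping this function over a sequence, feeding each returned h_t and C_t back in at the next step, gives the full recurrent computation.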

Advantages of LSTMs

LSTMs have the advantage of learning long-term dependencies. That is, they can remember information for long periods of time by default, something that standard RNNs and many other sequence models struggle to do.

LSTMs are also very flexible and can be combined in many ways to achieve different results. For example, you can stack multiple LSTM layers into a deep LSTM network.
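As one hedged example of stacking (using PyTorch, which is not mentioned in the original text), nn.LSTM takes a num_layers argument so that the hidden-state sequence produced by one LSTM layer becomes the input sequence of the next; the sizes below are arbitrary.

```python
import torch
import torch.nn as nn

# A two-layer ("deep") LSTM: layer 1's output sequence feeds layer 2.
stacked_lstm = nn.LSTM(input_size=16, hidden_size=32, num_layers=2, batch_first=True)

x = torch.randn(4, 10, 16)            # (batch, sequence length, input features)
output, (h_n, c_n) = stacked_lstm(x)
print(output.shape)                   # torch.Size([4, 10, 32]): hidden states of the top layer
```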

Disadvantages of LSTMs

However, LSTMs can be quite tricky to train properly, because they are sensitive to the specific settings of hyperparameters and require careful initialization.

Secondly, LSTMs suffer from a lack of interpretability; they are often thought of as "black boxes" because it is difficult to understand what is happening inside them.