Sequential Modeling: RNN, LSTM, and GRU
Definition
Recurrent Neural Networks (RNNs) are neural architectures designed to process sequential data by maintaining hidden state that captures information from previous time steps. Unlike feedforward networks, RNNs share parameters across time steps, making them suitable for variable-length inputs. However, standard RNNs suffer from vanishing gradients when learning long-term dependencies. Long Short-Term Memory (LSTM), introduced by Hochreiter and Schmidhuber in 1997, solved this through a sophisticated gating mechanism with forget, input, and output gates. Gated Recurrent Unit (GRU), proposed by Cho in 2014, simplified this with update and reset gates while maintaining similar performance. These architectures dominated sequence modeling until Transformers emerged, and remain important for tasks with strong sequential inductive bias or limited compute.
Intuition
Imagine reading a book one word at a time while trying to remember what happened earlier. A simple RNN is like a person with limited working memory - they can only remember the immediate context and quickly forget what happened chapters ago. LSTM is like a well-organized filing system: it has a 'forget gate' to discard irrelevant information (like the color of a minor character's shirt), an 'input gate' to store important new details (like discovering the villain's identity), and an 'output gate' to decide what to use right now. The cell state acts as a conveyor belt running through time, allowing information to flow unchanged when needed. GRU simplifies this by combining the forget and input gates into an 'update gate' - instead of separately deciding what to forget and what to remember, it decides what fraction of the new information to blend with old memories. Both give the network selective memory, solving the vanishing gradient problem.
Mathematical Formula
Step-by-Step Explanation:
- RNN: Hidden state updates via tanh of linear combination of previous state and current input
- LSTM Forget Gate \(f_t\): Sigmoid output 0-1 determining what fraction of cell state to keep
- LSTM Input Gate \(i_t\): Sigmoid determining what fraction of candidate values to store
- LSTM Candidate \(\tilde{C}_t\): New candidate values created from current input and previous hidden state
- LSTM Cell State \(C_t\): Linear interpolation between old state (forgotten) and new candidate (added)
- LSTM Output Gate \(o_t\): Sigmoid determining what fraction of cell state to output as hidden state
- GRU Update Gate \(z_t\): Controls balance between keeping old state and accepting new candidate
- GRU Reset Gate \(r_t\): Determines how much past information to forget when computing candidate
- BPTT: Sum gradients across time steps since weights are shared
Real-World Use Cases
Google Translate (pre-Transformer) using bidirectional LSTM encoder-decoder
Apple Siri using LSTM acoustic models for phoneme prediction
Character-level RNNs generating Shakespeare-style prose or code
Stock price forecasting and anomaly detection in financial markets
Google Magenta using LSTM to compose original melodies
BiLSTM-CRF models extracting person names and organizations from text
Implementation
Manual Implementation (No Libraries)
import numpy as np
class LSTMCell:
def __init__(self, input_size, hidden_size):
self.input_size = input_size
self.hidden_size = hidden_size
# Xavier initialization
concat_size = input_size + hidden_size
self.W_f = np.random.randn(concat_size, hidden_size) * np.sqrt(2.0 / concat_size)
self.W_i = np.random.randn(concat_size, hidden_size) * np.sqrt(2.0 / concat_size)
self.W_c = np.random.randn(concat_size, hidden_size) * np.sqrt(2.0 / concat_size)
self.W_o = np.random.randn(concat_size, hidden_size) * np.sqrt(2.0 / concat_size)
self.b_f = np.zeros((1, hidden_size))
self.b_i = np.zeros((1, hidden_size))
self.b_c = np.zeros((1, hidden_size))
self.b_o = np.zeros((1, hidden_size))
def sigmoid(self, x):
return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
def tanh(self, x):
return np.tanh(x)
def forward(self, x_t, h_prev, c_prev):
# Concatenate previous hidden state and current input
concat = np.hstack([h_prev, x_t])
# Forget gate
f_t = self.sigmoid(np.dot(concat, self.W_f) + self.b_f)
# Input gate
i_t = self.sigmoid(np.dot(concat, self.W_i) + self.b_i)
# Candidate cell state
c_tilde = self.tanh(np.dot(concat, self.W_c) + self.b_c)
# Cell state update
c_t = f_t * c_prev + i_t * c_tilde
# Output gate
o_t = self.sigmoid(np.dot(concat, self.W_o) + self.b_o)
# Hidden state
h_t = o_t * self.tanh(c_t)
cache = (x_t, h_prev, c_prev, f_t, i_t, c_tilde, c_t, o_t, h_t, concat)
return h_t, c_t, cache
def backward(self, dh_t, dc_t, cache, learning_rate=0.01):
x_t, h_prev, c_prev, f_t, i_t, c_tilde, c_t, o_t, h_t, concat = cache
# Backprop through output gate
do_t = dh_t * np.tanh(c_t)
do_t_raw = do_t * o_t * (1 - o_t)
# Backprop through cell state
dc_t = dc_t + dh_t * o_t * (1 - np.tanh(c_t) ** 2)
# Backprop through gates
df_t = dc_t * c_prev
df_t_raw = df_t * f_t * (1 - f_t)
di_t = dc_t * c_tilde
di_t_raw = di_t * i_t * (1 - i_t)
dc_tilde = dc_t * i_t
dc_tilde_raw = dc_tilde * (1 - c_tilde ** 2)
# Gradients w.r.t. weights
dW_f = np.dot(concat.T, df_t_raw)
dW_i = np.dot(concat.T, di_t_raw)
dW_c = np.dot(concat.T, dc_tilde_raw)
dW_o = np.dot(concat.T, do_t_raw)
# Gradients for previous step
dconcat = (np.dot(df_t_raw, self.W_f.T) + np.dot(di_t_raw, self.W_i.T) +
np.dot(dc_tilde_raw, self.W_c.T) + np.dot(do_t_raw, self.W_o.T))
dh_prev = dconcat[:, :self.hidden_size]
dc_prev = dc_t * f_t
# Update weights
self.W_f -= learning_rate * dW_f
self.W_i -= learning_rate * dW_i
self.W_c -= learning_rate * dW_c
self.W_o -= learning_rate * dW_o
return dh_prev, dc_prev
class GRUCell:
def __init__(self, input_size, hidden_size):
self.input_size = input_size
self.hidden_size = hidden_size
concat_size = input_size + hidden_size
self.W_z = np.random.randn(concat_size, hidden_size) * np.sqrt(2.0 / concat_size)
self.W_r = np.random.randn(concat_size, hidden_size) * np.sqrt(2.0 / concat_size)
self.W = np.random.randn(concat_size, hidden_size) * np.sqrt(2.0 / concat_size)
self.b_z = np.zeros((1, hidden_size))
self.b_r = np.zeros((1, hidden_size))
self.b = np.zeros((1, hidden_size))
def sigmoid(self, x):
return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
def tanh(self, x):
return np.tanh(x)
def forward(self, x_t, h_prev):
concat = np.hstack([h_prev, x_t])
# Update gate
z_t = self.sigmoid(np.dot(concat, self.W_z) + self.b_z)
# Reset gate
r_t = self.sigmoid(np.dot(concat, self.W_r) + self.b_r)
# Candidate hidden state
concat_reset = np.hstack([r_t * h_prev, x_t])
h_tilde = self.tanh(np.dot(concat_reset, self.W) + self.b)
# Hidden state update
h_t = (1 - z_t) * h_prev + z_t * h_tilde
return h_t, (x_t, h_prev, z_t, r_t, h_tilde, concat, concat_reset)
# Demonstration
lstm = LSTMCell(input_size=10, hidden_size=20)
gru = GRUCell(input_size=10, hidden_size=20)
# Single time step\x_t = np.random.randn(1, 10)
h_prev = np.zeros((1, 20))
c_prev = np.zeros((1, 20))
h_lstm, c_lstm, _ = lstm.forward(x_t, h_prev, c_prev)
h_gru, _ = gru.forward(x_t, h_prev)
print(f'LSTM output shape: {h_lstm.shape}')
print(f'GRU output shape: {h_gru.shape}')
Using Libraries (torch, torch.nn, tensorflow, keras)
import torch
import torch.nn as nn
# PyTorch LSTM Implementation
class LSTMModel(nn.Module):
def __init__(self, input_size, hidden_size, num_layers, num_classes, dropout=0.5, bidirectional=False):
super(LSTMModel, self).__init__()
self.hidden_size = hidden_size
self.num_layers = num_layers
self.bidirectional = bidirectional
self.num_directions = 2 if bidirectional else 1
# LSTM layer
self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
batch_first=True, dropout=dropout if num_layers > 1 else 0,
bidirectional=bidirectional)
# Dropout
self.dropout = nn.Dropout(dropout)
# Fully connected layer
self.fc = nn.Linear(hidden_size * self.num_directions, num_classes)
def forward(self, x):
# Initialize hidden and cell states
h0 = torch.zeros(self.num_layers * self.num_directions, x.size(0), self.hidden_size).to(x.device)
c0 = torch.zeros(self.num_layers * self.num_directions, x.size(0), self.hidden_size).to(x.device)
# Forward propagate LSTM
out, _ = self.lstm(x, (h0, c0)) # out: tensor of shape (batch, seq_length, hidden_size)
# Decode the hidden state of the last time step
out = out[:, -1, :]
out = self.dropout(out)
out = self.fc(out)
return out
# PyTorch GRU Implementation
class GRUModel(nn.Module):
def __init__(self, input_size, hidden_size, num_layers, num_classes, dropout=0.5, bidirectional=False):
super(GRUModel, self).__init__()
self.hidden_size = hidden_size
self.num_layers = num_layers
self.bidirectional = bidirectional
self.num_directions = 2 if bidirectional else 1
self.gru = nn.GRU(input_size, hidden_size, num_layers,
batch_first=True, dropout=dropout if num_layers > 1 else 0,
bidirectional=bidirectional)
self.dropout = nn.Dropout(dropout)
self.fc = nn.Linear(hidden_size * self.num_directions, num_classes)
def forward(self, x):
h0 = torch.zeros(self.num_layers * self.num_directions, x.size(0), self.hidden_size).to(x.device)
out, _ = self.gru(x, h0)
out = out[:, -1, :]
out = self.dropout(out)
out = self.fc(out)
return out
# Usage example
vocab_size = 10000
embedding_dim = 128
hidden_size = 256
num_layers = 2
num_classes = 5 # e.g., sentiment classes
# For text classification with embeddings
class TextClassifier(nn.Module):
def __init__(self):
super(TextClassifier, self).__init__()
self.embedding = nn.Embedding(vocab_size, embedding_dim)
self.lstm = nn.LSTM(embedding_dim, hidden_size, num_layers,
batch_first=True, dropout=0.5, bidirectional=True)
self.fc = nn.Linear(hidden_size * 2, num_classes)
def forward(self, x):
embedded = self.embedding(x)
lstm_out, (hidden, cell) = self.lstm(embedded)
# Concatenate final forward and backward hidden states
hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)
return self.fc(hidden)
model = TextClassifier()
# Sample input: batch_size=2, sequence_length=20
sample_input = torch.randint(0, vocab_size, (2, 20))
output = model(sample_input)
print(f'Output shape: {output.shape}') # [2, 5]
# TensorFlow/Keras Implementation
import tensorflow as tf
from tensorflow.keras import layers, Model
def create_lstm_model(vocab_size, embedding_dim, hidden_units, num_classes):
inputs = tf.keras.Input(shape=(None,))
x = layers.Embedding(vocab_size, embedding_dim)(inputs)
x = layers.LSTM(hidden_units, return_sequences=False, dropout=0.5)(x)
x = layers.Dense(64, activation='relu')(x)
outputs = layers.Dense(num_classes, activation='softmax')(x)
return Model(inputs, outputs)
def create_bidirectional_lstm(vocab_size, embedding_dim, hidden_units, num_classes):
inputs = tf.keras.Input(shape=(None,))
x = layers.Embedding(vocab_size, embedding_dim)(inputs)
x = layers.Bidirectional(layers.LSTM(hidden_units, return_sequences=True, dropout=0.5))(x)
x = layers.Bidirectional(layers.LSTM(hidden_units // 2, dropout=0.5))(x)
outputs = layers.Dense(num_classes, activation='softmax')(x)
return Model(inputs, outputs)
# Stacked LSTM for sequence generation
class SequenceGenerator(nn.Module):
def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
super(SequenceGenerator, self).__init__()
self.embed = nn.Embedding(vocab_size, embed_size)
self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
self.fc = nn.Linear(hidden_size, vocab_size)
def forward(self, x, hidden=None):
embedded = self.embed(x)
output, hidden = self.lstm(embedded, hidden)
output = self.fc(output)
return output, hidden
def generate(self, start_token, max_length=100, temperature=1.0):
current = torch.tensor([[start_token]])
hidden = None
output_tokens = [start_token]
for _ in range(max_length):
logits, hidden = self.forward(current, hidden)
logits = logits[:, -1, :] / temperature
probs = torch.softmax(logits, dim=-1)
current = torch.multinomial(probs, num_samples=1)
output_tokens.append(current.item())
return output_tokens
When to Use
✅ Appropriate Use Cases:
- Sequential data where order matters (time series, text, speech)
- When input/output sequences have variable lengths
- Problems requiring memory of past context (language modeling)
- Small to medium-sized datasets where Transformers would overfit
- Resource-constrained environments (mobile, edge devices)
- When strong sequential inductive bias is beneficial
❌ Avoid When:
- Long sequences (> 1000 tokens) where attention mechanisms perform better
- When parallel computation is critical (RNNs are inherently sequential)
- Tasks where global context matters more than local patterns
- When you have abundant compute and data (Transformers usually win)
- Problems that can be framed as non-sequential (use feedforward networks)
- When training time is limited (RNNs train slower than CNNs on fixed-size inputs)
Common Pitfalls
- Not initializing hidden states properly causing training instability
- Forgetting to pack padded sequences leading to computation on padding tokens
- Using too many layers causing overfitting on small sequence datasets
- Not using teacher forcing during sequence generation training
- Applying dropout to recurrent connections (should only apply to non-recurrent)
- Using ReLU instead of tanh in vanilla RNN (causes exploding gradients)
- Not handling variable-length sequences with masking or packing
- Bidirectional RNNs in autoregressive generation (leaks future information)