Sequential Modeling: RNN, LSTM, and GRU

Advanced Deep Learning
~9 min read Deep Learning

Definition

Recurrent Neural Networks (RNNs) are neural architectures designed to process sequential data by maintaining hidden state that captures information from previous time steps. Unlike feedforward networks, RNNs share parameters across time steps, making them suitable for variable-length inputs. However, standard RNNs suffer from vanishing gradients when learning long-term dependencies. Long Short-Term Memory (LSTM), introduced by Hochreiter and Schmidhuber in 1997, solved this through a sophisticated gating mechanism with forget, input, and output gates. Gated Recurrent Unit (GRU), proposed by Cho in 2014, simplified this with update and reset gates while maintaining similar performance. These architectures dominated sequence modeling until Transformers emerged, and remain important for tasks with strong sequential inductive bias or limited compute.

Intuition

💡

Imagine reading a book one word at a time while trying to remember what happened earlier. A simple RNN is like a person with limited working memory - they can only remember the immediate context and quickly forget what happened chapters ago. LSTM is like a well-organized filing system: it has a 'forget gate' to discard irrelevant information (like the color of a minor character's shirt), an 'input gate' to store important new details (like discovering the villain's identity), and an 'output gate' to decide what to use right now. The cell state acts as a conveyor belt running through time, allowing information to flow unchanged when needed. GRU simplifies this by combining the forget and input gates into an 'update gate' - instead of separately deciding what to forget and what to remember, it decides what fraction of the new information to blend with old memories. Both give the network selective memory, solving the vanishing gradient problem.

Mathematical Formula

Vanilla RNN:
\[ h_t = \tanh(W_h h_{t-1} + W_x x_t + b) \]
LSTM:
\[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \]
\[ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \]
\[ \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \]
\[ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \]
\[ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \]
\[ h_t = o_t \odot \tanhC_t \]
GRU:
\[ z_t = \sigma(W_z \cdot [h_{t-1}, x_t]) \]
\[ r_t = \sigma(W_r \cdot [h_{t-1}, x_t]) \]
\[ \tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t]) \]
\[ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \]
Backpropagation Through Time:
\[ \frac{\partial L}{\partial W} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W} \]

Step-by-Step Explanation:

  1. RNN: Hidden state updates via tanh of linear combination of previous state and current input
  2. LSTM Forget Gate \(f_t\): Sigmoid output 0-1 determining what fraction of cell state to keep
  3. LSTM Input Gate \(i_t\): Sigmoid determining what fraction of candidate values to store
  4. LSTM Candidate \(\tilde{C}_t\): New candidate values created from current input and previous hidden state
  5. LSTM Cell State \(C_t\): Linear interpolation between old state (forgotten) and new candidate (added)
  6. LSTM Output Gate \(o_t\): Sigmoid determining what fraction of cell state to output as hidden state
  7. GRU Update Gate \(z_t\): Controls balance between keeping old state and accepting new candidate
  8. GRU Reset Gate \(r_t\): Determines how much past information to forget when computing candidate
  9. BPTT: Sum gradients across time steps since weights are shared

Real-World Use Cases

Machine Translation

Google Translate (pre-Transformer) using bidirectional LSTM encoder-decoder

Speech Recognition

Apple Siri using LSTM acoustic models for phoneme prediction

Text Generation

Character-level RNNs generating Shakespeare-style prose or code

Time Series Prediction

Stock price forecasting and anomaly detection in financial markets

Music Generation

Google Magenta using LSTM to compose original melodies

Named Entity Recognition

BiLSTM-CRF models extracting person names and organizations from text

Implementation

Manual Implementation (No Libraries)

The LSTMCell implements complete forward pass with all four gates (forget, input, candidate, output) and cell state update. The backward pass implements truncated backpropagation through time. GRUCell shows the simpler two-gate architecture. LSTM maintains separate cell and hidden states while GRU merges them, using update gate to control information flow.
import numpy as np

class LSTMCell:
    def __init__(self, input_size, hidden_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        
        # Xavier initialization
        concat_size = input_size + hidden_size
        self.W_f = np.random.randn(concat_size, hidden_size) * np.sqrt(2.0 / concat_size)
        self.W_i = np.random.randn(concat_size, hidden_size) * np.sqrt(2.0 / concat_size)
        self.W_c = np.random.randn(concat_size, hidden_size) * np.sqrt(2.0 / concat_size)
        self.W_o = np.random.randn(concat_size, hidden_size) * np.sqrt(2.0 / concat_size)
        
        self.b_f = np.zeros((1, hidden_size))
        self.b_i = np.zeros((1, hidden_size))
        self.b_c = np.zeros((1, hidden_size))
        self.b_o = np.zeros((1, hidden_size))
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    
    def tanh(self, x):
        return np.tanh(x)
    
    def forward(self, x_t, h_prev, c_prev):
        # Concatenate previous hidden state and current input
        concat = np.hstack([h_prev, x_t])
        
        # Forget gate
        f_t = self.sigmoid(np.dot(concat, self.W_f) + self.b_f)
        
        # Input gate
        i_t = self.sigmoid(np.dot(concat, self.W_i) + self.b_i)
        
        # Candidate cell state
        c_tilde = self.tanh(np.dot(concat, self.W_c) + self.b_c)
        
        # Cell state update
        c_t = f_t * c_prev + i_t * c_tilde
        
        # Output gate
        o_t = self.sigmoid(np.dot(concat, self.W_o) + self.b_o)
        
        # Hidden state
        h_t = o_t * self.tanh(c_t)
        
        cache = (x_t, h_prev, c_prev, f_t, i_t, c_tilde, c_t, o_t, h_t, concat)
        return h_t, c_t, cache
    
    def backward(self, dh_t, dc_t, cache, learning_rate=0.01):
        x_t, h_prev, c_prev, f_t, i_t, c_tilde, c_t, o_t, h_t, concat = cache
        
        # Backprop through output gate
        do_t = dh_t * np.tanh(c_t)
        do_t_raw = do_t * o_t * (1 - o_t)
        
        # Backprop through cell state
        dc_t = dc_t + dh_t * o_t * (1 - np.tanh(c_t) ** 2)
        
        # Backprop through gates
        df_t = dc_t * c_prev
        df_t_raw = df_t * f_t * (1 - f_t)
        
        di_t = dc_t * c_tilde
        di_t_raw = di_t * i_t * (1 - i_t)
        
        dc_tilde = dc_t * i_t
        dc_tilde_raw = dc_tilde * (1 - c_tilde ** 2)
        
        # Gradients w.r.t. weights
        dW_f = np.dot(concat.T, df_t_raw)
        dW_i = np.dot(concat.T, di_t_raw)
        dW_c = np.dot(concat.T, dc_tilde_raw)
        dW_o = np.dot(concat.T, do_t_raw)
        
        # Gradients for previous step
        dconcat = (np.dot(df_t_raw, self.W_f.T) + np.dot(di_t_raw, self.W_i.T) +
                   np.dot(dc_tilde_raw, self.W_c.T) + np.dot(do_t_raw, self.W_o.T))
        
        dh_prev = dconcat[:, :self.hidden_size]
        dc_prev = dc_t * f_t
        
        # Update weights
        self.W_f -= learning_rate * dW_f
        self.W_i -= learning_rate * dW_i
        self.W_c -= learning_rate * dW_c
        self.W_o -= learning_rate * dW_o
        
        return dh_prev, dc_prev

class GRUCell:
    def __init__(self, input_size, hidden_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        
        concat_size = input_size + hidden_size
        self.W_z = np.random.randn(concat_size, hidden_size) * np.sqrt(2.0 / concat_size)
        self.W_r = np.random.randn(concat_size, hidden_size) * np.sqrt(2.0 / concat_size)
        self.W = np.random.randn(concat_size, hidden_size) * np.sqrt(2.0 / concat_size)
        
        self.b_z = np.zeros((1, hidden_size))
        self.b_r = np.zeros((1, hidden_size))
        self.b = np.zeros((1, hidden_size))
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    
    def tanh(self, x):
        return np.tanh(x)
    
    def forward(self, x_t, h_prev):
        concat = np.hstack([h_prev, x_t])
        
        # Update gate
        z_t = self.sigmoid(np.dot(concat, self.W_z) + self.b_z)
        
        # Reset gate
        r_t = self.sigmoid(np.dot(concat, self.W_r) + self.b_r)
        
        # Candidate hidden state
        concat_reset = np.hstack([r_t * h_prev, x_t])
        h_tilde = self.tanh(np.dot(concat_reset, self.W) + self.b)
        
        # Hidden state update
        h_t = (1 - z_t) * h_prev + z_t * h_tilde
        
        return h_t, (x_t, h_prev, z_t, r_t, h_tilde, concat, concat_reset)

# Demonstration
lstm = LSTMCell(input_size=10, hidden_size=20)
gru = GRUCell(input_size=10, hidden_size=20)

# Single time step\x_t = np.random.randn(1, 10)
h_prev = np.zeros((1, 20))
c_prev = np.zeros((1, 20))

h_lstm, c_lstm, _ = lstm.forward(x_t, h_prev, c_prev)
h_gru, _ = gru.forward(x_t, h_prev)

print(f'LSTM output shape: {h_lstm.shape}')
print(f'GRU output shape: {h_gru.shape}')

Using Libraries (torch, torch.nn, tensorflow, keras)

import torch
import torch.nn as nn

# PyTorch LSTM Implementation
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes, dropout=0.5, bidirectional=False):
        super(LSTMModel, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.bidirectional = bidirectional
        self.num_directions = 2 if bidirectional else 1
        
        # LSTM layer
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, 
                           batch_first=True, dropout=dropout if num_layers > 1 else 0, 
                           bidirectional=bidirectional)
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
        
        # Fully connected layer
        self.fc = nn.Linear(hidden_size * self.num_directions, num_classes)
    
    def forward(self, x):
        # Initialize hidden and cell states
        h0 = torch.zeros(self.num_layers * self.num_directions, x.size(0), self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers * self.num_directions, x.size(0), self.hidden_size).to(x.device)
        
        # Forward propagate LSTM
        out, _ = self.lstm(x, (h0, c0))  # out: tensor of shape (batch, seq_length, hidden_size)
        
        # Decode the hidden state of the last time step
        out = out[:, -1, :]
        out = self.dropout(out)
        out = self.fc(out)
        return out

# PyTorch GRU Implementation
class GRUModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes, dropout=0.5, bidirectional=False):
        super(GRUModel, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.bidirectional = bidirectional
        self.num_directions = 2 if bidirectional else 1
        
        self.gru = nn.GRU(input_size, hidden_size, num_layers, 
                         batch_first=True, dropout=dropout if num_layers > 1 else 0, 
                         bidirectional=bidirectional)
        
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size * self.num_directions, num_classes)
    
    def forward(self, x):
        h0 = torch.zeros(self.num_layers * self.num_directions, x.size(0), self.hidden_size).to(x.device)
        out, _ = self.gru(x, h0)
        out = out[:, -1, :]
        out = self.dropout(out)
        out = self.fc(out)
        return out

# Usage example
vocab_size = 10000
embedding_dim = 128
hidden_size = 256
num_layers = 2
num_classes = 5  # e.g., sentiment classes

# For text classification with embeddings
class TextClassifier(nn.Module):
    def __init__(self):
        super(TextClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, num_layers, 
                           batch_first=True, dropout=0.5, bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, num_classes)
    
    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, (hidden, cell) = self.lstm(embedded)
        # Concatenate final forward and backward hidden states
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)
        return self.fc(hidden)

model = TextClassifier()

# Sample input: batch_size=2, sequence_length=20
sample_input = torch.randint(0, vocab_size, (2, 20))
output = model(sample_input)
print(f'Output shape: {output.shape}')  # [2, 5]

# TensorFlow/Keras Implementation
import tensorflow as tf
from tensorflow.keras import layers, Model

def create_lstm_model(vocab_size, embedding_dim, hidden_units, num_classes):
    inputs = tf.keras.Input(shape=(None,))
    x = layers.Embedding(vocab_size, embedding_dim)(inputs)
    x = layers.LSTM(hidden_units, return_sequences=False, dropout=0.5)(x)
    x = layers.Dense(64, activation='relu')(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return Model(inputs, outputs)

def create_bidirectional_lstm(vocab_size, embedding_dim, hidden_units, num_classes):
    inputs = tf.keras.Input(shape=(None,))
    x = layers.Embedding(vocab_size, embedding_dim)(inputs)
    x = layers.Bidirectional(layers.LSTM(hidden_units, return_sequences=True, dropout=0.5))(x)
    x = layers.Bidirectional(layers.LSTM(hidden_units // 2, dropout=0.5))(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return Model(inputs, outputs)

# Stacked LSTM for sequence generation
class SequenceGenerator(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(SequenceGenerator, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)
    
    def forward(self, x, hidden=None):
        embedded = self.embed(x)
        output, hidden = self.lstm(embedded, hidden)
        output = self.fc(output)
        return output, hidden
    
    def generate(self, start_token, max_length=100, temperature=1.0):
        current = torch.tensor([[start_token]])
        hidden = None
        output_tokens = [start_token]
        
        for _ in range(max_length):
            logits, hidden = self.forward(current, hidden)
            logits = logits[:, -1, :] / temperature
            probs = torch.softmax(logits, dim=-1)
            current = torch.multinomial(probs, num_samples=1)
            output_tokens.append(current.item())
        
        return output_tokens

When to Use

✅ Appropriate Use Cases:

  • Sequential data where order matters (time series, text, speech)
  • When input/output sequences have variable lengths
  • Problems requiring memory of past context (language modeling)
  • Small to medium-sized datasets where Transformers would overfit
  • Resource-constrained environments (mobile, edge devices)
  • When strong sequential inductive bias is beneficial

❌ Avoid When:

  • Long sequences (> 1000 tokens) where attention mechanisms perform better
  • When parallel computation is critical (RNNs are inherently sequential)
  • Tasks where global context matters more than local patterns
  • When you have abundant compute and data (Transformers usually win)
  • Problems that can be framed as non-sequential (use feedforward networks)
  • When training time is limited (RNNs train slower than CNNs on fixed-size inputs)

Common Pitfalls

  • Not initializing hidden states properly causing training instability
  • Forgetting to pack padded sequences leading to computation on padding tokens
  • Using too many layers causing overfitting on small sequence datasets
  • Not using teacher forcing during sequence generation training
  • Applying dropout to recurrent connections (should only apply to non-recurrent)
  • Using ReLU instead of tanh in vanilla RNN (causes exploding gradients)
  • Not handling variable-length sequences with masking or packing
  • Bidirectional RNNs in autoregressive generation (leaks future information)