Streaming Data: Chunking, Iterators, and Large Dataset Handling

Advanced Data Loading
~6 min read Data Loading

Definition

Streaming data processing is an approach to handling datasets that are too large to fit into memory or data that arrives continuously over time. Unlike batch processing, which requires the entire dataset to be available before processing, streaming processes data incrementally in chunks or as individual records arrive. In Python, this is implemented through iterator patterns (yielding data piece by piece), chunked reading (loading fixed-size segments), and generator functions that produce values on-demand. Key techniques include using pandas' chunksize parameter for reading files in segments, implementing custom generators for memory-efficient iteration, using Python's itertools for stream operations, and applying windowing or tumbling windows for time-series streams. For truly massive datasets or continuous streams, this approach extends to frameworks like Apache Spark Streaming, Kafka consumers, and Dask for parallel stream processing. The core principle is maintaining minimal state - processing each chunk and either discarding it or aggregating results, never holding the entire dataset in memory.

Intuition

💡

Think of streaming data processing like drinking from a fire hose versus filling a bucket first. Batch processing is like waiting to fill a bucket with water before taking a drink - you need a big enough bucket (memory) and must wait for it to fill completely. Streaming is like drinking directly from the hose in gulps - you take manageable sips, swallow, and take the next sip, never needing to hold more than a mouthful at once. If you're counting red marbles in a never-ending stream, you don't collect all marbles first; you simply increment a counter each time you see a red one and let the marble pass through. Similarly, if you're processing a 100GB log file, you don't load it all into RAM; you read 10,000 lines at a time, process them, save results, and move to the next chunk. The memory footprint stays constant regardless of total data size. For time-based streams (like sensor readings), imagine looking through a sliding window - at any moment you only see the last 5 minutes of data, discarding older readings as time progresses. This is the essence of streaming: process-what-you-can-hold, then move on.

Mathematical Formula

Streaming mean update:
\[ mu_n = mu_{n-1} + (x_n - mu_{n-1}) / n \]

Step-by-Step Explanation:

  1. Streaming algorithms maintain constant memory O(k) regardless of total dataset size N, where k is the chunk or window size
  2. Windowed average computes the mean over a sliding window of the last w values, discarding old values as new ones arrive
  3. Streaming mean can be updated incrementally without storing all values; new mean equals old mean plus the deviation of new value normalized by count
  4. Welford's algorithm updates variance in one pass without storing all values, essential for streaming statistics

Real-World Use Cases

Tech

Log processing pipelines stream server logs in real-time to detect anomalies. Instead of storing terabytes of logs, they process each log line as it arrives, maintaining rolling statistics (error rates, response times) in memory and triggering alerts when thresholds are breached.

Finance

High-frequency trading systems stream market data (tick-by-tick prices) to calculate moving averages and volatility. They use sliding windows to maintain VWAP (Volume Weighted Average Price) calculations, discarding old ticks as new ones arrive to keep memory bounded.

Manufacturing

IoT sensor networks on factory floors stream temperature, pressure, and vibration readings. Edge devices process streams locally using windowed aggregations to detect equipment anomalies in real-time without transmitting raw data to the cloud.

Implementation

Manual Implementation (No Libraries)

The manual implementation demonstrates core streaming concepts. StreamingStats implements Welford's algorithm for calculating mean and variance in one pass without storing values. StreamingStatsWindowed uses a deque with maxlen to maintain a sliding window of recent values. The read_chunks generator yields file segments without loading the entire file. These patterns show how streaming maintains constant memory regardless of input size, with each element processed and potentially discarded before the next arrives.
from collections import deque
import math

class StreamingStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0
    
    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        delta2 = x - self.mean
        self.m2 += delta * delta2
    
    @property
    def variance(self):
        return self.m2 / self.n if self.n > 0 else 0.0
    
    @property
    def std(self):
        return math.sqrt(self.variance)

class StreamingStatsWindowed:
    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)
        self._sum = 0.0
    
    def update(self, x):
        if len(self.window) == self.window.maxlen:
            self._sum -= self.window[0]
        self.window.append(x)
        self._sum += x
    
    @property
    def mean(self):
        return self._sum / len(self.window) if self.window else 0

# Generator for chunked file reading
def read_chunks(filename, chunk_size=1000):
    with open(filename) as f:
        chunk = []
        for line in f:
            chunk.append(line.strip())
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

# Usage
stats = StreamingStats()
for value in [10, 20, 30, 40, 50]:
    stats.update(value)
print(f'Mean: {stats.mean}, Std: {stats.std}')

Using Libraries (pandas, numpy, dask)

import pandas as pd
import numpy as np

# Create sample large CSV
np.random.seed(42)
for i in range(0, 100000, 10000):
    chunk = pd.DataFrame({'id': range(i, i+10000), 'value': np.random.randn(10000)})
    chunk.to_csv('large_data.csv', mode='a', header=(i==0), index=False)

# Method 1: Chunked reading with pandas
chunk_stats = []
for chunk_num, chunk in enumerate(pd.read_csv('large_data.csv', chunksize=10000)):
    stats = {'chunk': chunk_num, 'mean': chunk['value'].mean(), 'std': chunk['value'].std()}
    chunk_stats.append(stats)
print(f'Processed {len(chunk_stats)} chunks')

# Method 2: Aggregate across chunks
total_sum = 0
total_count = 0
for chunk in pd.read_csv('large_data.csv', chunksize=10000):
    total_sum += chunk['value'].sum()
    total_count += len(chunk)
overall_mean = total_sum / total_count
print(f'Overall mean: {overall_mean:.6f}')

# Method 3: Dask for out-of-core processing
# import dask.dataframe as dd
# ddf = dd.read_csv('large_data.csv')
# result = ddf['value'].mean().compute()

# Cleanup
import os
os.remove('large_data.csv')

When to Use

✅ Appropriate Use Cases:

  • Processing files larger than available RAM - use chunked reading to maintain constant memory usage
  • Real-time data processing where data arrives continuously - use streaming with generators or message queues
  • ETL pipelines processing multi-GB datasets - use chunking with intermediate aggregation
  • Time-series analysis on streaming sensor data - use sliding windows to maintain rolling statistics
  • Log processing and monitoring - process lines as they arrive without storing entire logs
  • Large-scale data transformations before ML training - stream data through preprocessing pipelines

❌ Avoid When:

  • Small datasets that fit comfortably in memory - chunking adds complexity without benefit
  • When you need random access to entire dataset - streaming provides sequential access only
  • Operations requiring global sorting of the entire dataset - streaming can't easily maintain global order
  • Complex graph algorithms requiring full graph structure - need entire dataset in memory
  • When data dependencies exist between distant records - streaming works best with local dependencies
  • Quick exploratory analysis on sample data - use standard pandas for simplicity

Common Pitfalls

  • Accumulating results across chunks without bound - can still run out of memory. Use disk-based aggregation or maintain only summary statistics.
  • Not handling the last chunk properly - files may not divide evenly by chunk size. Always process the final partial chunk.
  • Processing chunks independently without considering state - for time-series or ordered data, you may need to track state between chunks.
  • Not handling chunk boundaries that split logical units - ensure chunks don't break related records by using appropriate boundaries.
  • Using default dtypes causing memory bloat - specify dtypes when reading chunks to minimize memory usage per chunk.
  • Not closing file handles properly - use context managers to ensure resources are released, especially in long-running streaming processes.