File Formats for Data: CSV, Parquet, JSON, Excel, and HDF5

Definition

File formats define how data is structured, stored, and serialized on disk, each optimized for different use cases across the data lifecycle. CSV (Comma-Separated Values) is a plain-text format where each line represents a record and fields are separated by delimiters - simple, human-readable, but inefficient for large datasets and lacking type information. JSON (JavaScript Object Notation) stores data as nested key-value pairs, excellent for semi-structured data and API communication but verbose for tabular data. Parquet is a columnar storage format optimized for analytics, using compression and encoding to achieve high performance for read-heavy analytical workloads. Excel files (.xlsx) store data in worksheets with rich formatting, formulas, and multiple tabs - ideal for business users but limited in scalability. HDF5 (Hierarchical Data Format) is designed for large numerical datasets, supporting hierarchical organization and efficient partial I/O for scientific computing. Understanding these formats' trade-offs in storage efficiency, read/write performance, schema flexibility, and human readability is essential for building efficient data pipelines.

Intuition

💡

Think of file formats as different types of containers for moving and storing your belongings. CSV is like a simple cardboard box with items listed on the outside - easy to open and read, but items get jumbled and the box can become unwieldy if too full. JSON is like a set of labeled plastic bins where each bin can contain smaller labeled bins inside - great for organizing complex, nested items but takes up more space and requires more effort to unpack. Parquet is like a specialized filing cabinet where all documents of the same type are stored together in compressed folders - efficient for finding specific information quickly, but you need the right tools to open it. Excel is like a detailed organizer with colored tabs, summary sheets, and built-in calculators - perfect for presentations but heavy and slow when filled with too much data. HDF5 is like a warehouse with organized sections and the ability to retrieve just one pallet without unloading everything - designed for massive storage with selective access. The key is matching the container to your needs: human readability, storage efficiency, access speed, or compatibility with your tools.

Mathematical Formula

Compression ratio: CR = UncompressedSize / CompressedSize

Step-by-Step Explanation:

Compression ratio measures how much space is saved; higher values indicate better compression (e.g., 10:1 means 10x smaller)
Columnar storage stores each column separately, allowing column-specific compression based on data characteristics (repeated values compress better)
Row storage stores complete records together, which is efficient for reading entire rows but cannot compress as effectively across columns

Real-World Use Cases

Finance

Investment banks use Parquet for storing historical tick data (millions of records per day) in data lakes, achieving 5-10x compression while enabling fast analytical queries. They use CSV only for small exports to Excel for trader review and JSON for API responses to trading systems.

Healthcare

Medical research institutions use HDF5 to store large genomic datasets (GB to TB range) with complex hierarchical structures. Patient records for analysis are stored in Parquet for query performance, while JSON handles HL7 FHIR API data exchange between hospital systems.

Retail

E-commerce platforms use JSON to log user clickstream events from web applications, Parquet for the data warehouse storing transaction history, and Excel exports for monthly business reports shared with non-technical stakeholders.

Implementation

Manual Implementation (No Libraries)

The manual implementation shows fundamental file format operations. CSV reading/writing demonstrates field delimiters and quoting rules. JSON handling shows serialization and deserialization of Python objects. These operations reveal that file formats are essentially contracts for data serialization - CSV for simple tabular data, JSON for structured/hierarchical data. Understanding these basics helps when troubleshooting format-related issues and choosing appropriate formats for specific use cases.

import csv
import json

# Manual CSV writer
def write_csv_manual(data, filename):
    with open(filename, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(data)

# Manual CSV reader
def read_csv_manual(filename):
    with open(filename, 'r') as f:
        reader = csv.reader(f)
        return list(reader)

# Manual JSON handling
data = {'name': 'Alice', 'age': 30, 'skills': ['Python', 'SQL']}
with open('data.json', 'w') as f:
    json.dump(data, f)

with open('data.json', 'r') as f:
    loaded = json.load(f)

Using Libraries (pandas, numpy, pyarrow)

import pandas as pd
import numpy as np

# Create sample data
df = pd.DataFrame({'id': range(1000), 'value': np.random.randn(1000), 'category': np.random.choice(['A','B','C'], 1000)})

# CSV operations
df.to_csv('data.csv', index=False)
df_csv = pd.read_csv('data.csv')

# Parquet operations (most efficient for analytics)
df.to_parquet('data.parquet', compression='snappy')
df_parquet = pd.read_parquet('data.parquet')

# JSON operations
df.to_json('data.json', orient='records')
df_json = pd.read_json('data.json')

# Excel operations (requires openpyxl)
# df.to_excel('data.xlsx', sheet_name='Data')
# df_excel = pd.read_excel('data.xlsx')

# HDF5 operations
df.to_hdf('data.h5', key='data', mode='w')
df_hdf = pd.read_hdf('data.h5', key='data')

# Compare file sizes
import os
for ext in ['csv', 'parquet', 'json']:
    size = os.path.getsize(f'data.{ext}') / 1024
    print(f'{ext}: {size:.2f} KB')

When to Use

✅ Appropriate Use Cases:

CSV when you need human-readable data exchange, compatibility with Excel, or simple data logging
Parquet when storing large datasets for analytics, requiring columnar compression, or building data lakes
JSON when working with API data, semi-structured/nested data, or JavaScript-based systems
Excel when sharing data with business users, creating reports with formatting, or working with multiple related tables
HDF5 when dealing with large numerical arrays, requiring partial/random access, or storing hierarchical scientific data
Feather or ORC when you need fast I/O for temporary storage or data exchange between Python and R

❌ Avoid When:

CSV for large datasets (millions of rows) - slow parsing and no type information preserved
JSON for purely tabular data - verbose format leads to larger files and slower parsing than alternatives
Excel for data processing pipelines - requires external software, slow for large files, limited to ~1M rows
Parquet when you need frequent row-level updates - columnar formats are optimized for read-heavy workloads
HDF5 for simple tabular business data - overkill for non-numerical data; use Parquet instead
Binary formats when data needs to be inspected or edited manually - prefer CSV or JSON for human readability

Common Pitfalls

Assuming CSV will preserve data types - numbers may be read as strings, dates as text. Always specify dtypes when reading CSV.
Not handling encoding issues in CSV files - use encoding='utf-8' or encoding='latin1' and check for encoding errors.
Storing JSON with default orientation for pandas - use orient='records' for most API use cases, not 'columns' or 'index'.
Reading entire Parquet files when only subset of columns needed - use columns parameter to reduce I/O and memory.
Writing Excel files with large datasets causing memory issues - Excel has practical limits around 1 million rows and 16,384 columns.
Not compressing HDF5 files or using inappropriate compression - balance between storage savings and decompression overhead based on access patterns.