Pandas Basics: DataFrames and Data Loading
Definition
Pandas is an open-source Python library that provides high-performance, easy-to-use data structures and tools for data analysis. At its core, pandas introduces two primary data structures: the DataFrame and the Series. A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Think of it as a spreadsheet or SQL table where each column can contain different data types (integers, strings, floats, dates). A Series is a one-dimensional labeled array capable of holding any data type, essentially representing a single column of data. Pandas excels at handling structured data, providing functionality for reading from various file formats (CSV, Excel, JSON, Parquet), data cleaning, transformation, aggregation, and visualization preparation. The library builds on top of NumPy, leveraging its array computing capabilities while adding the convenience of labeled indexing and heterogeneous data type support.
Intuition
Imagine a DataFrame as a well-organized Excel spreadsheet or a database table. Each row represents an observation or record (like a customer transaction), and each column represents a feature or attribute (like customer name, purchase amount, or date). The Series is like a single column from that spreadsheet - a list of values that all share the same type, but with the added benefit of labels. Just as you might look up a cell in Excel by its column letter and row number, pandas lets you access data by meaningful names. For example, instead of remembering that column 3 contains 'sales figures', you can simply refer to df['sales_figures']. This labeled approach makes code more readable and reduces errors. The power of pandas comes from its ability to perform operations on entire columns or rows at once, much faster than looping through data manually, while maintaining the intuitive row-and-column structure we're familiar with from spreadsheets.
Real-World Use Cases
A financial analyst uses pandas to load daily stock price CSV files, calculate moving averages, and identify trading signals. DataFrames allow them to handle millions of rows of tick data efficiently while performing time-series analysis.
Hospital administrators use pandas to load patient records from electronic health record (EHR) systems exported as Excel files. They clean and merge demographic data with treatment outcomes to analyze the effectiveness of different protocols.
An e-commerce company loads transaction data from their database into pandas DataFrames to analyze customer purchasing patterns, calculate monthly revenue trends, and identify top-selling products by region.
Implementation
Manual Implementation (No Libraries)
class ManualSeries:
def __init__(self, data, index=None, name=None):
if index is None:
index = list(range(len(data)))
if len(data) != len(index):
raise ValueError('Data and index must have same length')
self.data = dict(zip(index, data))
self.index = index
self.name = name
def __getitem__(self, key):
return self.data[key]
def mean(self):
values = [v for v in self.data.values() if isinstance(v, (int, float))]
return sum(values) / len(values) if values else 0
class ManualDataFrame:
def __init__(self, data, index=None, columns=None):
self.columns = columns if columns else list(data.keys())
self.index = index if index else list(range(len(next(iter(data.values())))))
self._data = {}
for col in self.columns:
col_data = data.get(col, [])
self._data[col] = ManualSeries(col_data, index=self.index, name=col)
def __getitem__(self, key):
if isinstance(key, str):
return self._data[key]
elif isinstance(key, list):
new_data = {k: [self._data[k].data[idx] for idx in self.index] for k in key}
return ManualDataFrame(new_data, index=self.index, columns=key)
def head(self, n=5):
new_index = self.index[:n]
new_data = {col: [self._data[col].data[idx] for idx in new_index] for col in self.columns}
return ManualDataFrame(new_data, index=new_index, columns=self.columns)
def shape(self):
return (len(self.index), len(self.columns))
def describe(self):
stats = {}
for col in self.columns:
series = self._data[col]
values = [v for v in series.data.values() if isinstance(v, (int, float))]
if values:
stats[col] = {'count': len(values), 'mean': sum(values) / len(values), 'min': min(values), 'max': max(values)}
return stats
# Example usage
data = {'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35], 'salary': [50000, 60000, 75000]}
df = ManualDataFrame(data)
print(df.head(2))
print(df.describe())
Using Libraries (pandas, numpy)
import pandas as pd
import numpy as np
# Creating DataFrames
data_dict = {'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35], 'salary': [50000, 60000, 75000]}
df = pd.DataFrame(data_dict)
# Reading files
# df_csv = pd.read_csv('data.csv')
# df_excel = pd.read_excel('data.xlsx')
# Data inspection
print(df.shape)
print(df.columns)
print(df.head())
print(df.describe())
# Selection and filtering
print(df['name']) # Single column
print(df[['name', 'salary']]) # Multiple columns
print(df.loc[0]) # Row by index
print(df[df['age'] > 30]) # Boolean filtering
# Aggregation
print(df.groupby('department')['salary'].mean())
# Handling missing data
df_with_na = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8]})
print(df_with_na.dropna())
print(df_with_na.fillna(0))
When to Use
✅ Appropriate Use Cases:
- Loading and exploring structured data from CSV, Excel, or databases
- Performing data cleaning, transformation, and preprocessing for ML pipelines
- Analyzing tabular data with operations like filtering, grouping, and aggregations
- Working with time-series data and performing date-based operations
- Merging and joining datasets from multiple sources
- Generating summary statistics and exploratory data analysis (EDA)
❌ Avoid When:
- Processing purely numerical array data without labels - use NumPy for better performance
- Building production ML models - use scikit-learn, TensorFlow, or PyTorch directly
- Real-time streaming data processing - use Apache Spark, Kafka, or specialized streaming frameworks
- Very large datasets that don't fit in memory - consider Dask, Polars, or chunking strategies
- Complex graph data - use NetworkX or graph databases
- Unstructured text processing - use NLTK, spaCy, or transformers directly
Common Pitfalls
- Modifying a DataFrame while iterating over it - this can lead to unpredictable behavior. Instead, use vectorized operations or create a new DataFrame.
- Using chained indexing like df[df['col'] > 0]['other_col'] = value - this may not work as expected. Use .loc[] or .copy() for safe assignment.
- Ignoring the SettingWithCopyWarning - this warning indicates you may be modifying a view instead of the original DataFrame, leading to silent data loss.
- Loading entire large files without chunking - this can exhaust memory. Use read_csv(chunksize=...) for large files.
- Not handling missing values explicitly - NaN values can propagate silently through calculations. Always check .isna() and decide on a strategy.
- Forgetting that many pandas operations return new DataFrames - operations don't modify in-place unless inplace=True is specified or reassigned.