Categorical Data Analysis: Analyzing Discrete Variables
Definition
Categorical data analysis encompasses methods for exploring, summarizing, and testing relationships involving discrete variables. Unlike continuous data, categorical data consists of distinct groups or categories. These variables can be nominal (no inherent order) or ordinal (natural ordering). Key techniques include frequency tables, cross-tabulation, chi-square tests, bar charts, and mosaic plots.
Intuition
Think of categorical analysis like organizing a party. You have different groups of guests (categories). You count how many are in each group (frequency table). You might wonder if certain groups have different preferences (chi-square test). Bar charts are like seating charts - you can instantly see which groups are largest.
Mathematical Formula
Step-by-Step Explanation:
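- Proportion: The relative frequency of a category, its count divided by the total number of observations. It ranges from 0 to 1.
- Expected frequency: The count each cell would have if the two variables were independent, computed as the row total times the column total divided by the grand total.
- Chi-square: The sum over all cells of the squared difference between observed and expected counts, divided by the expected count.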
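The three quantities listed above are conventionally written as follows. The symbols are notation assumed here rather than taken from the original page: n_i is the count in category i, N the total number of observations, R_i and C_j the row and column totals, and O_ij and E_ij the observed and expected counts in cell (i, j).

```latex
\[
p_i = \frac{n_i}{N}, \qquad
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]
```

For a table with r rows and c columns, the statistic is compared against a chi-square distribution with (r - 1)(c - 1) degrees of freedom.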
Real-World Use Cases
Analyzing survey responses about brand preference by age group. Cross-tabulation shows which demographics prefer which brands.
Testing if treatment outcomes differ between drug and placebo groups using 2x2 contingency tables.
Analyzing purchase categories by customer segment. Cross-tabs reveal which segments buy which products.
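The cross-tabulations mentioned in these use cases can be sketched in plain Python. The helper name cross_tab and the survey data below are hypothetical, made up purely for illustration.

```python
from collections import Counter


def cross_tab(rows, cols):
    """Count co-occurrences of paired categorical observations."""
    if len(rows) != len(cols):
        raise ValueError("rows and cols must have the same length")
    pair_counts = Counter(zip(rows, cols))
    row_levels = sorted(set(rows))
    col_levels = sorted(set(cols))
    # Nested dict: table[row_level][col_level] -> count (0 if never observed)
    return {r: {c: pair_counts[(r, c)] for c in col_levels} for r in row_levels}


# Hypothetical survey responses: one age group and one preferred brand per person
age_group = ['18-34', '18-34', '35-54', '35-54', '55+', '18-34', '55+', '35-54']
brand = ['X', 'Y', 'X', 'X', 'Y', 'X', 'Y', 'Y']

table = cross_tab(age_group, brand)
for group, counts in table.items():
    print(group, counts)
```

Each row of the result is one age group and each inner key is one brand, which is the same layout a contingency table uses.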
Implementation
Manual Implementation (No Libraries)
from collections import Counter


def frequency_table(data):
    """Return the count and proportion of each category in data."""
    counts = Counter(data)
    total = len(data)
    return {cat: {'freq': count, 'prop': count / total} for cat, count in counts.items()}


def chi_square_2x2(observed):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    a, b = observed[0]
    c, d = observed[1]
    n = a + b + c + d
    # Expected count under independence = row total * column total / grand total
    expected_a = (a + b) * (a + c) / n
    expected_b = (a + b) * (b + d) / n
    expected_c = (c + d) * (a + c) / n
    expected_d = (c + d) * (b + d) / n
    # Sum of (observed - expected)^2 / expected over the four cells
    chi2 = ((a - expected_a) ** 2 / expected_a + (b - expected_b) ** 2 / expected_b +
            (c - expected_c) ** 2 / expected_c + (d - expected_d) ** 2 / expected_d)
    return chi2


fruits = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
print(frequency_table(fruits))

observed = [[30, 20], [25, 35]]
print(f'Chi-square: {chi_square_2x2(observed):.2f}')
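The manual implementation stops at the test statistic. As a minimal sketch, a p-value for the 2x2 case can also be obtained without libraries, because a chi-square variable with 1 degree of freedom is the square of a standard normal variable. The function name chi_square_p_value_df1 is hypothetical, and the approach applies only when the degrees of freedom equal 1.

```python
import math


def chi_square_p_value_df1(chi2):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom.

    Uses P(Z^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z.
    """
    if chi2 < 0:
        raise ValueError("chi-square statistic cannot be negative")
    return math.erfc(math.sqrt(chi2 / 2))


# Statistic for the table [[30, 20], [25, 35]]: 1 + 1 + 5/6 + 5/6 = 11/3
chi2_stat = 11 / 3
p_value = chi_square_p_value_df1(chi2_stat)
print(f'p-value: {p_value:.4f}')  # about 0.056
```

At the conventional 0.05 level this table falls just short of significance, which the statistic alone does not reveal.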
Using Libraries (pandas, numpy, scipy, matplotlib)
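import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulate 100 observations of two unrelated categorical variables
np.random.seed(42)
df = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C'], 100),
    'outcome': np.random.choice(['Success', 'Failure'], 100)
})

# Frequency table for a single variable
print(df['category'].value_counts())

# Cross-tabulation of the two variables
crosstab = pd.crosstab(df['category'], df['outcome'])
print(crosstab)

# Chi-square test of independence; also returns degrees of freedom and expected counts
chi2, p, dof, expected = stats.chi2_contingency(crosstab)
print(f'Chi-square: {chi2:.4f}, p-value: {p:.4f}')

# Grouped bar chart of outcome counts per category
crosstab.plot(kind='bar')
plt.title('Categorical Comparison')
plt.show()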
When to Use
✅ Appropriate Use Cases:
- Frequency tables: Summarizing a single categorical variable
- Cross-tabulation: Examining relationship between two categorical variables
- Chi-square test: Testing if variables are independent
- Bar charts: Comparing frequencies across categories
❌ Avoid When:
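- Do not rely on the chi-square test when expected cell frequencies fall below 5; the approximation becomes unreliable, and for 2x2 tables Fisher's exact test is a common alternative
- Do not treat ordinal data as continuous without first considering whether the gaps between levels are comparable
- Do not use pie charts when precise comparison matters; differences between slices are harder to judge than differences between bar heights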
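The expected-frequency rule in the list above can be checked before running a test. The sketch below uses hypothetical helper names (expected_counts, small_expected_cells) and a made-up table; the threshold of 5 is the rule of thumb stated above, not a hard limit.

```python
def expected_counts(observed):
    """Expected cell counts under independence for an r x c table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def small_expected_cells(observed, threshold=5):
    """Return (row, col, expected) for each cell whose expected count is below threshold."""
    expected = expected_counts(observed)
    return [(i, j, e)
            for i, row in enumerate(expected)
            for j, e in enumerate(row)
            if e < threshold]


# Hypothetical small-sample table
table = [[3, 9], [2, 16]]
flagged = small_expected_cells(table)
if flagged:
    print('Chi-square approximation may be unreliable for cells:', flagged)
else:
    print('All expected counts meet the threshold')
```

Here two cells have expected counts of 2 and 3, so the table would be flagged even though one of those cells has the smallest observed count and the other does not; the rule concerns expected counts, not observed ones.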
Common Pitfalls
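- Low expected frequencies: the chi-square approximation becomes unreliable when cells have small expected counts
- Simpson's paradox: an association seen in the combined table can weaken or reverse once the data is split by a third variable
- Confounding variables: an apparent relationship between two categorical variables may be driven by a third variable that was not included in the table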
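Simpson's paradox is easiest to see with numbers. The counts below are hypothetical, chosen only to produce the reversal: treatment A has the higher success rate within each subgroup, yet the lower rate once the subgroups are pooled, because A was mostly given to severe cases.

```python
# Hypothetical (successes, trials) per treatment within two severity subgroups
data = {
    'mild':   {'A': (8, 10),   'B': (70, 100)},
    'severe': {'A': (20, 100), 'B': (1, 10)},
}


def subgroup_rates(data):
    """Success rate of each treatment within each subgroup."""
    return {name: {trt: s / t for trt, (s, t) in group.items()}
            for name, group in data.items()}


def pooled_rates(data):
    """Success rate of each treatment after summing counts across subgroups."""
    totals = {}
    for group in data.values():
        for trt, (s, t) in group.items():
            prev_s, prev_t = totals.get(trt, (0, 0))
            totals[trt] = (prev_s + s, prev_t + t)
    return {trt: s / t for trt, (s, t) in totals.items()}


within = subgroup_rates(data)
pooled = pooled_rates(data)
for name, rates in within.items():
    print(name, rates)
print('pooled', pooled)
```

Severity acts as the confounding variable here: it affects both which treatment is given and how likely success is, so the pooled table alone is misleading.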