Relationship Plots: Visualizing Associations Between Variables

Definition

Relationship plots visualize associations between two or more variables, revealing patterns, trends, clusters, and outliers that numerical summaries alone cannot capture. A correlation coefficient quantifies only the strength of a linear relationship; a plot shows the full picture, including non-linear patterns, heteroscedasticity (spread in y that changes with x), influential outliers, and clusters that indicate subgroups.
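To make the point about non-linear patterns concrete, here is a minimal sketch on synthetic data. The quadratic relationship, the noise level, and the seed are illustrative assumptions, not taken from a real dataset. The dependence of y on x is strong, yet the Pearson correlation comes out close to zero because the relationship is symmetric rather than linear.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# x is symmetric around zero; y depends on x strongly but not linearly
x = np.linspace(-3, 3, 201)
y = x**2 + rng.normal(0, 0.1, x.size)

# Pearson correlation only measures linear association
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A scatter plot of the same data would show a clear U shape, which is exactly the kind of structure the coefficient alone hides.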

Intuition

Imagine you are a detective examining clues. Each variable is a witness. A scatter plot places two witnesses' testimonies side by side, so you can see whether they align, contradict each other, or are unrelated. The plot shows more than agreement: whether the relationship is straight or curved, and whether there are suspicious outliers.

Mathematical Formula

Linear Regression:
\[ \quad y = \beta_0 + \beta_1 x + \epsilon \]
Slope:
\[ \quad \beta_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} \]
R-squared:
\[ \quad R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2} \]

Step-by-Step Explanation:

  1. Linear regression: Fits the straight line that minimizes the sum of squared residuals.
  2. Slope: Expected change in y per unit change in x.
  3. R-squared: Proportion of the variance in y explained by the fitted line.
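The three formulas above can be traced step by step in code. The following is a minimal sketch on synthetic data; the true slope of 2, the noise scale, and the variable names are assumptions made for illustration. The intercept uses the standard least-squares result that the fitted line passes through the point of means.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)  # assumed true slope of 2 plus noise

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations over sum of squared x-deviations
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: the least-squares line passes through (x_bar, y_bar)
beta0 = y_bar - beta1 * x_bar

# R-squared: 1 minus residual sum of squares over total sum of squares
y_hat = beta0 + beta1 * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y_bar) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={beta1:.3f}, intercept={beta0:.3f}, R^2={r_squared:.3f}")
```

For a single predictor, R-squared equals the square of the Pearson correlation between x and y, which gives a quick way to check the calculation.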

Real-World Use Cases

Real Estate

Scatter plots reveal the relationship between house size and price, often showing non-linear patterns.

Healthcare

Scatter plots of BMI vs blood pressure reveal clusters and relationships.

Finance

Correlation heatmaps of asset returns guide portfolio diversification.

Implementation

Manual Implementation (No Libraries)

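The manual implementation computes the least-squares slope and intercept directly from the formulas, without a regression library; NumPy is used only for means and random data, and matplotlib for plotting. The slope tells us how much y changes for each unit change in x.
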
import numpy as np
import matplotlib.pyplot as plt

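def linear_regression(x, y):
    # Ordinary least squares for a single predictor
    x_mean = np.mean(x)
    y_mean = np.mean(y)
    # Slope: sum of cross-deviations divided by sum of squared x-deviations
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum((xi - x_mean)**2 for xi in x)
    # Intercept: the fitted line passes through (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return slope, intercept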

np.random.seed(42)
x = np.random.randn(100)
y = 2*x + np.random.randn(100) * 0.5
slope, intercept = linear_regression(x, y)
print(f'y = {slope:.2f}x + {intercept:.2f}')
plt.scatter(x, y, alpha=0.6)
plt.plot(x, slope*x + intercept, 'r-')
plt.savefig('scatter_regression.png')  # saved to the current working directory

Using Libraries (numpy, pandas, matplotlib, seaborn)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)
x = np.random.randn(100)
# y depends linearly on x plus noise, so the regression line is meaningful
df = pd.DataFrame({'x': x, 'y': 2*x + np.random.randn(100) * 0.5})

sns.regplot(data=df, x='x', y='y', scatter_kws={'alpha':0.5})
plt.title('Scatter with Regression')
plt.show()

sns.heatmap(df.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.show()

When to Use

✅ Appropriate Use Cases:

  • Scatter plot: When exploring relationships between two continuous variables
  • Hexbin plot: When a dataset is large enough that points overlap heavily in a scatter plot
  • Bubble chart: When adding a third dimension through size
  • Heatmap: When examining relationships among many variables
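As an illustration of the hexbin case, the sketch below draws the same large synthetic dataset twice: once as a scatter plot with transparency and once as a hexbin plot. The sample size, grid size, colormap, and output filename are all illustrative assumptions; the non-interactive backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 50_000
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)

fig, (ax_scatter, ax_hex) = plt.subplots(1, 2, figsize=(10, 4))

# Left: scatter with small, nearly transparent markers
ax_scatter.scatter(x, y, s=2, alpha=0.05)
ax_scatter.set_title("Scatter with transparency")

# Right: hexbin counts how many points fall into each hexagonal cell
hb = ax_hex.hexbin(x, y, gridsize=40, cmap="viridis")
ax_hex.set_title("Hexbin")
fig.colorbar(hb, ax=ax_hex, label="points per bin")

fig.savefig("hexbin_vs_scatter.png")  # hypothetical output filename
```

The hexbin panel encodes density as color, so the shape of the relationship stays readable where individual markers would merge into a solid blob.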

❌ Avoid When:

  • Do not use scatter plots with more than about 10,000 points without transparency or binning
  • Do not assume linearity from scatter plots
  • Do not confuse correlation with causation

Common Pitfalls

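  • Overplotting with large datasets: dense point clouds hide the underlying pattern unless you add transparency or switch to a hexbin plot
  • Ignoring outliers: a few extreme points can distort a fitted line and the correlation coefficient
  • Confusing correlation with causation: an association in a plot does not show that one variable drives the other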
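The effect of ignoring outliers is easy to demonstrate. The sketch below fits a line to synthetic data with and without one extreme point; the data, the outlier's coordinates, and the use of `np.polyfit` for the fit are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
x = rng.uniform(0, 10, size=30)
y = 2 * x + rng.normal(scale=1.0, size=30)  # assumed true slope of 2

# Degree-1 polyfit returns [slope, intercept]
slope_clean = np.polyfit(x, y, 1)[0]

# Add a single high-leverage point far from the rest of the data
x_out = np.append(x, 30.0)
y_out = np.append(y, 0.0)
slope_out = np.polyfit(x_out, y_out, 1)[0]

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with outlier:    {slope_out:.2f}")
```

One point out of 31 is enough to pull the fitted slope well away from the value that describes the other 30, which is why it helps to look at the scatter plot before trusting the fitted line.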