Model Versioning with MLflow and DVC

Definition

Model versioning is the practice of systematically tracking and managing different versions of machine learning models throughout their lifecycle, similar to how Git tracks code changes. MLflow is an open-source platform for the complete ML lifecycle including experiment tracking, model packaging, and model registry. DVC (Data Version Control) extends Git to handle large data files, ML models, and pipelines that don't fit well in traditional version control. Together, MLflow and DVC solve complementary problems: MLflow manages the model lifecycle from experimentation to deployment, while DVC versions datasets and tracks data lineage. Model versioning ensures reproducibility by capturing the exact code, data, and parameters used to create each model version. This enables rollback to previous versions, A/B testing between model variants, and maintaining multiple model versions in production. The registry pattern separates model development from deployment, allowing data scientists to push new versions while ops teams manage production deployments.

Intuition

Imagine you're writing a novel, but instead of saving each draft, you just overwrite the same file. You'd lose all your previous work and couldn't compare versions. Now imagine you also had external research files (datasets) too large to email. Model versioning is like having a smart filing system that: saves every draft (model versions), tracks which research files (data) went into each draft, remembers the writing conditions (hyperparameters), and lets publishers (production) pick which draft to print while you keep writing new ones.

Mathematical Formula

Reproducibility Equation:

\(M_v = f(C_v, D_v, P_v)\)

where the subscript v marks the specific version of each component that produced model version v

Step-by-Step Explanation:

\(M_v\): model version v (reproducible artifact)
\(C_v\): code version at time of training
\(D_v\): dataset version used for training
\(P_v\): hyperparameters and configuration
\(f\): training function (deterministic only when random seeds and the software environment are fixed)
Git tracks \(C_v\), DVC tracks \(D_v\), MLflow tracks \(P_v\) and \(M_v\)
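
One way to make the equation concrete is to derive a single identifier from the three inputs, so that any change to code, data, or parameters yields a new version ID. The sketch below is illustrative only: `version_fingerprint` and the example values are hypothetical, and neither MLflow nor DVC is claimed to compute version IDs this way.

```python
import hashlib
import json


def version_fingerprint(code_rev, data_hash, params):
    """Derive a short ID from the three inputs of M_v = f(C_v, D_v, P_v)."""
    payload = json.dumps(
        {"code": code_rev, "data": data_hash, "params": params},
        sort_keys=True,  # key order must not change the fingerprint
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical inputs: a Git commit, a dataset hash, and hyperparameters
base = version_fingerprint("abc123", "d41d8c", {"lr": 0.001, "epochs": 100})
same = version_fingerprint("abc123", "d41d8c", {"epochs": 100, "lr": 0.001})
tuned = version_fingerprint("abc123", "d41d8c", {"lr": 0.01, "epochs": 100})

print(base == same)   # same inputs, different key order
print(base == tuned)  # one hyperparameter changed
```

Sorting the keys before hashing matters here: without it, two runs with identical parameters listed in a different order would look like different versions.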

Real-World Use Cases

Fintech

Fraud model rollback: When a deployed fraud detection model starts underperforming after a data drift event, quickly roll back to the previous stable version while investigating the issue, keeping business impact to a minimum.
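
The rollback idea can be sketched as a pointer to the live version plus a history of earlier promotions. This is a toy, in-memory illustration; `ProductionPointer` and the version names are hypothetical, and in practice a model registry would hold this state.

```python
class ProductionPointer:
    """Tracks which version is live and which ones were live before it."""

    def __init__(self):
        self._history = []  # versions promoted to production, oldest first

    @property
    def current(self):
        return self._history[-1] if self._history else None

    def promote(self, version):
        self._history.append(version)

    def rollback(self):
        """Drop the live version and return the one that preceded it."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        self._history.pop()
        return self.current


pointer = ProductionPointer()
pointer.promote("v1")
pointer.promote("v2")          # suppose v2 degrades after a drift event
restored = pointer.rollback()  # production points at the previous version again
print(restored)
```

Because only the pointer moves, the degraded version's artifacts stay available for investigation.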

Healthcare

Regulatory audit trail: Maintain complete version history of diagnostic models with exact dataset versions and training parameters for FDA approval and compliance reviews.

E-commerce

A/B testing recommendations: Run two model versions simultaneously, v2.1 for 90% of users and experimental v2.2 for 10%. The MLflow registry records which version fills each role, while the serving layer performs the actual traffic split.
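
A 90/10 split like this is usually made deterministic per user, so the same person always sees the same model version. The sketch below shows one common approach, hashing the user ID into a bucket. The function name `assign_variant` and the `user-<n>` ID format are assumptions for illustration, not part of MLflow.

```python
import hashlib


def assign_variant(user_id, canary_fraction=0.10):
    """Map a user to 'v2.2' (experimental) or 'v2.1' (stable), deterministically."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "v2.2" if bucket < canary_fraction else "v2.1"


# Simulate 10,000 hypothetical users and measure the experimental share
assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
canary_share = assignments.count("v2.2") / len(assignments)
print(f"experimental share: {canary_share:.3f}")
```

Hashing rather than calling a random number generator keeps assignments stable across requests and servers without storing any per-user state.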

Autonomous Vehicles

Safety-critical model updates: Version every perception model with associated training data, sensor calibration parameters, and validation results for traceability in incident investigations.

Implementation

Manual Implementation (No Libraries)

Manual versioning is brittle and doesn't scale. Teams end up with scattered model files, inconsistent metadata, and no way to reliably reproduce experiments. MLflow and DVC provide structured solutions.

# Manual model versioning - brittle
import pickle
import json
import os
from datetime import datetime

# Create version directory
version = 'v1.2.3'
os.makedirs(f'models/{version}', exist_ok=True)

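# Save model (placeholder object; substitute your trained estimator)
model = {'weights': [0.1, 0.2]}
with open(f'models/{version}/model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Save metadata manually
# (every value below is typed in by hand; nothing checks it against the real run)
metadata = {
    'version': version,
    'created': datetime.now().isoformat(),
    'accuracy': 0.95,
    'training_data': 'data/train_v2.csv',
    'hyperparameters': {'lr': 0.001, 'epochs': 100},
    'git_commit': 'abc123'  # placeholder commit hash
}
with open(f'models/{version}/metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)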

# Problems:
# - No central registry
# - Manual metadata tracking is error-prone
# - No data versioning
# - Can't track large files in Git
# - No model promotion workflow
# - Hard to reproduce exactly
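
Some of the hand-typed fields above can at least be captured automatically. The following sketch hashes the training file and reads the current Git commit instead of hardcoding them. The helper names `file_sha256` and `current_git_commit` are hypothetical, and the temporary file merely stands in for a real dataset.

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_git_commit():
    """Return the checked-out commit hash, or None outside a Git repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


# Demo on a throwaway file standing in for the training data
with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "train.csv"
    data_path.write_bytes(b"id,label\n1,0\n2,1\n")
    metadata = {
        "training_data_sha256": file_sha256(data_path),
        "git_commit": current_git_commit(),
    }
print(metadata)
```

This narrows the reproducibility gap but still leaves the other problems listed above (no registry, no promotion workflow, no storage for large files), which is where MLflow and DVC come in.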

Using Libraries (mlflow, dvc)

# MLflow Model Tracking and Registry
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

# Start MLflow tracking server:
# mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./artifacts

# Set tracking URI
mlflow.set_tracking_uri('http://localhost:5000')
mlflow.set_experiment('customer-churn')

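# Load data and hold out a validation set
# (the file path and the 'churned' label column are placeholders)
df = pd.read_csv('data/train.csv')
X, y = df.drop(columns=['churned']), df['churned']
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Training with MLflow logging
params = {'n_estimators': 100, 'max_depth': 10, 'min_samples_split': 5}
with mlflow.start_run():
    # Log parameters
    mlflow.log_params(params)
    
    # Train model (fixed seed so the run can be repeated)
    model = RandomForestClassifier(**params, random_state=42)
    model.fit(X_train, y_train)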
    
    # Log metrics
    train_acc = model.score(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    mlflow.log_metric('train_accuracy', train_acc)
    mlflow.log_metric('val_accuracy', val_acc)
    
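    # Log the model and register it as a new version of 'churn-predictor'
    # (newer MLflow releases rename artifact_path to name; check your version)
    mlflow.sklearn.log_model(
        model,
        artifact_path='model',
        registered_model_name='churn-predictor'
    )
    
    # Log artifacts (both files must already exist on disk,
    # e.g. written by your evaluation code)
    mlflow.log_artifact('confusion_matrix.png')
    mlflow.log_artifact('feature_importance.csv')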

# Model Registry Operations
from mlflow.tracking import MlflowClient

client = MlflowClient()

# List registered models
for registered in client.search_registered_models():
    print(f'Model: {registered.name}')

# Get the latest version in each stage
# (stages are deprecated in recent MLflow releases in favor of aliases)
for mv in client.get_latest_versions('churn-predictor'):
    print(f'Version {mv.version}: stage={mv.current_stage}, status={mv.status}')

# Transition model stage
client.transition_model_version_stage(
    name='churn-predictor',
    version='3',
    stage='Staging'
)

# Load specific version
model_uri = 'models:/churn-predictor/3'
# or 'models:/churn-predictor/Staging'
model = mlflow.sklearn.load_model(model_uri)

# DVC for Data Versioning
# Initialize DVC
# dvc init

# Track dataset
# dvc add data/train.csv
# (creates data/train.csv.dvc and adds the data file to data/.gitignore)
# git add data/train.csv.dvc data/.gitignore
# git commit -m 'Add training data v1'

# Python DVC integration
import dvc.api

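# Resolve the storage URL of a specific data version
data_url = dvc.api.get_url(
    path='data/train.csv',
    repo='https://github.com/org/repo',
    rev='v1.0'  # Git tag or commit
)

# Read that version's contents without checking out the repo
with dvc.api.open(
    'data/train.csv', repo='https://github.com/org/repo', rev='v1.0'
) as f:
    train_df = pd.read_csv(f)

# Or use the Repo class (an internal API that may change between releases)
from dvc.repo import Repo

repo = Repo('.')

# Download the data version referenced by the current Git checkout
repo.pull(targets=['data/train.csv'])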

# DVC Pipeline Definition (dvc.yaml)
"""
stages:
  prepare:
    cmd: python src/prepare.py data/raw.csv data/prepared.csv
    deps:
      - data/raw.csv
      - src/prepare.py
    outs:
      - data/prepared.csv
  
  train:
    cmd: python src/train.py data/prepared.csv model.pkl
    deps:
      - data/prepared.csv
      - src/train.py
    outs:
      - model.pkl
    metrics:
      - metrics.json:
          cache: false
"""

# Run DVC pipeline
# dvc repro

# Push to remote storage
# dvc remote add -d myremote s3://mybucket/dvc
# dvc push
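
The reason `dvc repro` can skip unchanged stages is that it compares the current state of each dependency with what was recorded on the last run. The sketch below imitates that check in a simplified form. The helpers `file_hash` and `stale_deps` and the in-memory `lock` dictionary are hypothetical stand-ins, and DVC's real lock file format and hashing details differ.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path):
    """Content hash of a file (SHA-256 here; chosen only for the sketch)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def stale_deps(deps, locked_hashes):
    """Return the dependencies whose content no longer matches the record."""
    return [d for d in deps if locked_hashes.get(d) != file_hash(d)]


with tempfile.TemporaryDirectory() as tmp:
    raw = str(Path(tmp) / "raw.csv")
    Path(raw).write_text("id,label\n1,0\n")

    # Record the hash after a (pretend) successful pipeline run
    lock = {raw: file_hash(raw)}
    unchanged = stale_deps([raw], lock)  # nothing changed, stage can be skipped

    # Append a row: the stage depending on raw.csv must now rerun
    Path(raw).write_text("id,label\n1,0\n2,1\n")
    changed = stale_deps([raw], lock)

print(len(unchanged), len(changed))
```

Comparing content hashes rather than timestamps means that touching a file without changing it does not trigger a rerun.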

When to Use

✅ Appropriate Use Cases:

Multiple model versions in production (A/B testing)
Need for model rollback capabilities
Team collaboration on model development
Regulatory compliance requiring audit trails
Large datasets that don't fit in Git
Reproducible ML pipelines
Model promotion workflows (dev → staging → prod)
Tracking data lineage alongside models
Sharing models across teams or organizations
Automated model deployment pipelines
Experiment tracking with artifact storage

❌ Avoid When:

Single model with no versioning needs
Very small datasets (Git handles them fine)
Prototypes with no production path
A managed platform (SageMaker, Vertex AI) already provides the versioning you need
Teams with simpler needs (just pickle + Git LFS)
Proof-of-concept projects without collaboration
Simple scripts with no data dependencies

Common Pitfalls

Not versioning data alongside models (reproducibility gap)
Using MLflow without setting up artifact storage
Forgetting to log model dependencies (Python version, library versions)
Not setting up a DVC remote for team collaboration
Overwriting production models without staging
Not tagging Git commits when registering models
Ignoring data drift between model versions
Not backing up the DVC cache
Conflating experiment tracking with the model registry (not every logged run should become a registered version)
Not cleaning up old model versions (storage costs)
Not testing that a model loads before registering it
Hardcoding model paths instead of using registry URIs
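
To avoid the last pitfall, model locations can be built from the registry's URI scheme rather than from file paths. The helper below is a small sketch: `registry_uri` is a hypothetical function, the `champion` alias is an example name, and the `@alias` form assumes an MLflow release that supports model aliases.

```python
def registry_uri(name, version=None, alias=None):
    """Build a 'models:/' URI from a version number or an alias, never a path."""
    if (version is None) == (alias is None):
        raise ValueError("pass exactly one of version or alias")
    if version is not None:
        return f"models:/{name}/{version}"
    return f"models:/{name}@{alias}"


pinned = registry_uri("churn-predictor", version=3)
floating = registry_uri("churn-predictor", alias="champion")
print(pinned)
print(floating)
```

A pinned URI suits audits and reproduction, while an alias lets deployment code keep the same URI as new versions are promoted behind it.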