Refactoring ML Code: From Notebooks to Production

Introduction

Jupyter notebooks are great for experimentation and exploration, but they’re not ideal for production code. Notebooks often contain everything in one place, with data loading, preprocessing, model training, and evaluation all mixed together. This makes code hard to test, reuse, and maintain.

In this post, I’ll walk through refactoring a typical ML notebook into clean, modular, production-ready Python code. We’ll apply software engineering principles like separation of concerns, dependency injection, and proper error handling. We’ll also use Pydantic v2 for data validation and type safety throughout the codebase. By the end, you’ll have a codebase that’s testable, maintainable, and ready for deployment.

The complete code for this project is available on GitHub: ml-refactoring-example

The Problem with Notebook Code

Let’s start with a typical notebook example. This code trains a sentiment classifier, but it has several issues:

# Everything in one cell - typical notebook style
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import pickle

# Load data
df = pd.read_csv('sentiment_data.csv')
print(f"Loaded {len(df)} rows")

# Preprocess
df['text'] = df['text'].str.lower()
df['text'] = df['text'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)

# Split data
X = df['text']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression()
model.fit(X_train_vec, y_train)

# Evaluate
y_pred = model.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))

# Save model
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

Problems with this code:

  • Everything lives in one cell, so individual components can't be tested in isolation
  • Hard-coded file paths and parameters
  • No error handling
  • No logging
  • Preprocessing logic is difficult to reuse
  • No easy way to swap in a different model or vectorizer

Refactoring Strategy

We’ll refactor this into a modular structure with:

  1. Data loading module - Handles data I/O
  2. Preprocessing module - Text cleaning and transformation
  3. Model training module - Model creation and training
  4. Evaluation module - Metrics and reporting
  5. Configuration - Centralized settings
  6. Main script - Orchestrates everything

Step-by-Step Refactoring

1. Configuration Management

First, let’s extract all configuration into a separate file using Pydantic v2:

# config.py
from pydantic import BaseModel, Field, field_validator
from pathlib import Path

class Config(BaseModel):
    """Configuration model with validation."""
    
    # Data paths
    data_path: str = Field(default="data/sentiment_data.csv", description="Path to training data")
    model_dir: Path = Field(default=Path("models"), description="Directory for saved models")
    
    # Model parameters
    test_size: float = Field(default=0.2, gt=0.0, lt=1.0, description="Test set proportion")
    random_state: int = Field(default=42, description="Random seed for reproducibility")
    max_features: int = Field(default=5000, gt=0, description="Maximum features for vectorizer")
    
    # Output paths
    model_filename: str = Field(default="sentiment_classifier.pkl", description="Model filename")
    vectorizer_filename: str = Field(default="vectorizer.pkl", description="Vectorizer filename")
    
    @field_validator('model_dir', mode='after')
    @classmethod
    def create_model_dir(cls, v: Path) -> Path:
        """Create model directory if it doesn't exist."""
        v.mkdir(parents=True, exist_ok=True)
        return v
    
    @field_validator('data_path')
    @classmethod
    def validate_data_path(cls, v: str) -> str:
        """Validate data path exists."""
        if not Path(v).exists():
            raise ValueError(f"Data file not found: {v}")
        return v
    
    model_config = {
        "frozen": False,  # Allow mutation if needed
        "validate_assignment": True,  # Validate on assignment
    }

Benefits:

  • All configuration in one place
  • Automatic validation of values (test_size between 0 and 1, max_features > 0)
  • Can load from environment variables or config files (see the sketch below)
  • Type-safe with Pydantic models
  • Better error messages for invalid configuration
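
Environment-based loading isn't part of the project itself, but as a sketch: if you install the separate pydantic-settings package (an assumption, not used in the original code), the same fields can be overridden via environment variables:

# settings.py (hypothetical) -- sketch, requires the pydantic-settings package
from pathlib import Path
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    """Same fields as Config, but values can also come from the environment."""
    model_config = SettingsConfigDict(env_prefix="SENTIMENT_")

    data_path: str = "data/sentiment_data.csv"
    model_dir: Path = Path("models")
    test_size: float = 0.2
    max_features: int = 5000

# e.g. SENTIMENT_TEST_SIZE=0.3 python train.py would override test_size
settings = Settings()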

2. Data Loading Module

Extract data loading into a dedicated module with Pydantic validation:

# data_loader.py
import pandas as pd
from pathlib import Path
from typing import Tuple
import logging
from pydantic import BaseModel, Field, field_validator

logger = logging.getLogger(__name__)

class DataConfig(BaseModel):
    """Configuration for data loading."""
    file_path: str = Field(..., description="Path to data file")
    text_column: str = Field(default="text", description="Name of text column")
    label_column: str = Field(default="label", description="Name of label column")
    
    @field_validator('file_path')
    @classmethod
    def validate_file_exists(cls, v: str) -> str:
        """Validate that file exists."""
        path = Path(v)
        if not path.exists():
            raise FileNotFoundError(f"Data file not found: {v}")
        return v

class SplitConfig(BaseModel):
    """Configuration for train-test split."""
    test_size: float = Field(default=0.2, gt=0.0, lt=1.0)
    random_state: int = Field(default=42)

def load_data(config: DataConfig) -> pd.DataFrame:
    """Load data from CSV file with validation."""
    try:
        path = Path(config.file_path)
        df = pd.read_csv(path)
        
        # Validate required columns exist
        if config.text_column not in df.columns:
            raise ValueError(f"Text column '{config.text_column}' not found in data")
        if config.label_column not in df.columns:
            raise ValueError(f"Label column '{config.label_column}' not found in data")
        
        logger.info(f"Loaded {len(df)} rows from {config.file_path}")
        return df
    except Exception as e:
        logger.error(f"Error loading data: {e}")
        raise

def split_data(
    X: pd.Series, 
    y: pd.Series, 
    config: SplitConfig
) -> Tuple[pd.Series, pd.Series, pd.Series, pd.Series]:
    """Split data into training and testing sets."""
    from sklearn.model_selection import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=config.test_size, random_state=config.random_state
    )
    logger.info(f"Split data: {len(X_train)} train, {len(X_test)} test")
    return X_train, X_test, y_train, y_test

Benefits:

  • Reusable data loading logic
  • Proper error handling (see the example below)
  • Logging for debugging
  • Type hints for clarity
  • Easy to test
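
Because DataConfig validates its inputs when it is constructed, bad paths fail fast with a readable error instead of surfacing deep inside the pipeline. A small usage sketch (file names here are illustrative):

# example usage of the data loader (illustrative paths)
from data_loader import DataConfig, load_data

df = load_data(DataConfig(file_path="data/sentiment_data.csv"))

# a missing file is rejected as soon as the config is built
try:
    DataConfig(file_path="does_not_exist.csv")
except Exception as exc:
    print(exc)  # the file_path validator raises before any pandas work happens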

3. Preprocessing Module

Separate preprocessing into its own module with Pydantic models:

# preprocessing.py
import pandas as pd
import re
from typing import Callable
import logging
from pydantic import BaseModel, Field, field_validator

logger = logging.getLogger(__name__)

class TextInput(BaseModel):
    """Input model for text validation."""
    text: str = Field(..., min_length=1, description="Text to preprocess")
    
    @field_validator('text')
    @classmethod
    def validate_text(cls, v: str) -> str:
        """Validate and clean text input."""
        if not v or not v.strip():
            raise ValueError("Text cannot be empty")
        return v.strip()

def clean_text(text: str) -> str:
    """Clean text by lowercasing and removing special characters."""
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

def preprocess_dataframe(df: pd.DataFrame, text_column: str = 'text') -> pd.DataFrame:
    """Preprocess text column in dataframe."""
    df = df.copy()
    df[text_column] = df[text_column].apply(clean_text)
    logger.info(f"Preprocessed {len(df)} rows")
    return df

class TextPreprocessor:
    """Preprocessor class for dependency injection."""
    
    def __init__(self, cleaning_fn: Callable[[str], str] = clean_text):
        self.cleaning_fn = cleaning_fn
    
    def transform(self, text: str) -> str:
        """Transform a single text with validation."""
        # Validate input using Pydantic
        validated = TextInput(text=text)
        return self.cleaning_fn(validated.text)
    
    def transform_batch(self, texts: pd.Series) -> pd.Series:
        """Transform a series of texts."""
        return texts.apply(self.transform)

Benefits:

  • Preprocessing logic is reusable
  • Can inject different cleaning functions (example below)
  • Easy to test individual functions
  • Can extend with more preprocessing steps
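
The cleaning_fn argument is the dependency-injection hook. As a sketch, a stricter cleaner (the digit-stripping rule here is purely illustrative) can be swapped in without touching the class:

# example: injecting a custom cleaning function (illustrative)
import re
from preprocessing import TextPreprocessor, clean_text

def clean_text_no_digits(text: str) -> str:
    """Reuse the default cleaner, then also strip digits."""
    return re.sub(r'\d+', '', clean_text(text))

preprocessor = TextPreprocessor(cleaning_fn=clean_text_no_digits)
print(preprocessor.transform("Order #42 was GREAT!"))  # lowercased, punctuation and digits removed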

4. Model Training Module

Create a flexible model training module:

# model_trainer.py
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator
from typing import Tuple, Any
import logging
import pickle
from pathlib import Path

logger = logging.getLogger(__name__)

class ModelTrainer:
    """Trainer class for ML models."""
    
    def __init__(
        self,
        vectorizer: Any = None,
        model: BaseEstimator | None = None,
        max_features: int = 5000
    ):
        self.vectorizer = vectorizer or TfidfVectorizer(max_features=max_features)
        self.model = model or LogisticRegression()
        self.is_fitted = False
    
    def fit(self, X_train: pd.Series, y_train: pd.Series) -> 'ModelTrainer':
        """Train the model."""
        logger.info("Fitting vectorizer and model...")
        X_train_vec = self.vectorizer.fit_transform(X_train)
        self.model.fit(X_train_vec, y_train)
        self.is_fitted = True
        logger.info("Model training complete")
        return self
    
    def transform(self, X: pd.Series) -> Any:
        """Transform text to features."""
        if not self.is_fitted:
            raise ValueError("Model must be fitted before transform")
        return self.vectorizer.transform(X)
    
    def predict(self, X: pd.Series) -> Any:
        """Make predictions."""
        X_vec = self.transform(X)
        return self.model.predict(X_vec)
    
    def save(self, model_path: Path, vectorizer_path: Path) -> None:
        """Save model and vectorizer."""
        with open(model_path, 'wb') as f:
            pickle.dump(self.model, f)
        with open(vectorizer_path, 'wb') as f:
            pickle.dump(self.vectorizer, f)
        logger.info(f"Saved model to {model_path} and vectorizer to {vectorizer_path}")
    
    @classmethod
    def load(cls, model_path: Path, vectorizer_path: Path) -> Tuple[Any, Any]:
        """Load model and vectorizer."""
        with open(model_path, 'rb') as f:
            model = pickle.load(f)
        with open(vectorizer_path, 'rb') as f:
            vectorizer = pickle.load(f)
        logger.info(f"Loaded model from {model_path} and vectorizer from {vectorizer_path}")
        return model, vectorizer

Benefits:

  • Can inject different vectorizers or models (example below)
  • Encapsulates training logic
  • Easy to extend with new model types
  • Proper state management (is_fitted flag)
  • Save/load functionality
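
Because the vectorizer and model are constructor arguments, trying a different estimator is a one-line change. A self-contained sketch (LinearSVC and CountVectorizer are just one alternative pairing):

# example: injecting a different vectorizer and model (illustrative)
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from model_trainer import ModelTrainer

X_train = pd.Series(["good movie", "bad movie", "great film", "terrible film"])
y_train = pd.Series([1, 0, 1, 0])

trainer = ModelTrainer(
    vectorizer=CountVectorizer(ngram_range=(1, 2)),
    model=LinearSVC(),
)
trainer.fit(X_train, y_train)  # same fit/predict/save API as before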

5. Evaluation Module

Separate evaluation logic:

# evaluator.py
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from typing import Any, Dict
import logging

logger = logging.getLogger(__name__)

class ModelEvaluator:
    """Evaluate model performance."""
    
    @staticmethod
    def evaluate(y_true: Any, y_pred: Any) -> Dict[str, Any]:
        """Calculate evaluation metrics."""
        accuracy = accuracy_score(y_true, y_pred)
        report = classification_report(y_true, y_pred, output_dict=True)
        
        metrics = {
            'accuracy': accuracy,
            'classification_report': report
        }
        
        logger.info(f"Model accuracy: {accuracy:.4f}")
        return metrics
    
    @staticmethod
    def print_report(y_true: Any, y_pred: Any) -> None:
        """Print detailed classification report."""
        print(classification_report(y_true, y_pred))
        print("\nConfusion Matrix:")
        print(confusion_matrix(y_true, y_pred))

Benefits:

  • Reusable evaluation logic
  • Can extend with more metrics (sketch below)
  • Separates evaluation from training
  • Easy to test
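
Extending the evaluator is just a matter of adding entries to the returned metrics dictionary. A minimal sketch adding macro-averaged F1 on top of the existing metrics:

# example: extending ModelEvaluator with an extra metric (sketch)
from typing import Any, Dict
from sklearn.metrics import f1_score
from evaluator import ModelEvaluator

class ExtendedEvaluator(ModelEvaluator):
    """Adds macro-averaged F1 on top of the base metrics."""

    @staticmethod
    def evaluate(y_true: Any, y_pred: Any) -> Dict[str, Any]:
        metrics = ModelEvaluator.evaluate(y_true, y_pred)
        metrics['f1_macro'] = f1_score(y_true, y_pred, average='macro')
        return metrics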

6. Main Script

Now we can create a clean main script that orchestrates everything:

# train.py
import logging
from pathlib import Path
from config import Config
from data_loader import load_data, split_data, DataConfig, SplitConfig
from preprocessing import preprocess_dataframe
from model_trainer import ModelTrainer
from evaluator import ModelEvaluator

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

def main():
    """Main training pipeline."""
    # Load and validate configuration
    config = Config()
    
    # Load data with validation
    data_config = DataConfig(file_path=config.data_path)
    df = load_data(data_config)
    
    # Preprocess
    df = preprocess_dataframe(df, text_column=data_config.text_column)
    
    # Prepare features and labels
    X = df[data_config.text_column]
    y = df[data_config.label_column]
    
    # Split data with validation
    split_config = SplitConfig(
        test_size=config.test_size,
        random_state=config.random_state
    )
    X_train, X_test, y_train, y_test = split_data(X, y, split_config)
    
    # Train model
    trainer = ModelTrainer(max_features=config.max_features)
    trainer.fit(X_train, y_train)
    
    # Evaluate
    y_pred = trainer.predict(X_test)
    evaluator = ModelEvaluator()
    metrics = evaluator.evaluate(y_test, y_pred)
    evaluator.print_report(y_test, y_pred)
    
    # Save model
    model_path = config.model_dir / config.model_filename
    vectorizer_path = config.model_dir / config.vectorizer_filename
    trainer.save(model_path, vectorizer_path)
    
    logging.info("Training pipeline complete")

if __name__ == "__main__":
    main()

Benefits:

  • Clean, readable main script
  • Easy to understand the pipeline
  • Each step is modular and testable
  • Can easily add new steps

Project Structure

After refactoring, your project structure looks like this:

sentiment-classifier/
├── config.py           # Configuration
├── data_loader.py      # Data loading
├── preprocessing.py    # Text preprocessing
├── model_trainer.py    # Model training
├── evaluator.py        # Model evaluation
├── train.py           # Main training script
├── predict.py         # Inference script (we'll add this)
└── tests/             # Unit tests
    ├── test_preprocessing.py
    ├── test_model_trainer.py
    └── test_evaluator.py

Adding Tests

Now that the code is modular, we can easily write tests:

# tests/test_preprocessing.py
import pytest
from preprocessing import clean_text, preprocess_dataframe
import pandas as pd

def test_clean_text():
    """Test text cleaning function."""
    assert clean_text("Hello World!") == "hello world"
    assert clean_text("Test@123") == "test123"
    assert clean_text("") == ""

def test_preprocess_dataframe():
    """Test dataframe preprocessing."""
    df = pd.DataFrame({
        'text': ['Hello World!', 'Test@123'],
        'label': [1, 0]
    })
    result = preprocess_dataframe(df)
    assert result['text'].iloc[0] == "hello world"
    assert result['text'].iloc[1] == "test123"

# tests/test_model_trainer.py
import pytest
import pandas as pd
from model_trainer import ModelTrainer

def test_model_trainer_fit():
    """Test model training."""
    X_train = pd.Series(["good movie", "bad movie", "great film"])
    y_train = pd.Series([1, 0, 1])
    
    trainer = ModelTrainer()
    trainer.fit(X_train, y_train)
    
    assert trainer.is_fitted is True

def test_model_trainer_predict():
    """Test model prediction."""
    X_train = pd.Series(["good", "bad"])
    y_train = pd.Series([1, 0])
    X_test = pd.Series(["good movie"])
    
    trainer = ModelTrainer()
    trainer.fit(X_train, y_train)
    predictions = trainer.predict(X_test)
    
    assert len(predictions) == 1
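
The tests/test_evaluator.py file listed in the project structure follows the same pattern; a minimal sketch:

# tests/test_evaluator.py
from evaluator import ModelEvaluator

def test_evaluate_perfect_predictions():
    """Perfect predictions should give accuracy 1.0."""
    y_true = [1, 0, 1, 0]
    y_pred = [1, 0, 1, 0]
    metrics = ModelEvaluator.evaluate(y_true, y_pred)
    assert metrics['accuracy'] == 1.0
    assert 'classification_report' in metrics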

Adding an Inference Script

With the refactored code, creating an inference script is straightforward. We’ll use Pydantic for input validation:

# predict.py
import logging
from pathlib import Path
from model_trainer import ModelTrainer
from preprocessing import TextPreprocessor
from config import Config
from pydantic import BaseModel, Field, field_validator
import pandas as pd

logging.basicConfig(level=logging.INFO)

class PredictionRequest(BaseModel):
    """Request model for predictions."""
    text: str = Field(..., min_length=1, description="Text to classify")
    model_path: Path | None = Field(default=None, description="Optional custom model path")
    
    @field_validator('text')
    @classmethod
    def validate_text(cls, v: str) -> str:
        """Validate text input."""
        if not v or not v.strip():
            raise ValueError("Text cannot be empty")
        return v.strip()

class PredictionResponse(BaseModel):
    """Response model for predictions."""
    text: str = Field(..., description="Original text")
    prediction: str = Field(..., description="Predicted label")
    confidence: float | None = Field(default=None, description="Prediction confidence")

def predict(request: PredictionRequest, config: Config | None = None) -> PredictionResponse:
    """Predict sentiment for a single text with validation."""
    if config is None:
        config = Config()
    
    # Load model and vectorizer
    if request.model_path:
        model_path = request.model_path
        vectorizer_path = request.model_path.parent / config.vectorizer_filename
    else:
        model_path = config.model_dir / config.model_filename
        vectorizer_path = config.model_dir / config.vectorizer_filename
    
    model, vectorizer = ModelTrainer.load(model_path, vectorizer_path)
    
    # Preprocess text with validation
    preprocessor = TextPreprocessor()
    cleaned_text = preprocessor.transform(request.text)
    
    # Predict
    trainer = ModelTrainer(vectorizer=vectorizer, model=model)
    trainer.is_fitted = True  # Mark as fitted since we loaded it
    prediction = trainer.predict(pd.Series([cleaned_text]))[0]
    
    return PredictionResponse(
        text=request.text,
        prediction=str(prediction)  # labels may be ints; the response model expects str
    )

if __name__ == "__main__":
    import sys
    text = sys.argv[1] if len(sys.argv) > 1 else "This movie is great!"
    request = PredictionRequest(text=text)
    result = predict(request)
    print(f"Text: {result.text}")
    print(f"Prediction: {result.prediction}")

Key Refactoring Principles Applied

  1. Separation of Concerns - Each module has a single responsibility
  2. Dependency Injection - Components can be swapped easily
  3. Error Handling - Proper exception handling throughout
  4. Logging - Structured logging for debugging
  5. Type Hints - Better code documentation and IDE support
  6. Testability - Each component can be tested independently
  7. Configuration Management - Centralized configuration
  8. Reusability - Components can be reused in different contexts

Benefits of Refactored Code

Before (Notebook) → After (Refactored):

  • Hard to test → Easy to test each component
  • Hard to reuse → Reusable modules
  • Hard to maintain → Maintainable structure
  • Everything coupled together → Loose coupling
  • No error handling → Proper error handling
  • Hard-coded values → Configurable parameters
  • Not production-ready → Production-ready

Conclusion

Refactoring notebook code into modular Python code is essential for production ML systems. By applying software engineering principles, we create code that’s testable, maintainable, and reusable. The refactored code is easier to debug, extend, and deploy.

While notebooks are great for exploration, production code should be modular, tested, and well-organized. This refactoring approach can be applied to any ML project, making your codebase more professional and maintainable.

Thanks for reading!