Refactoring ML Code: From Notebooks to Production

Introduction

Jupyter notebooks are great for experimentation and exploration, but they’re not ideal for production code. Notebooks often contain everything in one place, with data loading, preprocessing, model training, and evaluation all mixed together. This makes code hard to test, reuse, and maintain.

In this post, I’ll walk through refactoring a typical ML notebook into clean, modular, production-ready Python code. We’ll apply software engineering principles like separation of concerns, dependency injection, and proper error handling. We’ll also use Pydantic v2 for data validation and type safety throughout the codebase. By the end, you’ll have a codebase that’s testable, maintainable, and ready for deployment.

The complete code for this project is available on GitHub: ml-refactoring-example

The Problem with Notebook Code

Let’s start with a typical notebook example. This code trains a sentiment classifier, but it has several issues:

# Everything in one cell - typical notebook style
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import pickle

# Load data
df = pd.read_csv('sentiment_data.csv')
print(f"Loaded {len(df)} rows")

# Preprocess
df['text'] = df['text'].str.lower()
df['text'] = df['text'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)

# Split data
X = df['text']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression()
model.fit(X_train_vec, y_train)

# Evaluate
y_pred = model.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))

# Save model
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

Problems with this code:

  • Everything lives in one cell, so individual components can't be tested in isolation
  • Hard-coded file paths and parameters
  • No error handling
  • No logging
  • Preprocessing logic is difficult to reuse
  • No easy way to swap in a different model or vectorizer

Refactoring Strategy

We’ll refactor this into a modular structure with:

  1. Data loading module - Handles data I/O
  2. Preprocessing module - Text cleaning and transformation
  3. Model training module - Model creation and training
  4. Evaluation module - Metrics and reporting
  5. Configuration - Centralized settings
  6. Main script - Orchestrates everything

Step-by-Step Refactoring

1. Configuration Management

First, let’s extract all configuration into a separate file using Pydantic v2:

# config.py
from pydantic import BaseModel, Field, field_validator
from pathlib import Path

class Config(BaseModel):
    """Configuration model with validation."""
    
    # Data paths
    data_path: str = Field(default="data/sentiment_data.csv", description="Path to training data")
    model_dir: Path = Field(default=Path("models"), description="Directory for saved models")
    
    # Model parameters
    test_size: float = Field(default=0.2, gt=0.0, lt=1.0, description="Test set proportion")
    random_state: int = Field(default=42, description="Random seed for reproducibility")
    max_features: int = Field(default=5000, gt=0, description="Maximum features for vectorizer")
    
    # Output paths
    model_filename: str = Field(default="sentiment_classifier.pkl", description="Model filename")
    vectorizer_filename: str = Field(default="vectorizer.pkl", description="Vectorizer filename")
    
    @field_validator('model_dir', mode='after')
    @classmethod
    def create_model_dir(cls, v: Path) -> Path:
        """Create model directory if it doesn't exist."""
        v.mkdir(parents=True, exist_ok=True)
        return v
    
    @field_validator('data_path')
    @classmethod
    def validate_data_path(cls, v: str) -> str:
        """Validate data path exists."""
        if not Path(v).exists():
            raise ValueError(f"Data file not found: {v}")
        return v
    
    model_config = {
        "frozen": False,  # Allow mutation if needed
        "validate_assignment": True,  # Validate on assignment
    }

Benefits:

  • All configuration in one place
  • Automatic validation of values (test_size between 0 and 1, max_features > 0)
  • Can load from environment variables or config files (see the sketch below)
  • Type-safe with Pydantic models
  • Better error messages for invalid configuration
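
Environment-based loading isn't part of the project itself, but as a sketch: if you install the separate pydantic-settings package (an assumption, not used in the original code), the same fields can be overridden via environment variables:

# settings.py (hypothetical) -- sketch, requires the pydantic-settings package
from pathlib import Path
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    """Same fields as Config, but values can also come from the environment."""
    model_config = SettingsConfigDict(env_prefix="SENTIMENT_")

    data_path: str = "data/sentiment_data.csv"
    model_dir: Path = Path("models")
    test_size: float = 0.2
    max_features: int = 5000

# e.g. SENTIMENT_TEST_SIZE=0.3 python train.py would override test_size
settings = Settings()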

2. Data Loading Module

Extract data loading into a dedicated module with Pydantic validation:

# data_loader.py
import pandas as pd
from pathlib import Path
from typing import Tuple
import logging
from pydantic import BaseModel, Field, field_validator

logger = logging.getLogger(__name__)

class DataConfig(BaseModel):
    """Configuration for data loading."""
    file_path: str = Field(..., description="Path to data file")
    text_column: str = Field(default="text", description="Name of text column")
    label_column: str = Field(default="label", description="Name of label column")
    
    @field_validator('file_path')
    @classmethod
    def validate_file_exists(cls, v: str) -> str:
        """Validate that file exists."""
        path = Path(v)
        if not path.exists():
            raise FileNotFoundError(f"Data file not found: {v}")
        return v

class SplitConfig(BaseModel):
    """Configuration for train-test split."""
    test_size: float = Field(default=0.2, gt=0.0, lt=1.0)
    random_state: int = Field(default=42)

def load_data(config: DataConfig) -> pd.DataFrame:
    """Load data from CSV file with validation."""
    try:
        path = Path(config.file_path)
        df = pd.read_csv(path)
        
        # Validate required columns exist
        if config.text_column not in df.columns:
            raise ValueError(f"Text column '{config.text_column}' not found in data")
        if config.label_column not in df.columns:
            raise ValueError(f"Label column '{config.label_column}' not found in data")
        
        logger.info(f"Loaded {len(df)} rows from {config.file_path}")
        return df
    except Exception as e:
        logger.error(f"Error loading data: {e}")
        raise

def split_data(
    X: pd.Series, 
    y: pd.Series, 
    config: SplitConfig
) -> Tuple[pd.Series, pd.Series, pd.Series, pd.Series]:
    """Split data into training and testing sets."""
    from sklearn.model_selection import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=config.test_size, random_state=config.random_state
    )
    logger.info(f"Split data: {len(X_train)} train, {len(X_test)} test")
    return X_train, X_test, y_train, y_test

Benefits:

  • Reusable data loading logic
  • Proper error handling (see the example below)
  • Logging for debugging
  • Type hints for clarity
  • Easy to test
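
Because DataConfig validates its inputs when it is constructed, bad paths fail fast with a readable error instead of surfacing deep inside the pipeline. A small usage sketch (file names here are illustrative):

# example usage of the data loader (illustrative paths)
from data_loader import DataConfig, load_data

df = load_data(DataConfig(file_path="data/sentiment_data.csv"))

# a missing file is rejected as soon as the config is built
try:
    DataConfig(file_path="does_not_exist.csv")
except Exception as exc:
    print(exc)  # the file_path validator raises before any pandas work happens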

3. Preprocessing Module

Separate preprocessing into its own module with Pydantic models:

# preprocessing.py
import pandas as pd
import re
from typing import Callable
import logging
from pydantic import BaseModel, Field, field_validator

logger = logging.getLogger(__name__)

class TextInput(BaseModel):
    """Input model for text validation."""
    text: str = Field(..., min_length=1, description="Text to preprocess")
    
    @field_validator('text')
    @classmethod
    def validate_text(cls, v: str) -> str:
        """Validate and clean text input."""
        if not v or not v.strip():
            raise ValueError("Text cannot be empty")
        return v.strip()

def clean_text(text: str) -> str:
    """Clean text by lowercasing and removing special characters."""
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

def preprocess_dataframe(df: pd.DataFrame, text_column: str = 'text') -> pd.DataFrame:
    """Preprocess text column in dataframe."""
    df = df.copy()
    df[text_column] = df[text_column].apply(clean_text)
    logger.info(f"Preprocessed {len(df)} rows")
    return df

class TextPreprocessor:
    """Preprocessor class for dependency injection."""
    
    def __init__(self, cleaning_fn: Callable[[str], str] = clean_text):
        self.cleaning_fn = cleaning_fn
    
    def transform(self, text: str) -> str:
        """Transform a single text with validation."""
        # Validate input using Pydantic
        validated = TextInput(text=text)
        return self.cleaning_fn(validated.text)
    
    def transform_batch(self, texts: pd.Series) -> pd.Series:
        """Transform a series of texts."""
        return texts.apply(self.transform)

Benefits:

  • Preprocessing logic is reusable
  • Can inject different cleaning functions (example below)
  • Easy to test individual functions
  • Can extend with more preprocessing steps
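
The cleaning_fn argument is the dependency-injection hook. As a sketch, a stricter cleaner (the digit-stripping rule here is purely illustrative) can be swapped in without touching the class:

# example: injecting a custom cleaning function (illustrative)
import re
from preprocessing import TextPreprocessor, clean_text

def clean_text_no_digits(text: str) -> str:
    """Reuse the default cleaner, then also strip digits."""
    return re.sub(r'\d+', '', clean_text(text))

preprocessor = TextPreprocessor(cleaning_fn=clean_text_no_digits)
print(preprocessor.transform("Order #42 was GREAT!"))  # lowercased, punctuation and digits removed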

4. Model Training Module

Create a flexible model training module:

# model_trainer.py
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator
from typing import Tuple, Any
import logging
import pickle
from pathlib import Path

logger = logging.getLogger(__name__)

class ModelTrainer:
    """Trainer class for ML models."""
    
    def __init__(
        self,
        vectorizer: Any = None,
        model: BaseEstimator | None = None,
        max_features: int = 5000
    ):
        self.vectorizer = vectorizer or TfidfVectorizer(max_features=max_features)
        self.model = model or LogisticRegression()
        self.is_fitted = False
    
    def fit(self, X_train: pd.Series, y_train: pd.Series) -> 'ModelTrainer':
        """Train the model."""
        logger.info("Fitting vectorizer and model...")
        X_train_vec = self.vectorizer.fit_transform(X_train)
        self.model.fit(X_train_vec, y_train)
        self.is_fitted = True
        logger.info("Model training complete")
        return self
    
    def transform(self, X: pd.Series) -> Any:
        """Transform text to features."""
        if not self.is_fitted:
            raise ValueError("Model must be fitted before transform")
        return self.vectorizer.transform(X)
    
    def predict(self, X: pd.Series) -> Any:
        """Make predictions."""
        X_vec = self.transform(X)
        return self.model.predict(X_vec)
    
    def save(self, model_path: Path, vectorizer_path: Path) -> None:
        """Save model and vectorizer."""
        with open(model_path, 'wb') as f:
            pickle.dump(self.model, f)
        with open(vectorizer_path, 'wb') as f:
            pickle.dump(self.vectorizer, f)
        logger.info(f"Saved model to {model_path} and vectorizer to {vectorizer_path}")
    
    @classmethod
    def load(cls, model_path: Path, vectorizer_path: Path) -> Tuple[Any, Any]:
        """Load model and vectorizer."""
        with open(model_path, 'rb') as f:
            model = pickle.load(f)
        with open(vectorizer_path, 'rb') as f:
            vectorizer = pickle.load(f)
        logger.info(f"Loaded model from {model_path} and vectorizer from {vectorizer_path}")
        return model, vectorizer

Benefits:

  • Can inject different vectorizers or models (example below)
  • Encapsulates training logic
  • Easy to extend with new model types
  • Proper state management (is_fitted flag)
  • Save/load functionality
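
Because the vectorizer and model are constructor arguments, trying a different estimator is a one-line change. A self-contained sketch (LinearSVC and CountVectorizer are just one alternative pairing):

# example: injecting a different vectorizer and model (illustrative)
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from model_trainer import ModelTrainer

X_train = pd.Series(["good movie", "bad movie", "great film", "terrible film"])
y_train = pd.Series([1, 0, 1, 0])

trainer = ModelTrainer(
    vectorizer=CountVectorizer(ngram_range=(1, 2)),
    model=LinearSVC(),
)
trainer.fit(X_train, y_train)  # same fit/predict/save API as before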

5. Evaluation Module

Separate evaluation logic:

# evaluator.py
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from typing import Any, Dict
import logging

logger = logging.getLogger(__name__)

class ModelEvaluator:
    """Evaluate model performance."""
    
    @staticmethod
    def evaluate(y_true: Any, y_pred: Any) -> Dict[str, Any]:
        """Calculate evaluation metrics."""
        accuracy = accuracy_score(y_true, y_pred)
        report = classification_report(y_true, y_pred, output_dict=True)
        
        metrics = {
            'accuracy': accuracy,
            'classification_report': report
        }
        
        logger.info(f"Model accuracy: {accuracy:.4f}")
        return metrics
    
    @staticmethod
    def print_report(y_true: Any, y_pred: Any) -> None:
        """Print detailed classification report."""
        print(classification_report(y_true, y_pred))
        print("\nConfusion Matrix:")
        print(confusion_matrix(y_true, y_pred))

Benefits:

  • Reusable evaluation logic
  • Can extend with more metrics (sketch below)
  • Separates evaluation from training
  • Easy to test
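
Extending the evaluator is just a matter of adding entries to the returned metrics dictionary. A minimal sketch adding macro-averaged F1 on top of the existing metrics:

# example: extending ModelEvaluator with an extra metric (sketch)
from typing import Any, Dict
from sklearn.metrics import f1_score
from evaluator import ModelEvaluator

class ExtendedEvaluator(ModelEvaluator):
    """Adds macro-averaged F1 on top of the base metrics."""

    @staticmethod
    def evaluate(y_true: Any, y_pred: Any) -> Dict[str, Any]:
        metrics = ModelEvaluator.evaluate(y_true, y_pred)
        metrics['f1_macro'] = f1_score(y_true, y_pred, average='macro')
        return metrics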

6. Main Script

Now we can create a clean main script that orchestrates everything:

# train.py
import logging
from pathlib import Path
from config import Config
from data_loader import load_data, split_data, DataConfig, SplitConfig
from preprocessing import preprocess_dataframe
from model_trainer import ModelTrainer
from evaluator import ModelEvaluator

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

def main():
    """Main training pipeline."""
    # Load and validate configuration
    config = Config()
    
    # Load data with validation
    data_config = DataConfig(file_path=config.data_path)
    df = load_data(data_config)
    
    # Preprocess
    df = preprocess_dataframe(df, text_column=data_config.text_column)
    
    # Prepare features and labels
    X = df[data_config.text_column]
    y = df[data_config.label_column]
    
    # Split data with validation
    split_config = SplitConfig(
        test_size=config.test_size,
        random_state=config.random_state
    )
    X_train, X_test, y_train, y_test = split_data(X, y, split_config)
    
    # Train model
    trainer = ModelTrainer(max_features=config.max_features)
    trainer.fit(X_train, y_train)
    
    # Evaluate
    y_pred = trainer.predict(X_test)
    evaluator = ModelEvaluator()
    metrics = evaluator.evaluate(y_test, y_pred)
    evaluator.print_report(y_test, y_pred)
    
    # Save model
    model_path = config.model_dir / config.model_filename
    vectorizer_path = config.model_dir / config.vectorizer_filename
    trainer.save(model_path, vectorizer_path)
    
    logging.info("Training pipeline complete")

if __name__ == "__main__":
    main()

Benefits:

  • Clean, readable main script
  • Easy to understand the pipeline
  • Each step is modular and testable
  • Can easily add new steps

Project Structure

After refactoring, your project structure looks like this:

sentiment-classifier/
├── config.py           # Configuration
├── data_loader.py      # Data loading
├── preprocessing.py    # Text preprocessing
├── model_trainer.py    # Model training
├── evaluator.py        # Model evaluation
├── train.py           # Main training script
├── predict.py         # Inference script (we'll add this)
└── tests/             # Unit tests
    ├── test_preprocessing.py
    ├── test_model_trainer.py
    └── test_evaluator.py

Adding Tests

Now that the code is modular, we can easily write tests:

# tests/test_preprocessing.py
import pytest
from preprocessing import clean_text, preprocess_dataframe
import pandas as pd

def test_clean_text():
    """Test text cleaning function."""
    assert clean_text("Hello World!") == "hello world"
    assert clean_text("Test@123") == "test123"
    assert clean_text("") == ""

def test_preprocess_dataframe():
    """Test dataframe preprocessing."""
    df = pd.DataFrame({
        'text': ['Hello World!', 'Test@123'],
        'label': [1, 0]
    })
    result = preprocess_dataframe(df)
    assert result['text'].iloc[0] == "hello world"
    assert result['text'].iloc[1] == "test123"

# tests/test_model_trainer.py
import pytest
import pandas as pd
from model_trainer import ModelTrainer

def test_model_trainer_fit():
    """Test model training."""
    X_train = pd.Series(["good movie", "bad movie", "great film"])
    y_train = pd.Series([1, 0, 1])
    
    trainer = ModelTrainer()
    trainer.fit(X_train, y_train)
    
    assert trainer.is_fitted is True

def test_model_trainer_predict():
    """Test model prediction."""
    X_train = pd.Series(["good", "bad"])
    y_train = pd.Series([1, 0])
    X_test = pd.Series(["good movie"])
    
    trainer = ModelTrainer()
    trainer.fit(X_train, y_train)
    predictions = trainer.predict(X_test)
    
    assert len(predictions) == 1
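
The tests/test_evaluator.py file listed in the project structure follows the same pattern; a minimal sketch:

# tests/test_evaluator.py
from evaluator import ModelEvaluator

def test_evaluate_perfect_predictions():
    """Perfect predictions should give accuracy 1.0."""
    y_true = [1, 0, 1, 0]
    y_pred = [1, 0, 1, 0]
    metrics = ModelEvaluator.evaluate(y_true, y_pred)
    assert metrics['accuracy'] == 1.0
    assert 'classification_report' in metrics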

Adding an Inference Script

With the refactored code, creating an inference script is straightforward. We’ll use Pydantic for input validation:

# predict.py
import logging
from pathlib import Path
from model_trainer import ModelTrainer
from preprocessing import TextPreprocessor
from config import Config
from pydantic import BaseModel, Field, field_validator
import pandas as pd

logging.basicConfig(level=logging.INFO)

class PredictionRequest(BaseModel):
    """Request model for predictions."""
    text: str = Field(..., min_length=1, description="Text to classify")
    model_path: Path | None = Field(default=None, description="Optional custom model path")
    
    @field_validator('text')
    @classmethod
    def validate_text(cls, v: str) -> str:
        """Validate text input."""
        if not v or not v.strip():
            raise ValueError("Text cannot be empty")
        return v.strip()

class PredictionResponse(BaseModel):
    """Response model for predictions."""
    text: str = Field(..., description="Original text")
    prediction: str = Field(..., description="Predicted label")
    confidence: float | None = Field(default=None, description="Prediction confidence")

def predict(request: PredictionRequest, config: Config | None = None) -> PredictionResponse:
    """Predict sentiment for a single text with validation."""
    if config is None:
        config = Config()
    
    # Load model and vectorizer
    if request.model_path:
        model_path = request.model_path
        vectorizer_path = request.model_path.parent / config.vectorizer_filename
    else:
        model_path = config.model_dir / config.model_filename
        vectorizer_path = config.model_dir / config.vectorizer_filename
    
    model, vectorizer = ModelTrainer.load(model_path, vectorizer_path)
    
    # Preprocess text with validation
    preprocessor = TextPreprocessor()
    cleaned_text = preprocessor.transform(request.text)
    
    # Predict
    trainer = ModelTrainer(vectorizer=vectorizer, model=model)
    trainer.is_fitted = True  # Mark as fitted since we loaded it
    prediction = trainer.predict(pd.Series([cleaned_text]))[0]
    
    return PredictionResponse(
        text=request.text,
        prediction=str(prediction)  # labels may be ints; the response model expects str
    )

if __name__ == "__main__":
    import sys
    text = sys.argv[1] if len(sys.argv) > 1 else "This movie is great!"
    request = PredictionRequest(text=text)
    result = predict(request)
    print(f"Text: {result.text}")
    print(f"Prediction: {result.prediction}")

Key Refactoring Principles Applied

  1. Separation of Concerns - Each module has a single responsibility
  2. Dependency Injection - Components can be swapped easily
  3. Error Handling - Proper exception handling throughout
  4. Logging - Structured logging for debugging
  5. Type Hints - Better code documentation and IDE support
  6. Testability - Each component can be tested independently
  7. Configuration Management - Centralized configuration
  8. Reusability - Components can be reused in different contexts

Benefits of Refactored Code

Before (Notebook) → After (Refactored):

  • Hard to test → Easy to test each component
  • Hard to reuse → Reusable modules
  • Hard to maintain → Maintainable structure
  • Everything coupled together → Loose coupling
  • No error handling → Proper error handling
  • Hard-coded values → Configurable parameters
  • Not production-ready → Production-ready

Conclusion

Refactoring notebook code into modular Python code is essential for production ML systems. By applying software engineering principles, we create code that’s testable, maintainable, and reusable. The refactored code is easier to debug, extend, and deploy.

While notebooks are great for exploration, production code should be modular, tested, and well-organized. This refactoring approach can be applied to any ML project, making your codebase more professional and maintainable.

Thanks for reading!