3 min read

Machine Learning from Scratch: Naive Bayes

Table of Contents

  • Introduction
  • Naive Bayes
  • Implementation

Introduction

In this post, I’ll be implementing the Naive Bayes classifier from scratch in Python. This is the fourth post in the “Machine Learning from Scratch” series.

Naive Bayes is a probabilistic classifier based on Bayes’ theorem with strong independence assumptions between features. Despite its simplicity, it performs surprisingly well on many real-world problems, especially text classification.

Naive Bayes

Naive Bayes applies Bayes’ theorem with the naive assumption that all features are independent of each other given the class label. While this assumption is rarely true in practice, the algorithm often works well anyway.

Bayes’ theorem states:

P(y|X) = P(X|y) * P(y) / P(X)

For classification, we calculate the posterior probability P(y|X) for each class and choose the class with the highest value. Since the evidence P(X) is the same for every class, it can be dropped from the comparison. The “naive” assumption lets us factor the likelihood P(X|y) into a product of per-feature probabilities P(x_i|y).
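
In practice, we compare log-probabilities rather than raw products to avoid numerical underflow when multiplying many small terms. This gives the decision rule implemented in the code below:

y = argmax over classes y of [ log P(y) + sum_i log P(x_i|y) ]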

Implementation

I’m using numpy for numerical computations. For testing, I’ll use train_test_split and datasets from scikit-learn. Since the features are continuous, each per-feature likelihood P(x_i|y) is modeled as a Gaussian with a per-class mean and variance (i.e., Gaussian Naive Bayes).

The NaiveBayes class has the following methods:

  • fit: Calculates the class priors and the per-class mean and variance of each feature from the training data.
  • predict: Makes a prediction for each sample using Bayes’ theorem.
  • _predict and _pdf: Helpers that compute the log-posterior for a single sample and the Gaussian likelihood of its features.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets

class NaiveBayes:
    def fit(self, X, y):
        # Estimate, for each class, the prior P(y) and the per-feature
        # mean and variance used by the Gaussian likelihood.
        num_samples, num_features = X.shape
        self.classes = np.unique(y)
        num_classes = len(self.classes)

        self.mean = np.zeros((num_classes, num_features), dtype=np.float64)
        self.var = np.zeros((num_classes, num_features), dtype=np.float64)
        self.priors = np.zeros(num_classes, dtype=np.float64)

        for idx, c in enumerate(self.classes):
            X_c = X[y == c]
            self.mean[idx, :] = X_c.mean(axis=0)
            self.var[idx, :] = X_c.var(axis=0)
            self.priors[idx] = X_c.shape[0] / float(num_samples)

    def predict(self, X):
        # Classify each sample independently.
        y_pred = [self._predict(x) for x in X]
        return np.array(y_pred)

    def _predict(self, x):
        posteriors = []

        # Compute the log-posterior for each class:
        # log P(y) + sum_i log P(x_i | y), then pick the largest.
        for idx, c in enumerate(self.classes):
            prior = np.log(self.priors[idx])
            posterior = prior + np.sum(np.log(self._pdf(idx, x)))
            posteriors.append(posterior)

        return self.classes[np.argmax(posteriors)]

    def _pdf(self, class_idx, x):
        # Gaussian probability density of each feature under the given class.
        # Assumes var > 0; a feature that is constant within a class would
        # cause a division by zero here.
        mean = self.mean[class_idx]
        var = self.var[class_idx]
        numerator = np.exp(-((x - mean) ** 2) / (2 * var))
        denominator = np.sqrt(2 * np.pi * var)
        return numerator / denominator
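
The _pdf helper evaluates the Gaussian density used for each per-feature likelihood:

P(x_i|y) = exp(-(x_i - mean)^2 / (2 * var)) / sqrt(2 * pi * var)

where mean and var are the per-class statistics computed in fit.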

Now let’s test the model on a classification dataset. I’ll use make_classification to generate synthetic data.

def accuracy(y_test, predictions):
    return np.sum(y_test == predictions) / len(y_test)


if __name__ == '__main__':
    X, y = datasets.make_classification(
        n_samples=1000, n_features=10, n_classes=2, random_state=42
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = NaiveBayes()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    acc = accuracy(y_test, predictions)
    print(f"Accuracy: {acc}")

The model achieves strong accuracy on the test set. Naive Bayes is particularly effective when the independence assumption roughly holds, or when you have limited training data. It’s also very fast to train and makes predictions quickly.
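
If you want to sanity-check the from-scratch implementation, one option is to compare it against scikit-learn’s GaussianNB on the same split. A minimal sketch, reusing the variables from the __main__ block above:

from sklearn.naive_bayes import GaussianNB

# Reference Gaussian Naive Bayes implementation for comparison.
sk_model = GaussianNB()
sk_model.fit(X_train, y_train)
sk_predictions = sk_model.predict(X_test)
print(f"scikit-learn accuracy: {accuracy(y_test, sk_predictions)}")

The two accuracies should be very close, since both model each feature with a per-class Gaussian.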

That’s all for this post. Thanks for reading!