Machine Learning from Scratch: Convolutional Neural Network

Introduction

In this post, I’ll be implementing a basic Convolutional Neural Network (CNN) from scratch in Python. This is the eleventh post in the “Machine Learning from Scratch” series.

CNNs revolutionized computer vision by automatically learning spatial hierarchies of features from images. They are designed for grid-like data such as images, and they replace most fully connected layers with convolution operations that look at small local regions of the input.

Convolutional Neural Network

A CNN uses convolutional layers that apply filters to detect features like edges, textures, and patterns. The key components are:

  1. Convolutional Layer: Applies filters to extract features from input
  2. Activation Function: Introduces non-linearity (typically ReLU)
  3. Pooling Layer: Reduces spatial dimensions while retaining important features
  4. Fully Connected Layer: Makes final predictions based on learned features

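Convolution operations share weights across the spatial dimensions, which makes CNNs parameter-efficient and translation-equivariant: shifting the input shifts the feature maps by the same amount. Pooling then adds a degree of invariance to small shifts.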
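To make the weight-sharing point concrete, here is a small sketch. The sizes and the helper name `cross_correlate_valid` are my own illustrative assumptions, not part of the implementation in this post. It compares the number of weights in a convolutional layer with a dense layer that produces the same number of outputs, then checks that shifting the input by one pixel shifts the feature map by one pixel.

```python
import numpy as np

def cross_correlate_valid(image, kernel):
    """Slide `kernel` over `image` without padding (illustrative helper)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Assumed sizes: one 8x8 image and four 3x3 filters.
image_size, filter_size, num_filters = 8, 3, 4
out_size = image_size - filter_size + 1  # 6

# Convolution: each filter reuses its 9 weights at every position.
conv_weights = num_filters * filter_size ** 2
# Dense layer with the same input and output sizes: one weight per input/output pair.
dense_weights = image_size ** 2 * (num_filters * out_size ** 2)
print(conv_weights, dense_weights)  # 36 9216

rng = np.random.default_rng(0)
image = rng.random((image_size, image_size))
kernel = rng.standard_normal((filter_size, filter_size))

# Shift the image one pixel to the right (np.roll wraps the last column around).
shifted = np.roll(image, 1, axis=1)
out = cross_correlate_valid(image, kernel)
out_shifted = cross_correlate_valid(shifted, kernel)

# Away from the wrapped column, the feature map moves by the same one pixel.
print(np.allclose(out_shifted[:, 1:], out[:, :-1]))  # True
```

The first output column of the shifted result is excluded from the comparison because it is the only one that sees the column `np.roll` wrapped around.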

Implementation

I’m using numpy for numerical computations. This is a simplified CNN with basic convolution and max pooling operations for demonstration purposes.
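Before reading the code, it may help to trace how the array shapes change through the network. The sketch below assumes an 8x8 input (the size of the scikit-learn digits images used later), 3x3 filters, 2x2 pooling, and 4 filters. The two helper names are hypothetical and only mirror the size arithmetic used by the convolution and pooling functions in this post.

```python
def conv_output_size(size, kernel_size):
    """Output length of an unpadded, stride-1 convolution along one axis."""
    return size - kernel_size + 1

def pool_output_size(size, pool_size):
    """Output length of non-overlapping pooling; leftover rows/columns are dropped."""
    return size // pool_size

# Assumed example configuration.
image_size, filter_size, pool_size, num_filters = 8, 3, 2, 4

conv_size = conv_output_size(image_size, filter_size)  # 8 - 3 + 1 = 6
pooled_size = pool_output_size(conv_size, pool_size)   # 6 // 2 = 3
feature_size = num_filters * pooled_size ** 2          # 4 * 3 * 3 = 36

print(conv_size, pooled_size, feature_size)  # 6 3 36
```

With these settings, each image ends up as a vector of 36 features that feeds the fully connected layer.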

The implementation includes helper functions for convolution and pooling, plus a simple CNN class:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets

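def conv2d(image, kernel):
    # "Valid" 2D cross-correlation (no padding, stride 1, kernel not flipped),
    # which is what deep learning libraries usually call convolution.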
    image_height, image_width = image.shape
    kernel_height, kernel_width = kernel.shape
    
    output_height = image_height - kernel_height + 1
    output_width = image_width - kernel_width + 1
    
    output = np.zeros((output_height, output_width))
    
    for i in range(output_height):
        for j in range(output_width):
            region = image[i:i+kernel_height, j:j+kernel_width]
            output[i, j] = np.sum(region * kernel)
    
    return output


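def max_pool2d(image, pool_size=2):
    # Non-overlapping max pooling; trailing rows/columns that do not fill
    # a complete window are dropped.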
    image_height, image_width = image.shape
    
    output_height = image_height // pool_size
    output_width = image_width // pool_size
    
    output = np.zeros((output_height, output_width))
    
    for i in range(output_height):
        for j in range(output_width):
            region = image[
                i*pool_size:(i+1)*pool_size,
                j*pool_size:(j+1)*pool_size
            ]
            output[i, j] = np.max(region)
    
    return output


def relu(x):
    return np.maximum(0, x)


class CNN:
    def __init__(self, num_filters=8, filter_size=3, pool_size=2, num_classes=10, lr=0.01, num_iter=100):
        self.num_filters = num_filters
        self.filter_size = filter_size
        self.pool_size = pool_size
        self.num_classes = num_classes
        self.lr = lr
        self.num_iter = num_iter
        
        self.filters = np.random.randn(num_filters, filter_size, filter_size) * 0.1
        self.fc_weights = None
        self.fc_bias = None

    def _forward_conv_pool(self, image):
        # conv -> ReLU -> max pool for each filter, flattened into one feature vector
        conv_outputs = []
        for f in range(self.num_filters):
            conv_out = conv2d(image, self.filters[f])
            conv_out = relu(conv_out)
            pooled = max_pool2d(conv_out, self.pool_size)
            conv_outputs.append(pooled)
        
        return np.array(conv_outputs).flatten()

    def fit(self, X, y):
        num_samples = X.shape[0]
        
        first_sample_features = self._forward_conv_pool(X[0])
        feature_size = len(first_sample_features)
        
        self.fc_weights = np.random.randn(feature_size, self.num_classes) * 0.01
        self.fc_bias = np.zeros((1, self.num_classes))
        
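        # The filters are never updated, so each sample's features are constant:
        # compute them once instead of on every pass over the data.
        all_features = [self._forward_conv_pool(X[idx]) for idx in range(num_samples)]

        for iteration in range(self.num_iter):
            for idx in range(num_samples):
                features = all_features[idx]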
                
                logits = np.dot(features, self.fc_weights) + self.fc_bias
                exp_logits = np.exp(logits - np.max(logits))
                probs = exp_logits / np.sum(exp_logits)
                
                target = np.zeros((1, self.num_classes))
                target[0, y[idx]] = 1
                
                dlogits = probs - target
                
                dW = np.outer(features, dlogits)
                db = dlogits
                
                self.fc_weights -= self.lr * dW
                self.fc_bias -= self.lr * db

    def predict(self, X):
        predictions = []
        for idx in range(X.shape[0]):
            features = self._forward_conv_pool(X[idx])
            logits = np.dot(features, self.fc_weights) + self.fc_bias
            pred = np.argmax(logits)
            predictions.append(pred)
        
        return np.array(predictions)
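The update in `fit` relies on the fact that, for softmax followed by cross-entropy loss, the gradient with respect to the logits is simply `probs - target`. As a sanity check, the sketch below compares that formula with a finite-difference estimate. The helper functions, the random logits, and the label are hypothetical and exist only for this check.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def cross_entropy(logits, label):
    return -np.log(softmax(logits)[label])

rng = np.random.default_rng(0)
logits = rng.standard_normal(10)
label = 3

# Analytic gradient: probabilities minus the one-hot target.
target = np.zeros(10)
target[label] = 1.0
analytic = softmax(logits) - target

# Numerical gradient by central differences.
eps = 1e-6
numeric = np.zeros(10)
for k in range(10):
    step = np.zeros(10)
    step[k] = eps
    numeric[k] = (cross_entropy(logits + step, label)
                  - cross_entropy(logits - step, label)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```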

Now let’s test the CNN on the digits dataset.

def accuracy(y_test, predictions):
    # Fraction of predictions that match the true labels.
    return np.mean(y_test == predictions)


if __name__ == '__main__':
    digits = datasets.load_digits()
    X = digits.images
    y = digits.target
    
    X = X / 16.0
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = CNN(num_filters=4, filter_size=3, pool_size=2, num_classes=10, lr=0.01, num_iter=50)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    acc = accuracy(y_test, predictions)
    print(f"Accuracy: {acc}")
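The nested Python loops in `conv2d` run once per output pixel, which becomes a bottleneck as images grow. One possible speed-up, sketched below, uses NumPy's `sliding_window_view` (available in NumPy 1.20 and later) to extract every patch at once. `conv2d_loop` is a copy of the loop version kept for comparison, and `conv2d_vectorized` is a hypothetical name for the alternative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_loop(image, kernel):
    """Reference version: the same nested loops as conv2d in this post."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def conv2d_vectorized(image, kernel):
    """Same result without Python loops."""
    # windows has shape (out_h, out_w, kh, kw): one kernel-sized patch per output pixel.
    windows = sliding_window_view(image, kernel.shape)
    # Multiply each patch by the kernel and sum over the patch axes.
    return np.einsum('ijkl,kl->ij', windows, kernel)

rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = rng.standard_normal((3, 3))

print(np.allclose(conv2d_vectorized(image, kernel), conv2d_loop(image, kernel)))  # True
```

The window view does not copy the image data, so the extra memory cost is small. The loop version remains easier to read, which is the point of a from-scratch post.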

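This basic CNN learns to recognize handwritten digits, although only the fully connected layer is actually trained: the convolution filters keep their random initial values and act as a fixed feature extractor. Modern CNNs train their filters with backpropagation and add many more techniques, including batch normalization, dropout, multiple convolutional layers, residual connections, and advanced optimizers. Frameworks like PyTorch and TensorFlow implement these efficiently with GPU acceleration.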
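In the class above, `fit` updates only `fc_weights` and `fc_bias`. Training `self.filters` too would need the gradient of the loss with respect to each kernel. The sketch below covers the convolution step only (a full backward pass would also have to go through ReLU and max pooling) and checks the formula against finite differences. The function names, the stand-in loss, and the upstream gradient `grad_out` are all hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_valid(image, kernel):
    """Unpadded, stride-1 cross-correlation (same operation as conv2d in this post)."""
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum('ijkl,kl->ij', windows, kernel)

def kernel_gradient(image, grad_out):
    """dL/dkernel for out = conv2d_valid(image, kernel), given grad_out = dL/dout.

    dL/dK[a, b] = sum over (i, j) of grad_out[i, j] * image[i + a, j + b],
    which is itself an unpadded cross-correlation of the image with grad_out.
    """
    return conv2d_valid(image, grad_out)

rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = rng.standard_normal((3, 3))
grad_out = rng.standard_normal((6, 6))  # hypothetical upstream gradient dL/dout

def loss(k):
    # Stand-in loss whose gradient with respect to the conv output is grad_out.
    return np.sum(conv2d_valid(image, k) * grad_out)

# Finite-difference estimate of dL/dkernel, one entry at a time.
eps = 1e-6
numeric = np.zeros_like(kernel)
for a in range(3):
    for b in range(3):
        step = np.zeros_like(kernel)
        step[a, b] = eps
        numeric[a, b] = (loss(kernel + step) - loss(kernel - step)) / (2 * eps)

print(np.allclose(kernel_gradient(image, grad_out), numeric, atol=1e-6))  # True
```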

Building a CNN from scratch demonstrates the core concepts of convolution and pooling, but for practical applications, using established frameworks is recommended.

That’s all for this post. Thanks for reading!