Introduction
In this post, I’ll be implementing a basic Convolutional Neural Network (CNN) from scratch in Python. This is the eleventh post in the “Machine Learning from Scratch” series.
CNNs revolutionized computer vision by automatically learning spatial hierarchies of features. They are designed for grid-like data such as images, using convolution operations in place of fully connected layers.
Convolutional Neural Network
A CNN uses convolutional layers that apply filters to detect features like edges, textures, and patterns. The key components are:
- Convolutional Layer: Applies filters to extract features from input
- Activation Function: Introduces non-linearity (typically ReLU)
- Pooling Layer: Reduces spatial dimensions while retaining important features
- Fully Connected Layer: Makes final predictions based on learned features
Convolution operations share the same weights across all spatial positions, which makes CNNs parameter-efficient and translation-equivariant; pooling then adds a degree of translation invariance. The short example below illustrates the parameter savings.
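To make the parameter-efficiency claim concrete, here is a quick back-of-the-envelope comparison (the 28×28 input size is just an illustrative choice, not the dataset used later in this post):

# Parameters for one 3x3 convolutional filter vs. a fully connected layer
# mapping a 28x28 input to a 28x28 output (illustrative sizes only)
conv_params = 3 * 3 + 1                        # 9 shared weights + 1 bias = 10
fc_params = (28 * 28) * (28 * 28) + 28 * 28    # 614,656 weights + 784 biases = 615,440
print(conv_params, fc_params)

The convolutional filter has the same 10 parameters no matter how large the input image is, because the same weights are reused at every position.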
Implementation
I’m using numpy for numerical computations. This is a simplified CNN with basic convolution and max pooling operations for demonstration purposes: the convolutional filters are randomly initialized and kept fixed, and only the fully connected output layer is trained.
The implementation includes helper functions for convolution and pooling, plus a simple CNN class:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
def conv2d(image, kernel):
    # Valid (no padding) 2D convolution of a single-channel image with one kernel
    image_height, image_width = image.shape
    kernel_height, kernel_width = kernel.shape
    output_height = image_height - kernel_height + 1
    output_width = image_width - kernel_width + 1
    output = np.zeros((output_height, output_width))
    for i in range(output_height):
        for j in range(output_width):
            # Multiply the kernel element-wise with the current patch and sum
            region = image[i:i+kernel_height, j:j+kernel_width]
            output[i, j] = np.sum(region * kernel)
    return output

def max_pool2d(image, pool_size=2):
    # Downsample by taking the maximum over non-overlapping pool_size x pool_size windows
    image_height, image_width = image.shape
    output_height = image_height // pool_size
    output_width = image_width // pool_size
    output = np.zeros((output_height, output_width))
    for i in range(output_height):
        for j in range(output_width):
            region = image[
                i*pool_size:(i+1)*pool_size,
                j*pool_size:(j+1)*pool_size
            ]
            output[i, j] = np.max(region)
    return output

def relu(x):
    return np.maximum(0, x)

class CNN:
    def __init__(self, num_filters=8, filter_size=3, pool_size=2, num_classes=10, lr=0.01, num_iter=100):
        self.num_filters = num_filters
        self.filter_size = filter_size
        self.pool_size = pool_size
        self.num_classes = num_classes
        self.lr = lr
        self.num_iter = num_iter
        # Randomly initialized filters; they are not updated during training in this simplified version
        self.filters = np.random.randn(num_filters, filter_size, filter_size) * 0.1
        self.fc_weights = None
        self.fc_bias = None

    def _forward_conv_pool(self, image):
        # Convolve with each filter, apply ReLU, max-pool, then flatten into one feature vector
        conv_outputs = []
        for f in range(self.num_filters):
            conv_out = conv2d(image, self.filters[f])
            conv_out = relu(conv_out)
            pooled = max_pool2d(conv_out, self.pool_size)
            conv_outputs.append(pooled)
        return np.array(conv_outputs).flatten()

    def fit(self, X, y):
        num_samples = X.shape[0]
        # Run one sample through the conv/pool stage to determine the flattened feature size
        first_sample_features = self._forward_conv_pool(X[0])
        feature_size = len(first_sample_features)
        self.fc_weights = np.random.randn(feature_size, self.num_classes) * 0.01
        self.fc_bias = np.zeros((1, self.num_classes))
        for iteration in range(self.num_iter):
            for idx in range(num_samples):
                features = self._forward_conv_pool(X[idx])
                logits = np.dot(features, self.fc_weights) + self.fc_bias
                # Softmax, shifted by the max logit for numerical stability
                exp_logits = np.exp(logits - np.max(logits))
                probs = exp_logits / np.sum(exp_logits)
                # One-hot target; softmax + cross-entropy gives the simple gradient probs - target
                target = np.zeros((1, self.num_classes))
                target[0, y[idx]] = 1
                dlogits = probs - target
                dW = np.outer(features, dlogits)
                db = dlogits
                # SGD update on the fully connected layer only
                self.fc_weights -= self.lr * dW
                self.fc_bias -= self.lr * db

    def predict(self, X):
        predictions = []
        for idx in range(X.shape[0]):
            features = self._forward_conv_pool(X[idx])
            logits = np.dot(features, self.fc_weights) + self.fc_bias
            pred = np.argmax(logits)
            predictions.append(pred)
        return np.array(predictions)
Now let’s test the CNN on the digits dataset.
def accuracy(y_test, predictions):
    return np.sum(y_test == predictions) / len(y_test)

if __name__ == '__main__':
    digits = datasets.load_digits()
    X = digits.images
    y = digits.target
    # Scale pixel values from 0-16 to the [0, 1] range
    X = X / 16.0
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = CNN(num_filters=4, filter_size=3, pool_size=2, num_classes=10, lr=0.01, num_iter=50)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    acc = accuracy(y_test, predictions)
    print(f"Accuracy: {acc}")
This basic CNN learns to recognize handwritten digits. Modern CNNs use many more sophisticated techniques including batch normalization, dropout, multiple convolutional layers, residual connections, and advanced optimizers. Frameworks like PyTorch and TensorFlow implement these efficiently with GPU acceleration.
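For comparison, here is a minimal sketch of roughly the same architecture in PyTorch (assuming torch is installed; the layer sizes mirror the 8×8 digits images used above):

import torch
import torch.nn as nn

class TorchCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # One conv -> ReLU -> max-pool stage, mirroring the from-scratch model
        self.features = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3),  # 1x8x8 -> 4x6x6
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),                              # 4x6x6 -> 4x3x3
        )
        self.classifier = nn.Linear(4 * 3 * 3, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)
        return self.classifier(x)

# Example: logits for a batch of 16 single-channel 8x8 images
# logits = TorchCNN()(torch.randn(16, 1, 8, 8))

Unlike the from-scratch version, the convolutional filters here would also be trained by backpropagation, and additions like batch normalization or dropout are a single extra layer each.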
Building a CNN from scratch demonstrates the core concepts of convolution and pooling, but for practical applications, using established frameworks is recommended.
That’s all for this post. Thanks for reading!