Introduction
In this post, I’ll be implementing K-Nearest Neighbors (KNN) from scratch in Python. This is the third post in the “Machine Learning from Scratch” series.
KNN is one of the simplest machine learning algorithms and is often used as a baseline for more complex models. It’s intuitive, easy to understand, and requires no training phase.
K-Nearest Neighbors
K-Nearest Neighbors is a non-parametric algorithm that can be used for both classification and regression. The algorithm works by finding the K closest training examples to a new data point and making predictions based on those neighbors.
For classification, KNN assigns the most common class among the K nearest neighbors. For regression, it returns the average of the K nearest neighbors’ values.
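The implementation in this post handles the classification case, but the regression variant only changes that final step: average the neighbors' target values instead of taking a majority vote. As a minimal sketch (the standalone helper knn_regress_one is my own illustration, not part of the class built below):

import numpy as np

def knn_regress_one(x, X_train, y_train, k=3):
    # Distance from the query point to every training example
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # Average the target values of the k closest training points
    k_indices = np.argsort(distances)[:k]
    return np.mean(y_train[k_indices])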
The distance metric typically used is Euclidean distance, though other metrics like Manhattan distance can also be used.
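For a pair of feature vectors, both metrics are easy to compute with NumPy; here is a quick sketch (the example numbers are arbitrary, and only Euclidean distance is used in the implementation below):

import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x1 - x2) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(x1 - x2))          # sum of absolute coordinate differences
print(euclidean, manhattan)                  # ~3.606 and 5.0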
Implementation
I’m using numpy for numerical computations and the Counter class from collections to find the most common class. For testing, I’ll use train_test_split and datasets from scikit-learn.
The KNN class has the following methods:
__init__: constructor that sets the number of neighbors K.
fit: stores the training data (KNN is a lazy learner, so there is no real training step).
predict: makes predictions by finding the K nearest neighbors of each query point.
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn import datasets
class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # KNN is a lazy learner: fitting just stores the training data
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        predictions = [self._predict(x) for x in X]
        return np.array(predictions)

    def _predict(self, x):
        # Distance from x to every training example
        distances = [self._euclidean_distance(x, x_train) for x_train in self.X_train]
        # Indices of the k closest training examples
        k_indices = np.argsort(distances)[:self.k]
        # Majority vote among the labels of those neighbors
        k_nearest_labels = [self.y_train[i] for i in k_indices]
        most_common = Counter(k_nearest_labels).most_common(1)
        return most_common[0][0]

    def _euclidean_distance(self, x1, x2):
        return np.sqrt(np.sum((x1 - x2) ** 2))
Now let’s test the model on a classic dataset. I’ll use the Iris dataset and evaluate the model’s accuracy.
def accuracy(y_test, predictions):
    return np.sum(y_test == predictions) / len(y_test)

if __name__ == '__main__':
    iris = datasets.load_iris()
    X, y = iris.data, iris.target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = KNN(k=5)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    acc = accuracy(y_test, predictions)
    print(f"Accuracy: {acc}")
The model achieves high accuracy on the Iris dataset, correctly identifying most of the held-out flower samples from their four measurements. Iris is a small, well-behaved dataset, so treat this as a sanity check rather than a rigorous benchmark.
One important consideration with KNN is choosing the right value of K. Too small a K makes the model sensitive to noise in the training data; too large a K over-smooths the decision boundary and lets distant, less relevant points influence the prediction. For binary classification, K is typically chosen odd so that majority votes cannot tie.
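To see the effect in practice, you can reuse the KNN class and accuracy function from above and sweep a few values of K on the same train/test split. A rough sketch (the K values are arbitrary, and a proper comparison would use cross-validation rather than a single split):

for k in (1, 3, 5, 7, 9, 11):
    model = KNN(k=k)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(f"k={k}: accuracy={accuracy(y_test, predictions):.3f}")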
That’s all for this post. Thanks for reading!