k-Nearest Neighbors (k-NN) Algorithm in Machine Learning
1. Introduction to k-Nearest Neighbors (k-NN)
k-Nearest Neighbors (k-NN) is a supervised learning algorithm used for classification and regression tasks. It is a non-parametric and instance-based learning algorithm, meaning it does not assume any underlying data distribution and makes predictions based on the similarity between new and existing data points.
📌 Why Use k-NN?
✅ Simple and easy to implement
✅ No training phase (lazy learning algorithm)
✅ Effective for small datasets
✅ Works well for both classification and regression
✅ Non-parametric (makes no assumptions about data distribution)
📌 Real-World Applications of k-NN
✅ Medical Diagnosis (Predicting diseases based on symptoms)
✅ Recommendation Systems (Suggesting movies, music, or books based on user preferences)
✅ Handwriting Recognition (Digit classification in OCR systems)
✅ Credit Risk Assessment (Identifying potential loan defaulters)
✅ Image Classification (Classifying images in object detection)
2. How Does k-NN Work?
📌 The Main Idea of k-NN
The k-NN algorithm classifies a data point based on the majority class of its k-nearest neighbors. For regression tasks, it predicts the value as the average (or distance-weighted average) of its k-nearest neighbors.
📌 Steps of the k-NN Algorithm (a minimal code sketch follows the list)
1️⃣ Choose the number of neighbors (k).
2️⃣ Calculate the distance between the new data point and all points in the training set.
3️⃣ Select the k-nearest neighbors (smallest distances).
4️⃣ For classification: assign the most common class among the neighbors.
5️⃣ For regression: compute the average value of the neighbors.
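To make these steps concrete, here is a minimal from-scratch sketch in Python. The function names and toy data are illustrative assumptions, not part of the article; a real project would normally use an optimized library such as Scikit-Learn (see Section 7).

```python
# Minimal from-scratch sketch of k-NN (illustrative names and toy data)
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Steps 2-4: compute distances, pick k nearest neighbors, majority vote."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distance
    nearest = np.argsort(distances)[:k]                        # indices of k smallest distances
    return Counter(y_train[nearest]).most_common(1)[0][0]      # most common class label

def knn_regress(X_train, y_train, x_new, k=3):
    """Step 5 (regression): average of the k nearest neighbors' target values."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

X_train = np.array([[1, 2], [2, 3], [3, 4], [6, 7], [7, 8], [8, 9]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 1.5, 2.0, 6.0, 6.5, 7.0])

print(knn_classify(X_train, y_class, np.array([2, 2]), k=3))  # predicts class 0
print(knn_regress(X_train, y_value, np.array([7, 7]), k=3))   # mean of 6.0, 6.5, 7.0 -> 6.5
```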
3. Distance Metrics Used in k-NN
To find the nearest neighbors, we need to measure the distance between data points. Common distance metrics include:
📌 1️⃣ Euclidean Distance (Most Common)
The most widely used distance metric in k-NN. It calculates the straight-line distance between two points.
📌 Formula: d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
📌 2️⃣ Manhattan Distance
Measures the sum of absolute differences between points.
📌 Formula: d(A, B) = |x_2 - x_1| + |y_2 - y_1|
📌 3️⃣ Minkowski Distance
A generalized form of the Euclidean and Manhattan distances.
📌 Formula: d(A, B) = \left( \sum_i |x_i - y_i|^p \right)^{1/p}
When p = 1, it is Manhattan Distance.
When p = 2, it is Euclidean Distance.
📌 Choosing the right distance metric is crucial for model performance!
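As a quick sanity check of the formulas above, the following sketch computes each metric with NumPy. The helper names and sample points are illustrative assumptions; SciPy and Scikit-Learn provide equivalent built-in metrics.

```python
# Illustrative implementations of the three distance metrics
import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # Sum of absolute coordinate differences
    return np.sum(np.abs(a - b))

def minkowski(a, b, p=2):
    # Generalized form: p=1 gives Manhattan, p=2 gives Euclidean
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b))       # 5.0
print(manhattan(a, b))       # 7.0
print(minkowski(a, b, p=2))  # 5.0 (matches Euclidean)
```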
4. Choosing the Right Value of k
🔹 Too small (e.g., k = 1) → Overfits the data; sensitive to noise.
🔹 Too large (e.g., k = N) → Underfits the data; generalizes too much.
🔹 Optimal k → Typically found using cross-validation (see the sketch at the end of this section).
📌 Common approach: choose k as an odd number (for binary classification) to avoid tied votes.
Example of Choosing k
- If k = 3 and the neighbors belong to two classes, the class with the majority (2 out of 3) is assigned.
- If k = 10, a larger number of neighbors are considered, leading to a smoother decision boundary.
📌 A good rule of thumb: Choose k ≈ sqrt(N) (where N = number of training samples).
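One practical way to find the optimal k is to score a range of candidate values with k-fold cross-validation and keep the best one. The sketch below assumes Scikit-Learn is available and uses the Iris dataset purely as a convenient example.

```python
# Choosing k via cross-validation (illustrative sketch)
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate odd values of k and keep the one with the best mean CV accuracy
scores = {}
for k in range(1, 22, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"Best k: {best_k} (CV accuracy {scores[best_k]:.3f})")
```

Plotting the mean scores against k produces a validation curve that makes the overfitting/underfitting trade-off easy to see.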
5. Advantages & Disadvantages of k-NN
✅ Advantages
✅ Simple and easy to understand
✅ No training phase (fast model deployment)
✅ Works well with multi-class classification
✅ Can be used for both classification and regression
❌ Disadvantages
❌ Computationally expensive for large datasets
❌ Sensitive to irrelevant or redundant features
❌ Performance depends on choosing the right distance metric and k-value
6. Handling High-Dimensional Data in k-NN
🔹 Curse of Dimensionality: k-NN performs poorly when there are many irrelevant features, because distances between points become less informative as the number of dimensions grows.
🔹 Feature Scaling (normalization or standardization) is crucial for accurate distance measurement.
📌 Feature Scaling Techniques
- Min-Max Scaling (rescales values to the range [0, 1])
- Standardization (Z-score normalization: zero mean, unit variance)
📌 Scaling ensures that no single feature dominates the distance calculation!
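Here is a minimal sketch of both scaling techniques using Scikit-Learn's preprocessing module; the toy feature matrix is an illustrative assumption.

```python
# Min-Max scaling vs. standardization (illustrative toy data)
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-Max Scaling: each column rescaled to the range [0, 1]
print(MinMaxScaler().fit_transform(X))

# Standardization: each column transformed to zero mean and unit variance
print(StandardScaler().fit_transform(X))
```

In a real pipeline, the scaler is fit on the training split only and then applied to the test split, exactly as done in Step 4 of the implementation below.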
7. Implementing k-NN in Python (Sklearn)
Let’s build a k-NN Classifier using the Scikit-Learn library.
📌 Step 1: Import Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
📌 Step 2: Load Data
# Sample Dataset
data = {'Feature1': [1, 2, 3, 4, 5, 6, 7, 8],
        'Feature2': [2, 3, 4, 5, 6, 7, 8, 9],
        'Class': [0, 0, 0, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)
# Features & Target
X = df[['Feature1', 'Feature2']]
y = df['Class']
📌 Step 3: Split Data into Training & Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
📌 Step 4: Apply Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
📌 Step 5: Train a k-NN Model
# Initialize k-NN Model with k=3
knn_model = KNeighborsClassifier(n_neighbors=3)
# Fit the model (k-NN is a lazy learner: fit simply stores the training data)
knn_model.fit(X_train, y_train)
📌 Step 6: Make Predictions & Evaluate
# Predict on test data
y_pred = knn_model.predict(X_test)
# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(report)
📌 Tuning k and using cross-validation improves model performance!
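For example, a grid search with cross-validation can tune k and the distance metric together. The sketch below uses a synthetic dataset and wraps scaling and the classifier in a pipeline so that scaling is re-fit inside each fold; the dataset, grid values, and names here are illustrative assumptions, not from the example above.

```python
# Tuning k and the distance metric with GridSearchCV (illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pipeline keeps scaling inside cross-validation so validation folds stay unseen
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {
    'kneighborsclassifier__n_neighbors': [1, 3, 5, 7, 9, 11],
    'kneighborsclassifier__metric': ['euclidean', 'manhattan'],
}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print('Best parameters:', grid.best_params_)
print('Test accuracy:', grid.score(X_test, y_test))
```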
8. k-NN vs Other Classification Algorithms
| Feature | k-NN | Decision Tree | SVM | Logistic Regression |
|---|---|---|---|---|
| Training Time | Fast | Fast | Slow | Fast |
| Prediction Time | Slow | Fast | Fast | Fast |
| Works with Large Datasets | ❌ | ✅ | ❌ | ✅ |
| Handles Non-Linear Data | ✅ | ✅ | ✅ | ❌ |
📌 k-NN is best suited for small datasets with clear patterns!
9. Summary
✅ k-NN is a simple and powerful supervised learning algorithm.
✅ It classifies a new data point based on the majority vote of its k-nearest neighbors.
✅ Choosing the right k is crucial for model performance.
✅ Feature scaling is necessary to improve distance measurement.
✅ k-NN works best for small datasets but struggles with large, high-dimensional data.
Mastering k-NN is essential for solving real-world classification problems!