k-Nearest Neighbors (k-NN) Algorithm in Machine Learning
1. Introduction to k-Nearest Neighbors (k-NN)
k-Nearest Neighbors (k-NN) is a supervised learning algorithm used for classification and regression tasks. It is a non-parametric and instance-based learning algorithm, meaning it does not assume any underlying data distribution and makes predictions based on the similarity between new and existing data points.
📌 Why Use k-NN?
✅ Simple and easy to implement
✅ No training phase (lazy learning algorithm)
✅ Effective for small datasets
✅ Works well for both classification and regression
✅ Non-parametric (makes no assumptions about data distribution)
📌 Real-World Applications of k-NN
✅ Medical Diagnosis (Predicting diseases based on symptoms)
✅ Recommendation Systems (Suggesting movies, music, or books based on user preferences)
✅ Handwriting Recognition (Digit classification in OCR systems)
✅ Credit Risk Assessment (Identifying potential loan defaulters)
✅ Image Classification (Classifying images in object detection)
2. How Does k-NN Work?
📌 The Main Idea of k-NN
The k-NN algorithm classifies a data point based on the majority class of its k-nearest neighbors. For regression tasks, it predicts the value as the average (or distance-weighted average) of its k-nearest neighbors.
📌 Steps of the k-NN Algorithm (a minimal code sketch follows the list)
1️⃣ Choose the number of neighbors (k).
2️⃣ Calculate the distance between the new data point and all points in the training set.
3️⃣ Select the k-nearest neighbors (smallest distances).
4️⃣ For classification: assign the most common class among the neighbors.
5️⃣ For regression: compute the average value of the neighbors.
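To make these steps concrete, here is a minimal from-scratch sketch in Python. The function names and toy data are illustrative assumptions, not part of the article; a real project would normally use an optimized library such as Scikit-Learn (see Section 7).

```python
# Minimal from-scratch sketch of k-NN (illustrative names and toy data)
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Steps 2-4: compute distances, pick k nearest neighbors, majority vote."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distance
    nearest = np.argsort(distances)[:k]                        # indices of k smallest distances
    return Counter(y_train[nearest]).most_common(1)[0][0]      # most common class label

def knn_regress(X_train, y_train, x_new, k=3):
    """Step 5 (regression): average of the k nearest neighbors' target values."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

X_train = np.array([[1, 2], [2, 3], [3, 4], [6, 7], [7, 8], [8, 9]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 1.5, 2.0, 6.0, 6.5, 7.0])

print(knn_classify(X_train, y_class, np.array([2, 2]), k=3))  # predicts class 0
print(knn_regress(X_train, y_value, np.array([7, 7]), k=3))   # mean of 6.0, 6.5, 7.0 -> 6.5
```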
3. Distance Metrics Used in k-NN
To find the nearest neighbors, we need to measure the distance between data points. Common distance metrics include:
📌 1️⃣ Euclidean Distance (Most Common)
The most widely used distance metric in k-NN. It calculates the straight-line distance between two points.
📌 Formula: d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
📌 2️⃣ Manhattan Distance
Measures the sum of absolute differences between points.
📌 Formula: d(A, B) = |x_2 - x_1| + |y_2 - y_1|
📌 3️⃣ Minkowski Distance
A generalized form of the Euclidean and Manhattan distances.
📌 Formula: d(A, B) = \left( \sum_i |x_i - y_i|^p \right)^{1/p}
When p = 1, it is Manhattan Distance.
When p = 2, it is Euclidean Distance.
📌 Choosing the right distance metric is crucial for model performance!
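As a quick sanity check of the formulas above, the following sketch computes each metric with NumPy. The helper names and sample points are illustrative assumptions; SciPy and Scikit-Learn provide equivalent built-in metrics.

```python
# Illustrative implementations of the three distance metrics
import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # Sum of absolute coordinate differences
    return np.sum(np.abs(a - b))

def minkowski(a, b, p=2):
    # Generalized form: p=1 gives Manhattan, p=2 gives Euclidean
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b))       # 5.0
print(manhattan(a, b))       # 7.0
print(minkowski(a, b, p=2))  # 5.0 (matches Euclidean)
```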
4. Choosing the Right Value of k
🔹 Too small (e.g., k = 1) → Overfits the data; sensitive to noise.
🔹 Too large (e.g., k = N) → Underfits the data; generalizes too much.
🔹 Optimal k → Typically found using cross-validation (see the sketch at the end of this section).
📌 Common approach: choose k as an odd number (for binary classification) to avoid tied votes.
Example of Choosing k
- If k = 3 and the neighbors belong to two classes, the class with the majority (2 out of 3) is assigned.
- If k = 10, a larger number of neighbors are considered, leading to a smoother decision boundary.
📌 A good rule of thumb: Choose k ≈ sqrt(N) (where N = number of training samples).
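One practical way to find the optimal k is to score a range of candidate values with k-fold cross-validation and keep the best one. The sketch below assumes Scikit-Learn is available and uses the Iris dataset purely as a convenient example.

```python
# Choosing k via cross-validation (illustrative sketch)
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate odd values of k and keep the one with the best mean CV accuracy
scores = {}
for k in range(1, 22, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"Best k: {best_k} (CV accuracy {scores[best_k]:.3f})")
```

Plotting the mean scores against k produces a validation curve that makes the overfitting/underfitting trade-off easy to see.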
5. Advantages & Disadvantages of k-NN
✅ Advantages
✅ Simple and easy to understand
✅ No training phase (fast model deployment)
✅ Works well with multi-class classification
✅ Can be used for both classification and regression
❌ Disadvantages
❌ Computationally expensive for large datasets
❌ Sensitive to irrelevant or redundant features
❌ Performance depends on choosing the right distance metric and k-value
6. Handling High-Dimensional Data in k-NN
🔹 Curse of Dimensionality: k-NN performs poorly when there are many irrelevant features, because distances between points become less informative as the number of dimensions grows.
🔹 Feature Scaling (normalization or standardization) is crucial for accurate distance measurement.
📌 Feature Scaling Techniques
- Min-Max Scaling (rescales values to the range [0, 1])
- Standardization (Z-score normalization: zero mean, unit variance)
📌 Scaling ensures that no single feature dominates the distance calculation!
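Here is a minimal sketch of both scaling techniques using Scikit-Learn's preprocessing module; the toy feature matrix is an illustrative assumption.

```python
# Min-Max scaling vs. standardization (illustrative toy data)
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-Max Scaling: each column rescaled to the range [0, 1]
print(MinMaxScaler().fit_transform(X))

# Standardization: each column transformed to zero mean and unit variance
print(StandardScaler().fit_transform(X))
```

In a real pipeline, the scaler is fit on the training split only and then applied to the test split, exactly as done in Step 4 of the implementation below.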
7. Implementing k-NN in Python (Sklearn)
Let’s build a k-NN Classifier using the Scikit-Learn library.
📌 Step 1: Import Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
📌 Step 2: Load Data
# Sample Dataset
data = {'Feature1': [1, 2, 3, 4, 5, 6, 7, 8],
        'Feature2': [2, 3, 4, 5, 6, 7, 8, 9],
        'Class': [0, 0, 0, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)
# Features & Target
X = df[['Feature1', 'Feature2']]
y = df['Class']
📌 Step 3: Split Data into Training & Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
📌 Step 4: Apply Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
📌 Step 5: Train a k-NN Model
# Initialize k-NN Model with k=3
knn_model = KNeighborsClassifier(n_neighbors=3)
# Fit the model (k-NN is a lazy learner: fit simply stores the training data)
knn_model.fit(X_train, y_train)
📌 Step 6: Make Predictions & Evaluate
# Predict on test data
y_pred = knn_model.predict(X_test)
# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(report)
📌 Tuning k and using cross-validation improves model performance!
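For example, a grid search with cross-validation can tune k and the distance metric together. The sketch below uses a synthetic dataset and wraps scaling and the classifier in a pipeline so that scaling is re-fit inside each fold; the dataset, grid values, and names here are illustrative assumptions, not from the example above.

```python
# Tuning k and the distance metric with GridSearchCV (illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pipeline keeps scaling inside cross-validation so validation folds stay unseen
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {
    'kneighborsclassifier__n_neighbors': [1, 3, 5, 7, 9, 11],
    'kneighborsclassifier__metric': ['euclidean', 'manhattan'],
}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print('Best parameters:', grid.best_params_)
print('Test accuracy:', grid.score(X_test, y_test))
```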
8. k-NN vs Other Classification Algorithms
| Feature | k-NN | Decision Tree | SVM | Logistic Regression |
|---|---|---|---|---|
| Training Time | Fast | Fast | Slow | Fast |
| Prediction Time | Slow | Fast | Fast | Fast |
| Works with Large Datasets | ❌ | ✅ | ❌ | ✅ |
| Handles Non-Linear Data | ✅ | ✅ | ✅ | ❌ |
📌 k-NN is best suited for small datasets with clear patterns!
9. Summary
✅ k-NN is a simple and powerful supervised learning algorithm.
✅ It classifies a new data point based on the majority vote of its k-nearest neighbors.
✅ Choosing the right k is crucial for model performance.
✅ Feature scaling is necessary to improve distance measurement.
✅ k-NN works best for small datasets but struggles with large, high-dimensional data.
Mastering k-NN is essential for solving real-world classification problems!