k-Nearest Neighbors (k-NN) Algorithm in Machine Learning

1. Introduction to k-Nearest Neighbors (k-NN)

k-Nearest Neighbors (k-NN) is a supervised learning algorithm used for classification and regression tasks. It is a non-parametric and instance-based learning algorithm, meaning it does not assume any underlying data distribution and makes predictions based on the similarity between new and existing data points.

πŸ“Œ Why Use k-NN?

βœ” Simple and easy to implement
βœ” No training phase (lazy learning algorithm)
βœ” Effective for small datasets
βœ” Works well for both classification and regression
βœ” Non-parametric (makes no assumptions about data distribution)

πŸ“Œ Real-World Applications of k-NN

βœ… Medical Diagnosis (Predicting diseases based on symptoms)
βœ… Recommendation Systems (Suggesting movies, music, or books based on user preferences)
βœ… Handwriting Recognition (Digit classification in OCR systems)
βœ… Credit Risk Assessment (Identifying potential loan defaulters)
βœ… Image Classification (Classifying images in object detection)


2. How Does k-NN Work?

🌟 The Main Idea of k-NN

The k-NN algorithm classifies a data point based on the majority class of its k-nearest neighbors. For regression tasks, it predicts the value based on the average (or weighted average) of its k-nearest neighbors.

πŸ“Œ Steps of k-NN Algorithm

1️⃣ Choose the number of neighbors (k).
2️⃣ Calculate the distance between the new data point and all other points in the dataset.
3️⃣ Select the k-nearest neighbors (smallest distances).
4️⃣ For classification: Assign the most common class among the neighbors.
5️⃣ For regression: Compute the average value of the neighbors.
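
The sketch below walks through these steps from scratch with NumPy. The helper knn_predict and the toy arrays are illustrative only (not part of any library); an optimized implementation from Scikit-Learn is shown in Section 7.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote of its k nearest neighbors."""
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the neighbors' labels
    #         (for regression, this would instead be y_train[nearest].mean())
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Illustrative toy data
X_train = np.array([[1, 2], [2, 3], [3, 4], [6, 7], [7, 8], [8, 9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # expected: 0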


3. Distance Metrics Used in k-NN

To find the nearest neighbors, we need to measure the distance between data points. Common distance metrics include:

πŸ“Œ 1️⃣ Euclidean Distance (Most Common)

The most widely used distance metric in k-NN. It calculates the straight-line distance between two points.
πŸ“Œ Formula: d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}

πŸ“Œ 2️⃣ Manhattan Distance

Measures the sum of absolute differences between points.
πŸ“Œ Formula: d(A, B) = |x_2 - x_1| + |y_2 - y_1|

πŸ“Œ 3️⃣ Minkowski Distance

A generalized form of Euclidean and Manhattan distances.
πŸ“Œ Formula: d(A, B) = \left( \sum |x_i - y_i|^p \right)^{1/p}

When p = 1, it is Manhattan Distance.
When p = 2, it is Euclidean Distance.

πŸ“Œ Choosing the right distance metric is crucial for model performance!
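
As a quick sketch, the three metrics can be computed and compared directly with SciPy's distance functions (the two points here are made up for illustration):

from scipy.spatial import distance

a, b = [1, 2], [4, 6]

print(distance.euclidean(a, b))        # sqrt(3^2 + 4^2) = 5.0
print(distance.cityblock(a, b))        # |3| + |4| = 7 (Manhattan)
print(distance.minkowski(a, b, p=3))   # general Minkowski with p = 3
print(distance.minkowski(a, b, p=2))   # equals the Euclidean distance (5.0)
print(distance.minkowski(a, b, p=1))   # equals the Manhattan distance (7)

In Scikit-Learn's KNeighborsClassifier, the metric is controlled by the metric and p parameters; the default is Minkowski with p = 2, i.e. Euclidean distance.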


4. Choosing the Right Value of k

πŸ”Ή Too Small (e.g., k = 1) β†’ Overfits the data, sensitive to noise.
πŸ”Ή Too Large (e.g., k = N) β†’ Underfits the data, generalizes too much.
πŸ”Ή Optimal k β†’ Typically found using Cross-Validation.

πŸ“Œ Common approach: Choose k as an odd number to avoid ties in binary classification.

Example of Choosing k

  • If k = 3 and the neighbors belong to two classes, the class with the majority (2 out of 3) is assigned.
  • If k = 10, a larger number of neighbors are considered, leading to a smoother decision boundary.

πŸ“Œ A good rule of thumb: Choose k β‰ˆ sqrt(N) (where N = number of training samples).
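
A minimal sketch of the cross-validation approach: loop over odd values of k, score each with 5-fold cross-validation, and keep the best. The Iris dataset is used here purely as a stand-in for your own X and y.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate odd values of k with 5-fold cross-validation
scores = {}
for k in range(1, 22, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f'Best k: {best_k} (CV accuracy: {scores[best_k]:.3f})')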


5. Advantages & Disadvantages of k-NN

βœ… Advantages

βœ” Simple and easy to understand
βœ” No training phase (fast model deployment)
βœ” Works well with multi-class classification
βœ” Can be used for both classification and regression

❌ Disadvantages

❌ Computationally expensive for large datasets
❌ Sensitive to irrelevant or redundant features
❌ Performance depends on choosing the right distance metric and k-value


6. Handling High-Dimensional Data in k-NN

πŸ”Ή Curse of Dimensionality: in high-dimensional spaces, distances between points become less meaningful, so k-NN performs poorly when there are many irrelevant features.
πŸ”Ή Feature Scaling (Normalization or Standardization) is crucial for accurate distance measurement.

πŸ“Œ Feature Scaling Techniques

  • Min-Max Scaling (Rescales values between [0,1])
  • Standardization (Z-score Normalization)

πŸ“Œ Scaling ensures that no feature dominates the distance calculation!
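
A short sketch of both techniques on a made-up two-column feature matrix with very different ranges (in practice the scaler is fit on training data only, as in Section 7):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature in the thousands, one in single digits: the first would
# dominate any distance calculation if left unscaled.
X = np.array([[1000.0, 1.0],
              [2000.0, 3.0],
              [3000.0, 5.0]])

print(MinMaxScaler().fit_transform(X))    # each column rescaled to [0, 1]
print(StandardScaler().fit_transform(X))  # each column: mean 0, std 1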


7. Implementing k-NN in Python (Sklearn)

Let’s build a k-NN Classifier using the Scikit-Learn library.

πŸ“Œ Step 1: Import Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

πŸ“Œ Step 2: Load Data

# Sample Dataset
data = {'Feature1': [1, 2, 3, 4, 5, 6, 7, 8],
        'Feature2': [2, 3, 4, 5, 6, 7, 8, 9],
        'Class': [0, 0, 0, 1, 1, 1, 1, 1]}

df = pd.DataFrame(data)

# Features & Target
X = df[['Feature1', 'Feature2']]
y = df['Class']

πŸ“Œ Step 3: Split Data into Training & Testing Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

πŸ“Œ Step 4: Apply Feature Scaling

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

πŸ“Œ Step 5: Train a k-NN Model

# Initialize k-NN Model with k=3
knn_model = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn_model.fit(X_train, y_train)

πŸ“Œ Step 6: Make Predictions & Evaluate

# Predict on test data
y_pred = knn_model.predict(X_test)

# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(report)

πŸ“Œ Tuning k and using cross-validation improves model performance!
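
One common way to do this is a grid search over n_neighbors (and optionally the weighting scheme) with cross-validation. The sketch below is self-contained and uses the breast-cancer dataset purely as a stand-in for your own data; the eight-row toy dataset above is too small for a 5-fold search.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling happens inside the pipeline, so it is re-fit on each CV fold
pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

param_grid = {'knn__n_neighbors': [3, 5, 7, 9, 11],
              'knn__weights': ['uniform', 'distance']}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print('Best parameters:', grid.best_params_)
print(f'Test accuracy: {grid.score(X_test, y_test):.2f}')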


8. k-NN vs Other Classification Algorithms

Feature                      | k-NN | Decision Tree | SVM  | Logistic Regression
Training Time                | Fast | Fast          | Slow | Fast
Prediction Time              | Slow | Fast          | Fast | Fast
Works with Large Datasets    | ❌   | βœ…            | βœ…   | βœ…
Handles Non-Linear Data      | βœ…   | βœ…            | βœ…   | ❌

πŸ“Œ k-NN is best suited for small datasets with clear patterns!


9. Summary

βœ” k-NN is a simple and powerful supervised learning algorithm.
βœ” It classifies a new data point based on the majority vote of k-nearest neighbors.
βœ” Choosing the right k is crucial for model performance.
βœ” Feature scaling is necessary to improve distance measurement.
βœ” k-NN works best for small datasets but struggles with large, high-dimensional data.

Mastering k-NN is essential for solving real-world classification problems!
