Object Detection with YOLO (You Only Look Once) – Detailed Explanation
Introduction to Object Detection
Object detection is a computer vision task that identifies and locates objects within images or videos. Unlike image classification, which only assigns a class label to an image, object detection provides both classification and localization of multiple objects in a single frame.
Traditional Object Detection Methods
Before YOLO, object detection relied on region-based methods such as:
- R-CNN (Region-Based Convolutional Neural Networks) – Generates region proposals and classifies them separately.
- Fast R-CNN – Improves R-CNN by sharing convolutional layers to speed up inference.
- Faster R-CNN – Introduces Region Proposal Networks (RPNs) to make detection faster.
While these methods were accurate, they were computationally expensive and slow for real-time applications.
Introduction to YOLO (You Only Look Once)
YOLO is a real-time object detection algorithm that significantly improves speed without sacrificing much accuracy. Unlike region-based methods, which run a classifier over many candidate regions of the same image, YOLO processes the whole image once and predicts bounding boxes and class probabilities in a single forward pass.
Key Features of YOLO:
- Single forward pass through the network for detection.
- Grid-based prediction instead of region proposals.
- Fast inference speed, making it ideal for real-time applications.
- End-to-end training rather than separate classification and localization steps.
How YOLO Works – Step by Step
1. Input Image Processing
The input image is resized to a fixed dimension (e.g., 448×448 pixels in YOLOv1; later versions commonly use 416×416) and passed through a CNN. Conceptually, the image is divided into a grid of S×S cells (e.g., 7×7 in YOLOv1).
2. Grid-Based Detection
Each cell in the grid is responsible for detecting objects that have their center within that cell. The network predicts:
- Bounding boxes (x, y, width, height)
- Confidence scores (likelihood of object presence)
- Class probabilities (which object category)
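As a minimal sketch of the "responsible cell" rule above, the function below maps a box center in pixels to its grid cell. The function name `responsible_cell` and the 448×448 example image are illustrative assumptions, not part of any YOLO library.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the S x S grid cell that contains the point (cx, cy).

    cx, cy are pixel coordinates of an object's center (hypothetical inputs).
    """
    # Scale the center into grid units, then truncate to a cell index.
    # min(...) keeps a center lying exactly on the right/bottom edge inside the grid.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


# An object centered at pixel (300, 100) in a 448x448 image with a 7x7 grid
print(responsible_cell(300, 100, 448, 448))  # (1, 4)
```

Only the cell returned here would be asked to predict that object during training; all other cells treat it as background.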
A typical YOLO output tensor has shape S × S × (B × 5 + C)
where:
- S × S = Grid size (e.g., 7×7)
- B = Number of bounding boxes per grid cell (e.g., 2)
- 5 = (x, y, width, height, confidence score)
- C = Number of object classes (e.g., 20 for PASCAL VOC, 80 for the COCO dataset)
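The arithmetic behind the tensor shape can be checked with a few lines of Python. This is only a sketch of the formula above; the helper name `yolo_output_shape` is made up for illustration.

```python
def yolo_output_shape(S, B, C):
    """Shape of a YOLOv1-style output tensor: S x S cells, each with B boxes and C class scores."""
    return (S, S, B * 5 + C)


def num_values(shape):
    """Total number of predicted values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


voc_shape = yolo_output_shape(S=7, B=2, C=20)   # PASCAL VOC setting
coco_shape = yolo_output_shape(S=7, B=2, C=80)  # same grid with 80 COCO classes

print(voc_shape, num_values(voc_shape))    # (7, 7, 30) 1470
print(coco_shape, num_values(coco_shape))  # (7, 7, 90) 4410
```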
3. Bounding Box Prediction
Each bounding box consists of:
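- (x, y) – Coordinates of the box center, expressed in YOLOv1 as offsets within the responsible grid cell.
- (w, h) – Width and height relative to the image size.
- Confidence score – How likely the box is to contain an object and how well the box fits it (in YOLOv1, the object probability multiplied by the IoU between the predicted and ground-truth boxes).
Boxes whose confidence falls below a chosen threshold (for example, 0.5) are discarded; the rest are retained.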
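To make the box parameterization concrete, here is a small sketch that converts one predicted box into pixel corner coordinates and then applies a confidence threshold. It assumes the YOLOv1 convention (x and y are offsets in [0, 1] within the grid cell, w and h are fractions of the image size); the names `decode_box` and `keep_confident` and the 0.5 threshold are illustrative choices.

```python
def decode_box(pred, row, col, S, img_w, img_h):
    """Convert a cell-relative prediction (x, y, w, h, conf) to (x1, y1, x2, y2, conf) in pixels."""
    x, y, w, h, conf = pred
    # Center: cell index plus the offset inside the cell, rescaled to pixels
    cx = (col + x) * img_w / S
    cy = (row + y) * img_h / S
    # Size: fractions of the full image
    bw = w * img_w
    bh = h * img_h
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2, conf)


def keep_confident(boxes, threshold=0.5):
    """Retain only boxes whose confidence (last element) reaches the threshold."""
    return [b for b in boxes if b[4] >= threshold]


# Two hypothetical predictions from cell (row=1, col=4) of a 7x7 grid on a 448x448 image
strong = decode_box((0.5, 0.5, 0.25, 0.25, 0.9), row=1, col=4, S=7, img_w=448, img_h=448)
weak = decode_box((0.2, 0.8, 0.10, 0.10, 0.1), row=1, col=4, S=7, img_w=448, img_h=448)

print(strong)                              # (232.0, 40.0, 344.0, 152.0, 0.9)
print(len(keep_confident([strong, weak])))  # 1
```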
4. Class Prediction
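Each grid cell outputs a probability distribution over the C classes, conditioned on the cell containing an object. Multiplying these probabilities by a box's confidence score gives class-specific scores, and the highest-scoring class is assigned to the detected object.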
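The class assignment can be sketched as follows, assuming the YOLOv1-style scoring in which each class probability is multiplied by the box confidence. The helper names and the three-class example are hypothetical.

```python
def class_scores(box_confidence, class_probs):
    """Class-specific scores: box confidence times each conditional class probability."""
    return [box_confidence * p for p in class_probs]


def best_class(box_confidence, class_probs, names):
    """Return (class name, score) for the highest-scoring class."""
    scores = class_scores(box_confidence, class_probs)
    idx = max(range(len(scores)), key=lambda i: scores[i])
    return names[idx], scores[idx]


names = ["cat", "dog", "bird"]  # illustrative class list
label, score = best_class(0.8, [0.1, 0.7, 0.2], names)
print(label, round(score, 2))  # dog 0.56
```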
5. Non-Maximum Suppression (NMS)
YOLO often produces several overlapping bounding boxes for the same object. NMS removes the redundant detections by:
- Keeping the bounding box with the highest confidence score.
- Removing boxes that have a high IoU (Intersection over Union) with the selected box.
- Repeating both steps on the remaining boxes until none are left.
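A plain-Python sketch of IoU and greedy NMS is shown below. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates and uses an IoU threshold of 0.5; both choices are illustrative, and production code would typically rely on a library routine such as OpenCV's `cv2.dnn.NMSBoxes`.

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The first two boxes overlap with an IoU of about 0.68, so the lower-scoring one is suppressed, while the third box is far away and survives.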
6. Final Object Detection Output
After filtering, the final output contains:
- Detected objects with bounding box coordinates.
- Confidence scores indicating detection certainty.
- Class labels for detected objects.
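One possible way to represent this final output in code is a small record per detection. The `Detection` class and the sample values below are purely illustrative and do not come from any YOLO implementation.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str         # class label, e.g. "person"
    confidence: float  # detection certainty in [0, 1]
    box: tuple         # (x, y, width, height) in pixels


# Made-up example detections
detections = [
    Detection("dog", 0.56, (232, 40, 112, 112)),
    Detection("person", 0.91, (30, 20, 80, 200)),
]

# Present the most certain detections first
by_confidence = sorted(detections, key=lambda d: d.confidence, reverse=True)
for det in by_confidence:
    print(f"{det.label}: {det.confidence:.2f} at {det.box}")
```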
Versions of YOLO
YOLOv1 (2016)
- Introduced the grid-based approach.
- Fast but struggled with small objects.
- High localization error due to limitations in bounding box regression.
YOLOv2 (YOLO9000)
- Improved accuracy using anchor boxes.
- Introduced batch normalization and multi-scale training.
- The jointly trained YOLO9000 variant, which combines detection and classification data, can detect more than 9000 object categories.
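The anchor-box idea can be sketched with the decoding equations used in YOLOv2: the network predicts raw offsets (tx, ty, tw, th), and the final box is b_x = sigmoid(tx) + c_x, b_y = sigmoid(ty) + c_y, b_w = p_w * exp(tw), b_h = p_h * exp(th), where (c_x, c_y) is the cell's top-left corner and (p_w, p_h) is the anchor size, all in grid units. The function name and example numbers below are assumptions made for illustration.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_anchor_box(t, cell, anchor):
    """Decode raw outputs t = (tx, ty, tw, th) relative to a grid cell and an anchor box.

    cell = (cx, cy) is the cell's top-left corner; anchor = (pw, ph) is the prior size.
    All values are in grid units.
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor
    bx = sigmoid(tx) + cx   # the sigmoid keeps the center inside the cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)  # the anchor size is rescaled, never negative
    bh = ph * math.exp(th)
    return bx, by, bw, bh


# All-zero raw outputs give a box centered in the cell with exactly the anchor's size
print(decode_anchor_box((0, 0, 0, 0), cell=(3, 2), anchor=(1.5, 2.0)))  # (3.5, 2.5, 1.5, 2.0)
```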
YOLOv3
- Uses a deeper network (Darknet-53) for better feature extraction.
- Predicts bounding boxes at three different scales for better detection of small objects.
- Achieves a balance between speed and accuracy.
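To illustrate the three prediction scales, the sketch below computes the grid sizes for a 416×416 input, assuming the strides of 32, 16, and 8 and the three anchor boxes per scale used in the standard YOLOv3 configuration. The function names are hypothetical.

```python
def yolov3_grid_sizes(input_size=416, strides=(32, 16, 8)):
    """Grid side length at each detection scale (coarse to fine)."""
    return [input_size // s for s in strides]


def total_predictions(input_size=416, strides=(32, 16, 8), anchors_per_scale=3):
    """Total number of candidate boxes predicted across all scales."""
    return sum(anchors_per_scale * g * g for g in yolov3_grid_sizes(input_size, strides))


print(yolov3_grid_sizes())   # [13, 26, 52]
print(total_predictions())   # 10647
```

The finest grid (52×52 here) has the smallest cells, which is what helps with small objects.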
YOLOv4
- Introduced CSPDarknet53 for better performance.
- Added Mish activation function and self-adversarial training.
- Reported higher accuracy on MS COCO than YOLOv3 at comparable real-time speeds.
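For reference, the Mish activation mentioned above is defined as mish(x) = x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x). The scalar version below is a minimal sketch for illustration; deep learning frameworks provide their own vectorized implementations.

```python
import math


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for value in (-2.0, 0.0, 1.0, 5.0):
    print(value, round(mish(value), 4))
```

Unlike ReLU, Mish is smooth and lets small negative values pass through, while behaving almost like the identity for large positive inputs.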
YOLOv5 (Unofficial)
- Implemented in PyTorch (previous YOLOs were in Darknet).
- Optimized for ease of use and training.
- Supports efficient model deployment on edge devices.
YOLOv6, YOLOv7, and YOLOv8 (Latest)
- YOLOv7 focuses on the speed-accuracy tradeoff.
- YOLOv8 introduces architectural improvements, including an anchor-free detection head, and its reference implementation also supports real-time object tracking.
Implementing YOLO in Python
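To use YOLO for object detection, we can leverage OpenCV's dnn module with a pre-trained YOLO model. The example below uses YOLOv3 and expects three files to be downloaded beforehand: the weights (yolov3.weights), the network configuration (yolov3.cfg), and the COCO class list (coco.names).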
Step 1: Install Dependencies
pip install opencv-python numpy
Step 2: Load Pre-Trained YOLO Model
import cv2
import numpy as np
# Load the YOLOv3 network from the Darknet weights and config files
net = cv2.dnn.readNet("yolov3.weights", "yolov3.cfg")
# Load class labels (one name per line)
with open("coco.names", "r") as f:
    classes = [line.strip() for line in f]
# Get the names of the output (detection) layers; this call avoids the
# index-shape differences of getUnconnectedOutLayers() across OpenCV versions
output_layers = net.getUnconnectedOutLayersNames()
Step 3: Process an Image
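# Read input image
image = cv2.imread("image.jpg")
if image is None:
    raise FileNotFoundError("Could not read image.jpg")
height, width = image.shape[:2]
# Convert image to blob (preprocessing): scale pixels by 1/255,
# resize to 416x416, and swap BGR to RGB
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
# Forward pass through the detection layers
outputs = net.forward(output_layers)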
Step 4: Draw Bounding Boxes
# Process outputs
boxes = []
confidences = []
class_ids = []
for output in outputs:
    for detection in output:
        # Each row is [center_x, center_y, w, h, objectness, class scores...]
        scores = detection[5:]
        class_id = int(np.argmax(scores))
        confidence = float(scores[class_id])
        if confidence > 0.5:
            # Box values are normalized; scale them to pixel units
            center_x, center_y, w, h = detection[:4] * np.array([width, height, width, height])
            # Convert from center format to top-left corner format
            x = int(center_x - w / 2)
            y = int(center_y - h / 2)
            boxes.append([x, y, int(w), int(h)])
            confidences.append(confidence)
            class_ids.append(class_id)
# Apply Non-Maximum Suppression (score threshold 0.5, IoU threshold 0.4)
indices = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
# Draw boxes; np.array(...).flatten() also copes with an empty result
# when nothing was detected
for i in np.array(indices).flatten():
    x, y, w, h = boxes[i]
    label = f"{classes[class_ids[i]]}: {confidences[i]:.2f}"
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.putText(image, label, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
cv2.imshow("YOLO Object Detection", image)
cv2.waitKey(0)
cv2.destroyAllWindows()
Applications of YOLO
- Autonomous Vehicles – Detect pedestrians, vehicles, and traffic signals.
- Surveillance & Security – Identify intruders in real-time.
- Retail & Inventory Management – Detect objects on store shelves.
- Medical Imaging – Detect abnormalities in X-rays, MRIs.
- Agriculture – Identify plant diseases and crop monitoring.