Detection Algorithms Overview

Object detection is the foundational task for YOLO-Toys. This section covers the three major paradigms: anchor-based, anchor-free, and transformer-based detection.

The detection problem

Given an image, an object detector has to solve two subtasks at once:

Localization: Where are the objects? (bounding boxes)
Classification: What are the objects? (class labels)

The challenge is that the number of objects varies per image, and objects can appear at any location and scale.
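
To make the two subtasks concrete, here is a minimal sketch of what a detector's output could look like. The `Detection` class and its field names are hypothetical, chosen for illustration only; they are not a YOLO-Toys API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object (hypothetical container, for illustration)."""
    # Localization: corner coordinates in pixels, top-left and bottom-right.
    x1: float
    y1: float
    x2: float
    y2: float
    # Classification: the predicted class and its confidence in [0, 1].
    label: str
    score: float

    @property
    def area(self) -> float:
        # Clamp at zero so a degenerate box never reports negative area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


# The output is a variable-length list: zero objects is a valid result.
empty_street = []
busy_street = [
    Detection(10, 20, 110, 220, "person", 0.91),
    Detection(150, 40, 400, 260, "car", 0.84),
]
```

The variable length of this list is what separates detection from plain image classification, where the output shape is fixed.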

Three paradigms

Anchor-based detection

The dominant paradigm from the mid-2010s until about 2022. The pipeline has three steps:

Place anchor boxes (priors) at each spatial location
Predict offsets from anchors
Apply NMS to remove duplicates
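
The three steps above can be sketched in plain Python. This is an illustrative sketch, not the code of any particular model: the offset parameterization (center shift scaled by anchor size, log-scale width and height) follows the Faster R-CNN/SSD convention, and YOLO variants use different formulas. The function names `make_anchors`, `decode`, `iou`, and `nms` are assumptions made for this example.

```python
import math


def make_anchors(grid_w, grid_h, stride, sizes):
    """Step 1: place one anchor per (w, h) in `sizes` at every grid cell center."""
    return [
        ((gx + 0.5) * stride, (gy + 0.5) * stride, w, h)
        for gy in range(grid_h)
        for gx in range(grid_w)
        for (w, h) in sizes
    ]


def decode(anchor, offsets):
    """Step 2: turn predicted offsets (tx, ty, tw, th) into a corner box."""
    acx, acy, aw, ah = anchor
    tx, ty, tw, th = offsets
    cx, cy = acx + tx * aw, acy + ty * ah        # shift center, scaled by anchor size
    w, h = aw * math.exp(tw), ah * math.exp(th)  # rescale width and height
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Step 3: greedy NMS. Returns indices of kept boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two nearly identical anchors fire on the same object; a third sees another one.
anchors = [(50, 50, 40, 40), (52, 50, 40, 40), (150, 150, 40, 40)]
offsets = [(0, 0, 0, 0), (0, 0, 0, 0), (0.1, 0, 0, 0)]
scores = [0.9, 0.8, 0.7]
boxes = [decode(a, t) for a, t in zip(anchors, offsets)]
kept = nms(boxes, scores)
```

In this toy input the first two boxes overlap heavily, so NMS drops the lower-scoring one and `kept` holds the indices of the remaining two boxes.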

Representatives: YOLOv2–v5, Faster R-CNN, SSD, RetinaNet

Pros:

Well-understood optimization landscape
Good performance with proper anchor design

Cons:

Anchor design is hyperparameter-sensitive
Many duplicate predictions per object, which makes NMS mandatory
Struggles with unusual aspect ratios

Anchor-free detection

Treat detection as keypoint estimation:

Predict object centers directly
Regress size from center features
No anchor boxes needed
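
A minimal sketch of center-based decoding, loosely in the style of CenterNet: find local maxima in a center heatmap, then read the box size predicted at each peak. It assumes a single class, places each center at the middle of its heatmap cell, and skips the sub-pixel offset head that real models use. The names `find_peaks` and `decode_centers` and the `stride` value are hypothetical.

```python
def find_peaks(heatmap, threshold=0.3):
    """Return (row, col, score) for cells that reach `threshold` and are
    at least as high as all of their (up to 8) neighbours."""
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(score >= n for n in neighbours):
                peaks.append((r, c, score))
    return peaks


def decode_centers(heatmap, sizes, stride=4, threshold=0.3):
    """Turn heatmap peaks into (x1, y1, x2, y2, score) boxes.

    `sizes[r][c]` holds the (width, height) regressed at that cell, in pixels.
    """
    boxes = []
    for r, c, score in find_peaks(heatmap, threshold):
        w, h = sizes[r][c]
        # Map the cell back to image coordinates (cell-center assumption).
        cx, cy = (c + 0.5) * stride, (r + 0.5) * stride
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, score))
    return boxes


heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.1],
    [0.1, 0.2, 0.1, 0.6],
]
sizes = [[(8.0, 8.0)] * 4 for _ in range(3)]
boxes = decode_centers(heatmap, sizes)
```

Because each object produces one peak, the local-maximum test plays the role that NMS plays in anchor-based pipelines. Anchor-free models that regress from every location instead of from peaks, such as FCOS, still rely on NMS.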

Representatives: YOLOv8, CenterNet, FCOS

Pros:

No anchor hyperparameters
Fewer duplicate predictions
Better generalization to unusual objects

Cons:

Newer, so tuning recipes and tooling are less mature
Training can be less stable

Transformer-based detection

Treat detection as set prediction:

Learn object queries
Queries attend to image features
Direct set output, no NMS
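
The "no NMS" step can be sketched as follows. In DETR-style models each query emits class logits with one extra "no object" slot at the end, plus a box in normalized (cx, cy, w, h) form; post-processing keeps the queries whose best real class is confident enough. The function names, the threshold, and the toy logits below are assumptions made for illustration, not a library API.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(query_logits, query_boxes, image_w, image_h, threshold=0.5):
    """Convert per-query outputs into (class_id, score, (x1, y1, x2, y2)).

    The last logit of each query is the "no object" slot and is never
    reported as a class. No overlap-based suppression is applied.
    """
    detections = []
    for logits, (cx, cy, w, h) in zip(query_logits, query_boxes):
        class_probs = softmax(logits)[:-1]  # drop the "no object" slot
        best = max(range(len(class_probs)), key=class_probs.__getitem__)
        if class_probs[best] < threshold:
            continue  # this query predicts background
        box = (
            (cx - w / 2) * image_w,
            (cy - h / 2) * image_h,
            (cx + w / 2) * image_w,
            (cy + h / 2) * image_h,
        )
        detections.append((best, class_probs[best], box))
    return detections


# Three queries, two real classes plus the trailing "no object" slot.
query_logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 5.0], [0.0, 3.0, 0.0]]
query_boxes = [(0.5, 0.5, 0.25, 0.5), (0.1, 0.1, 0.1, 0.1), (0.25, 0.75, 0.5, 0.5)]
detections = postprocess(query_logits, query_boxes, image_w=100, image_h=200)
```

The second query is dominated by the "no object" slot and is dropped. Duplicates are avoided during training, where each ground-truth object is matched to exactly one query, so nothing has to be suppressed at inference time.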

Representatives: DETR, Deformable DETR

Pros:

End-to-end, no hand-designed components
Clean architecture
Good for research

Cons:

Slow convergence
Struggles with small objects
Higher computational cost

Paradigm comparison

Feature	Anchor-based	Anchor-free	Transformer
NMS needed	Yes	Model-dependent	No
Anchors	Yes	No	No
Training stability	High	Medium	Low
Inference speed	Fast	Fast	Slower
Small objects	Good	Medium	Struggles
Large objects	Good	Good	Excellent

Model selection guide

[Task requirements?]
│
├─ Real-time edge? ──────▶ YOLOv8n (anchor-free, fastest)
│
├─ Production API? ──────▶ YOLOv8s/m (anchor-free, balanced)
│
├─ Maximum accuracy? ────▶ YOLOv8l/x or DETR
│
├─ Dense scenes? ────────▶ DETR (handles overlap well)
│
├─ Novel classes? ───────▶ OWL-ViT (open-vocabulary)
│
└─ Text-conditioned? ────▶ OWL-ViT or Grounding DINO
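
If the guide needs to be applied programmatically, the tree above can be written as a plain lookup table. This is only a sketch: the `RECOMMENDATIONS` mapping and the `recommend` helper are hypothetical names, and the entries simply restate the branches of the diagram.

```python
# Each key is a branch of the selection tree; values are the suggested models.
RECOMMENDATIONS = {
    "real-time edge": ["YOLOv8n"],
    "production api": ["YOLOv8s", "YOLOv8m"],
    "maximum accuracy": ["YOLOv8l", "YOLOv8x", "DETR"],
    "dense scenes": ["DETR"],
    "novel classes": ["OWL-ViT"],
    "text-conditioned": ["OWL-ViT", "Grounding DINO"],
}


def recommend(requirement):
    """Return the suggested models for a requirement, ignoring case and padding."""
    key = requirement.strip().lower()
    if key not in RECOMMENDATIONS:
        known = ", ".join(sorted(RECOMMENDATIONS))
        raise KeyError(f"unknown requirement {requirement!r}; expected one of: {known}")
    return RECOMMENDATIONS[key]


edge_models = recommend("Real-time edge")
text_models = recommend("text-conditioned")
```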
