DETR: Detection Transformer

DETR (Detection Transformer) reframes object detection as a direct set prediction problem, which removes the need for hand-designed components such as anchors and non-maximum suppression (NMS).

Figure 1. DETR architecture

DETR uses a CNN backbone for feature extraction, followed by a transformer encoder-decoder. Object queries learn to attend to image regions and directly predict bounding boxes.

The set prediction insight

Traditional detectors predict a large number of boxes (thousands) and then filter them using NMS. This creates several problems:

Duplicate predictions: NMS is a heuristic for removing duplicates, and it can also suppress true positives when distinct objects overlap heavily
Anchor design: Requires careful tuning of anchor sizes and aspect ratios
Post-processing: Detection is not truly end-to-end
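To make the first problem concrete, here is a minimal sketch of greedy NMS in plain Python. The function names (`iou`, `greedy_nms`), the 0.5 threshold, and the toy boxes are illustrative assumptions, not part of any particular detector. The example places two distinct objects close together to show how the lower-scoring one is discarded.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, visiting boxes from highest to lowest score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two different objects that overlap heavily, plus one distant object.
boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of 2/3, so it is suppressed even though, in this toy setup, it belongs to a different object.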

DETR's insight: treat detection as a set prediction problem. The model directly outputs a fixed-size set of N predictions, one per "object query". Slots that do not correspond to an object are classified as "no object", so no NMS-style post-processing is needed.

Architecture overview

Backbone (CNN)

DETR uses a standard CNN backbone (ResNet-50 or ResNet-101) to extract features:

Input: H × W × 3
         ↓
ResNet backbone
         ↓
Output: H/32 × W/32 × 2048

The feature map is then projected to a lower dimension (d = 256) and flattened into a sequence.
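The shape bookkeeping above can be sketched as a small helper. This is only arithmetic under stated assumptions: `detr_sequence_shape` is a hypothetical name, the stride of 32 and d = 256 come from the text, and rounding up for sizes that are not multiples of 32 is an approximation.

```python
import math


def detr_sequence_shape(height, width, stride=32, d_model=256):
    """Shape (sequence length, channels) of the flattened encoder input."""
    h = math.ceil(height / stride)  # feature-map rows after the backbone
    w = math.ceil(width / stride)   # feature-map columns after the backbone
    return (h * w, d_model)


# An 800 x 1216 image gives a 25 x 38 feature map, flattened to one sequence.
seq_len, d = detr_sequence_shape(800, 1216)
```

For this input the encoder sees a sequence of 25 × 38 = 950 tokens, each of dimension 256.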

Positional encoding

Since transformers have no built-in notion of spatial position, DETR adds 2D sinusoidal positional encodings to the feature map:

python

# 1D sinusoidal encoding of a coordinate pos (i indexes sin/cos channel pairs)
PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

# 2D encoding: encode x and y separately, then concatenate
PE_2d(x, y) = concat(PE(x), PE(y))
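A runnable version of these formulas might look like the sketch below. It assumes each axis receives half of the d channels and that sine and cosine values are interleaved; `pe_1d` and `pe_2d` are illustrative names, and details such as coordinate normalization used in real implementations are left out.

```python
import math


def pe_1d(pos, dim, temperature=10000.0):
    """Sinusoidal encoding of one coordinate into `dim` values (dim must be even)."""
    out = []
    for i in range(dim // 2):
        angle = pos / temperature ** (2 * i / dim)
        out.extend([math.sin(angle), math.cos(angle)])  # channels 2i and 2i+1
    return out


def pe_2d(x, y, d_model=256):
    """Concatenate the encodings of x and y, each using d_model / 2 channels."""
    half = d_model // 2
    return pe_1d(x, half) + pe_1d(y, half)


# Encoding for the feature-map cell at column 3, row 7.
vec = pe_2d(x=3, y=7, d_model=256)
```

The resulting vector has the same dimension as a feature token, so it can be added to the feature at that position element-wise.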

Transformer encoder

The encoder processes the flattened feature map with self-attention:

┌─────────────────────────────────────────────────────────────┐
│                   Transformer Encoder                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Features + Positional Encoding                             │
│       │                                                     │
│       ▼                                                     │
│  Self-Attention ──── Each position attends to all others    │
│       │                                                     │
│       ▼                                                     │
│  FFN ────────────── Position-wise transformation            │
│       │                                                     │
│       × 6 layers                                            │
│                                                             │
└─────────────────────────────────────────────────────────────┘
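The self-attention step in the diagram can be illustrated with a single-head, pure-Python sketch. It is a simplification under stated assumptions: the learned query/key/value projections, multiple heads, residual connections, and layer normalization of a real encoder layer are omitted, and `attention` is a hypothetical helper name.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)  # how much this position attends to every position
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


# Three flattened feature-map positions with d = 2 (toy values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(tokens, tokens, tokens)  # self-attention: Q = K = V
```

Each row of `w` sums to 1 and has one entry per position, which is why the cost of this step grows quadratically with the number of feature-map positions.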

Transformer decoder

The decoder takes object queries — learned embeddings that will become the final predictions:

┌─────────────────────────────────────────────────────────────┐
│                   Transformer Decoder                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Object Queries (N learned vectors)                         │
│       │                                                     │
│       ▼                                                     │
│  Self-Attention ──── Queries attend to each other           │
│       │                                                     │
│       ▼                                                     │
│  Cross-Attention ─── Queries attend to encoder output       │
│       │                                                     │
│       ▼                                                     │
│  FFN                                                        │
│       │                                                     │
│       × 6 layers                                            │
│                                                             │
└─────────────────────────────────────────────────────────────┘

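Key insight: Self-attention among the queries lets them coordinate so that two queries do not claim the same object, while cross-attention lets each query gather evidence from the relevant image regions. Each query learns to specialize in detecting objects at certain positions.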

Prediction heads

Two parallel heads are applied to each of the N decoder outputs:

Class head: a linear layer producing N × (C + 1) class predictions (including "no object")
Box head: a small MLP producing N × 4 box coordinates (center_x, center_y, width, height), normalized by the image size
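As a rough sketch of how one query's outputs could be turned into a detection, the helper below applies a softmax to the class logits and converts the box to pixel corners. It assumes the last logit is the "no object" class and that box coordinates are normalized to [0, 1]; `decode_prediction` is a hypothetical name, and real implementations usually apply a score threshold as well.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(class_logits, box, img_w, img_h):
    """Return (label, score, (x1, y1, x2, y2) in pixels), or None for "no object"."""
    probs = softmax(class_logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    if label == len(probs) - 1:  # assumption: last index means "no object"
        return None
    cx, cy, w, h = box
    xyxy = ((cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h)
    return label, probs[label], xyxy


# C = 2 object classes plus "no object"; image is 640 x 480 pixels.
detection = decode_prediction([2.0, 0.1, 0.5], (0.5, 0.5, 0.2, 0.4), 640, 480)
empty = decode_prediction([0.0, 0.0, 3.0], (0.5, 0.5, 0.2, 0.4), 640, 480)
```

The first call yields a class-0 box centered in the image, while the second returns `None` because the "no object" logit dominates.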

Bipartite matching loss

The core innovation is the loss function. DETR uses Hungarian matching to find the optimal assignment between predictions and ground truth:

Step 1: Hungarian matching

Pad the ground truth with "no object" (∅) entries up to size N, then find the permutation σ of the N predictions that minimizes the total matching cost:

$$ \hat{\sigma} = \arg\min_{\sigma} \sum_{i=1}^{N} \mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)}) $$

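where the matching cost combines classification and box similarity. Here c_i and b_i are the class and box of ground-truth entry i, and p̂ and b̂ are the predicted class probabilities and box of the matched prediction: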

$$ \mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)}) = -\mathbb{1}_{\{c_i \neq \varnothing\}} \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}} \mathcal{L}_{box}(b_i, \hat{b}_{\sigma(i)}) $$
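The matching step can be sketched for tiny inputs with an exhaustive search. This is a stand-in, not the Hungarian algorithm itself: brute force has factorial cost, and practical code would use a dedicated solver such as `scipy.optimize.linear_sum_assignment`. The box term is simplified to a plain L1 distance, and `match_cost` and `best_assignment` are hypothetical names. Because "no object" entries contribute zero cost, only the real ground-truth objects need to be assigned.

```python
from itertools import permutations


def l1(a, b):
    """L1 distance between two boxes given as 4-tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_cost(target, pred):
    """Cost of assigning one ground-truth object to one prediction."""
    cls, box = target
    probs, pred_box = pred
    return -probs[cls] + l1(box, pred_box)  # simplified box term: L1 only


def best_assignment(targets, preds):
    """Exhaustive search; sigma[i] is the prediction index matched to target i."""
    best, best_cost = None, float("inf")
    for sigma in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(t, preds[j]) for t, j in zip(targets, sigma))
        if cost < best_cost:
            best, best_cost = sigma, cost
    return best, best_cost


# Two ground-truth objects (class, box) and N = 3 predictions (probs, box).
targets = [(0, (0.2, 0.2, 0.1, 0.1)), (1, (0.7, 0.7, 0.2, 0.2))]
preds = [
    ([0.1, 0.8, 0.1], (0.7, 0.7, 0.2, 0.2)),
    ([0.3, 0.3, 0.4], (0.5, 0.5, 0.5, 0.5)),
    ([0.9, 0.05, 0.05], (0.2, 0.2, 0.1, 0.1)),
]
sigma, cost = best_assignment(targets, preds)
```

In this toy case the first object is matched to prediction 2 and the second to prediction 0; prediction 1 is left unmatched and would be trained toward "no object".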

Step 2: Loss computation

After matching, compute the final loss:

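$$ \mathcal{L} = \sum_{i=1}^{N} \left[ \lambda_{cls} \mathcal{L}_{cls} + \lambda_{L1} \mathcal{L}_{L1} + \lambda_{giou} \mathcal{L}_{giou} \right] $$

The last two terms together form the box loss, which combines an L1 distance with GIoU:

$$ \mathcal{L}_{box}(b_i, \hat{b}_{\sigma(i)}) = \lambda_{L1} \| b_i - \hat{b}_{\sigma(i)} \|_1 + \lambda_{giou} \left( 1 - \mathrm{GIoU}(b_i, \hat{b}_{\sigma(i)}) \right) $$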
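A minimal sketch of the box loss for a single matched pair is shown below, with boxes in (center_x, center_y, width, height) format. The default weights of 5 and 2 are an assumption taken from the values reported in the DETR paper, which this section does not list; `giou` and `box_loss` are illustrative names, and boxes are assumed to have positive area.

```python
def to_xyxy(box):
    """Convert (cx, cy, w, h) to corner format (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def giou(a, b):
    """Generalized IoU of two (cx, cy, w, h) boxes; value lies in (-1, 1]."""
    ax1, ay1, ax2, ay2 = to_xyxy(a)
    bx1, by1, bx2, by2 = to_xyxy(b)
    inter = (max(0.0, min(ax2, bx2) - max(ax1, bx1))
             * max(0.0, min(ay2, by2) - max(ay1, by1)))
    union = a[2] * a[3] + b[2] * b[3] - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def box_loss(target, pred, lambda_l1=5.0, lambda_giou=2.0):
    """Weighted L1 plus GIoU loss; the default weights are assumed values."""
    l1 = sum(abs(t - p) for t, p in zip(target, pred))
    return lambda_l1 * l1 + lambda_giou * (1.0 - giou(target, pred))


a = (0.25, 0.25, 0.5, 0.5)  # covers the top-left quarter of the image
b = (0.75, 0.75, 0.5, 0.5)  # covers the bottom-right quarter
loss_disjoint = box_loss(a, b)
loss_perfect = box_loss(a, a)
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes: the two quarter-image boxes above have IoU 0 but GIoU −0.5, so the loss still reflects how far apart they are.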

No NMS, no anchors

The bipartite matching loss ensures each ground truth object is matched to exactly one prediction. This eliminates:

Duplicate predictions: Each query produces at most one box
Anchor tuning: Object queries are learned, not designed
NMS heuristics: Set prediction handles duplicates naturally

Object queries: What do they learn?

Visualizations of per-query predictions in the DETR paper (Carion et al.) suggest that object queries learn to specialize in:

Spatial regions: Some queries focus on image center, others on edges
Object scales: Some queries detect large objects, others small
Object counts: Queries learn to "compete" for objects

Visualization of query attention patterns reveals they learn meaningful spatial specializations without explicit supervision.

Performance

Model	Backbone	mAP (COCO)	FPS
DETR	ResNet-50	42.0	12
DETR	ResNet-101	43.5	10
DETR-DC5	ResNet-101	44.9	8

DC5 = dilated C5 stage for higher resolution features

Strengths

Large objects: Global attention over the whole image helps DETR on large objects, where the paper reports clear gains over a Faster R-CNN baseline
No hand-designed components: Fully learned detection
Clean architecture: Easy to understand and extend

Weaknesses

Small objects: The single stride-32 feature map loses fine detail, so small objects are detected less reliably
Training time: The full training schedule reported in the paper runs for 500 epochs on COCO
Convergence: Slower convergence than anchor-based methods

Deformable DETR

An extension that addresses convergence issues by using deformable attention:

Each query only attends to a small set of key sampling points
Sampling points are learned offsets from reference points
Roughly 10× fewer training epochs than DETR, with better small-object detection

python

# Standard attention: O(HW × HW)
# Deformable attention: O(HW × K), K << HW

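# Each query q attends to K learned sampling points around its reference point p_q;
# m indexes attention heads, and the weights A_mqk sum to 1 over k
deform_attn(z_q, p_q, x) = Σ_m W_m [ Σ_k A_mqk · W'_m x(p_q + Δp_mqk) ]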
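The sampling idea can be sketched for one head, one query, and a scalar feature map. This is a deliberately reduced illustration: the projections W_m and W'_m and the multi-head sum are omitted, offsets and weights are passed in by hand rather than predicted from the query, and `bilinear` and `deform_attn_single` are hypothetical names.

```python
import math


def bilinear(feature, x, y):
    """Sample a 2D grid (list of rows) at fractional (x, y); outside counts as zero."""
    x0, y0 = math.floor(x), math.floor(y)
    total = 0.0
    for yi, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yi < len(feature) and 0 <= xi < len(feature[0]):
                total += wx * wy * feature[yi][xi]
    return total


def deform_attn_single(feature, ref, offsets, weights):
    """Weighted sum of K samples taken at ref + offset_k (one head, one query)."""
    px, py = ref
    return sum(a * bilinear(feature, px + dx, py + dy)
               for a, (dx, dy) in zip(weights, offsets))


feature = [[0, 1, 2],
           [3, 4, 5],
           [6, 7, 8]]
# K = 3 sampling points around the reference point (x=1, y=1).
value = deform_attn_single(feature, ref=(1, 1),
                           offsets=[(0, 0), (0.5, 0), (-1, -1)],
                           weights=[0.5, 0.25, 0.25])
```

The query touches only K = 3 locations instead of all 9 cells, which is the source of the O(HW × K) cost; fractional offsets are handled by bilinear interpolation.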

When to use DETR

Large, well-separated objects

DETR works best when objects are large and well separated, which makes it a good fit for scene understanding tasks.

Research

DETR's clean architecture makes it ideal for research on detection and set prediction.

Small objects

For small object detection, consider Deformable DETR or YOLOv8 instead.

References

Carion, N., et al. "End-to-End Object Detection with Transformers." ECCV 2020. arXiv:2005.12872
Zhu, X., et al. "Deformable DETR: Deformable Transformers for End-to-End Object Detection." ICLR 2021. arXiv:2010.04159
