Grounding DINO: Phrase Grounding Meets Detection
Grounding DINO combines phrase grounding with object detection: it detects objects described by natural-language phrases with high accuracy and stays open-vocabulary, so the categories come from the text prompt rather than a fixed label set.
The grounding problem
Phrase grounding: Given an image and a text description, find the image regions corresponding to each phrase.
Example:
Text: "A dog chasing a red ball in the park"
Phrases: ["dog", "red ball", "park"]
Output: [box_dog, box_ball, box_park]
This is more precise than open-vocabulary detection because it leverages the full sentence context.
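To make the input and output of phrase grounding concrete, here is a minimal sketch of the data structure involved. It only pairs each phrase with its character span in the caption and with a box. The class `GroundedPhrase`, the helper `pair_phrases_with_boxes`, and the box coordinates are hypothetical and made up for illustration; no model is run.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized to [0, 1]


@dataclass
class GroundedPhrase:
    phrase: str
    span: Tuple[int, int]  # character offsets of the phrase inside the caption
    box: Box


def pair_phrases_with_boxes(caption: str, phrases: List[str], boxes: List[Box]) -> List[GroundedPhrase]:
    """Attach a caption span and a box to each phrase (boxes are given, not predicted)."""
    results = []
    for phrase, box in zip(phrases, boxes):
        start = caption.find(phrase)
        if start == -1:
            raise ValueError(f"phrase {phrase!r} not found in caption")
        results.append(GroundedPhrase(phrase, (start, start + len(phrase)), box))
    return results


caption = "A dog chasing a red ball in the park"
phrases = ["dog", "red ball", "park"]
# Placeholder boxes, invented for this example only.
boxes = [(0.10, 0.40, 0.45, 0.90), (0.60, 0.70, 0.70, 0.80), (0.00, 0.00, 1.00, 1.00)]

grounded = pair_phrases_with_boxes(caption, phrases, boxes)
for g in grounded:
    print(g.phrase, g.span, g.box)
```

Keeping the span alongside the box is what ties each region back to a specific part of the sentence, rather than to a free-standing class name.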
Architecture
Grounding DINO is a fusion of:
- DINO: A Transformer-based detector from the DETR family (DETR with Improved deNoising anchOr boxes), not the self-supervised DINO vision model
- Grounding: Text-conditioned localization
┌─────────────────────────────────────────────────────────────┐
│                       Grounding DINO                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Image ────▶ Backbone ────▶ Image Features                  │
│                                     │                       │
│                                     │                       │
│  Text ─────▶ BERT ──────────▶ Text Features                 │
│                                     │                       │
│                                     ▼                       │
│                        Feature Fusion (cross-attn)          │
│                                     │                       │
│                                     ▼                       │
│                              Detection Head                 │
│                                     │                       │
│                                     ▼                       │
│                              Grounded Boxes                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Key innovations
1. Feature fusion
Early fusion of image and text features:
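As a rough illustration of what cross-modal attention computes, the sketch below implements single-head cross-attention in plain Python: every image token acts as a query and is replaced by a softmax-weighted average of the text tokens. This is a simplification under stated assumptions, not the model's actual layer: it has no learned projections, no multiple heads, and no residual connections, and the toy feature values are invented.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_feat: List[Vector], text_feat: List[Vector]) -> List[Vector]:
    """Single-head cross-attention without learned weights (illustrative assumption).

    Queries are image tokens; keys and values are both the text tokens.
    """
    dim = len(text_feat[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in image_feat:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in text_feat]
        weights = softmax(scores)
        fused.append([sum(w * value[c] for w, value in zip(weights, text_feat)) for c in range(dim)])
    return fused


# Toy features: 3 image tokens and 2 text tokens, each of dimension 2.
image_feat = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_feat = [[4.0, 0.0], [0.0, 4.0]]

fused = cross_attention(image_feat, text_feat)
for row in fused:
    print([round(x, 3) for x in row])
```

The output has one fused vector per image token, so a detection head can consume it in place of the image-only features. Condensed to a single call, the step looks like this: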
# Cross-modal attention
fused = cross_attention(image_feat, text_feat)
# Use fused features for detection
2. Contrastive phrase grounding
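Training uses a contrastive loss that aligns each region with its matching phrase:
$$ \mathcal{L}_{\text{ground}} = -\log \frac{\exp(\text{sim}(r_i, p_i))}{\sum_j \exp(\text{sim}(r_i, p_j))} $$
Here $r_i$ is the embedding of region $i$, $p_i$ is the embedding of its matched phrase, and the sum runs over all candidate phrases $p_j$.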
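The loss above can be evaluated directly on small inputs. The sketch below assumes that sim is a plain dot product and that the per-region losses are averaged; both choices, along with the function name `grounding_loss`, are assumptions for illustration and are not stated in the text.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def grounding_loss(regions: List[Vector], phrases: List[Vector]) -> float:
    """Mean over i of -log( exp(sim(r_i, p_i)) / sum_j exp(sim(r_i, p_j)) ).

    Region i is assumed to match phrase i; sim is a dot product (assumption).
    """
    total = 0.0
    for i, region in enumerate(regions):
        sims = [dot(region, phrase) for phrase in phrases]
        m = max(sims)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(s - m) for s in sims))
        total += log_denominator - sims[i]
    return total / len(regions)


regions = [[1.0, 0.0], [0.0, 1.0]]
aligned_phrases = [[1.0, 0.0], [0.0, 1.0]]   # phrase i points the same way as region i
shuffled_phrases = [[0.0, 1.0], [1.0, 0.0]]  # phrase i points at the wrong region

aligned_loss = grounding_loss(regions, aligned_phrases)
shuffled_loss = grounding_loss(regions, shuffled_phrases)
print(aligned_loss, shuffled_loss)
```

The loss is lower when each region is most similar to its own phrase, which is the alignment that training pushes toward.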
3. Zero-shot transfer
Because it is pre-trained on large image-text datasets, the model transfers to new domains without fine-tuning.
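In zero-shot use, the target categories are supplied through the text prompt. The released Grounding DINO code asks for category names separated by periods, and lowercase text ending in a period is the commonly recommended form. The helper below is a minimal sketch of that convention; the function name `build_prompt` is hypothetical and not part of any library.

```python
from typing import List


def build_prompt(class_names: List[str]) -> str:
    """Join category names into a period-separated, lowercase prompt ending with a period."""
    cleaned = [name.strip().lower() for name in class_names if name.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty class name is required")
    return " . ".join(cleaned) + " ."


prompt = build_prompt(["Dog", " red ball ", "park"])
print(prompt)
```

Changing the detected categories then only requires changing the list of names, with no retraining.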
Performance
| Model | COCO mAP | LVIS APr | RefCOCO Acc |
|---|---|---|---|
| OWL-ViT | 42.6 | 31.5 | - |
| GLIP | 49.8 | 27.0 | - |
| Grounding DINO | 52.5 | 33.8 | 85.6 |
When to use Grounding DINO
Recommended
- Detailed scene descriptions with multiple objects
- Phrase-level grounding (not just object names)
- High-accuracy open-vocabulary detection
Note
Grounding DINO is slower than YOLOv8 for detection-only tasks. For simple open-vocabulary detection, use OWL-ViT.
References
- Liu, S., et al. "Grounding DINO: Marrying DINO with Grounded Pre-Training." ECCV 2024. arXiv:2303.05499