Vision-Language Models

Vision-language models (VLMs) connect visual perception with natural language understanding. YOLO-Toys serves three of them (OWL-ViT, Grounding DINO, and BLIP) for open-vocabulary detection, image captioning, and visual question answering.

Why vision-language models?

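Traditional vision models have a fundamental limitation: they recognize only the fixed set of categories they were trained on. VLMs learn from image-text pairs instead, so the objects to find or the questions to answer can be given as free-form text at inference time. This enables: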

Open-vocabulary detection: Detect objects described in natural language
Image captioning: Generate natural language descriptions
Visual QA: Answer questions about image content

Models in YOLO-Toys

OWL-ViT

Open-vocabulary object detection using contrastive pre-training.

Task: Text-conditioned detection
Input: Image + text queries
Output: Bounding boxes for each query
Use case: Detect arbitrary objects without training a detector for them
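
The matching step behind text-conditioned detection can be sketched with toy data. The example below is a simplification under stated assumptions, not OWL-ViT itself: the two-dimensional vectors are hand-made stand-ins for the image-region and text embeddings a real model would compute, the names `cosine` and `detect` are hypothetical, and a real model applies learned projections and per-query scoring rather than a plain cosine and a single best match.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def detect(regions, queries, threshold=0.8):
    """Return {query: [(box, score), ...]} for regions that match a query.

    regions: list of (box, embedding) pairs, box = (x_min, y_min, x_max, y_max)
    queries: dict mapping query text to its embedding
    """
    results = {text: [] for text in queries}
    for box, region_vec in regions:
        # Score this region against every text query and keep the best one.
        best_text, best_score = max(
            ((text, cosine(region_vec, vec)) for text, vec in queries.items()),
            key=lambda pair: pair[1],
        )
        # Regions that resemble no query closely enough are dropped.
        if best_score >= threshold:
            results[best_text].append((box, round(best_score, 3)))
    return results


# Toy embeddings (assumption): axis 0 means "bicycle-like", axis 1 "cat-like".
queries = {"a cat": [0.0, 1.0], "a bicycle": [1.0, 0.0]}
regions = [
    ((10, 20, 110, 220), [0.1, 0.9]),     # close to the "a cat" query
    ((200, 40, 380, 300), [0.95, 0.1]),   # close to the "a bicycle" query
    ((0, 0, 50, 50), [0.7, 0.7]),         # ambiguous background region
]

detections = detect(regions, queries)
```

Because the queries are ordinary text embedded at inference time, adding a new object type means adding a new entry to `queries`, not retraining the detector.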

Grounding DINO

Phrase grounding with fused vision-language features.

Task: Phrase grounding (link text to image regions)
Input: Image + text description
Output: Bounding boxes for each phrase
Use case: Detailed scene understanding
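
Grounding DINO checkpoints are commonly prompted with a single string of lowercase phrases separated by periods. The helper below is a minimal sketch of building such a prompt; `build_grounding_prompt` is a hypothetical name, and the exact text format that YOLO-Toys expects is not stated here, so treat the separator convention as an assumption.

```python
def build_grounding_prompt(phrases):
    """Join phrases into one lowercase, period-separated prompt string."""
    # Normalize each phrase: trim whitespace, lowercase, drop a trailing period.
    cleaned = [p.strip().lower().rstrip(".").strip() for p in phrases]
    cleaned = [p for p in cleaned if p]
    if not cleaned:
        raise ValueError("at least one non-empty phrase is required")
    # Assumed convention: phrases separated by " . " with a closing period.
    return " . ".join(cleaned) + " ."


prompt = build_grounding_prompt(["A Dog", "red car."])
```

Each phrase in the prompt is then grounded separately, which is why the output contains boxes per phrase rather than per image.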

BLIP

Bootstrapping Language-Image Pre-training for unified vision-language understanding and generation.

Task: Image captioning, visual QA
Input: Image, plus an optional text question
Output: Natural language description/answer
Use case: Content understanding, accessibility
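
The optional question is what switches BLIP between captioning and visual QA. The sketch below shows one way to express that, assuming the Hugging Face `transformers` library and the public `Salesforce/blip-image-captioning-base` and `Salesforce/blip-vqa-base` checkpoints. Neither is confirmed as what YOLO-Toys uses internally, and `pick_checkpoint` and `describe` are hypothetical names.

```python
def pick_checkpoint(question=None):
    """Choose the task and checkpoint from whether a question was supplied."""
    if question and question.strip():
        return "vqa", "Salesforce/blip-vqa-base"
    return "caption", "Salesforce/blip-image-captioning-base"


def describe(image, question=None, max_new_tokens=30):
    """Caption a PIL image, or answer a question about it."""
    # Imported lazily so pick_checkpoint works without transformers installed.
    from transformers import (
        BlipForConditionalGeneration,
        BlipForQuestionAnswering,
        BlipProcessor,
    )

    task, checkpoint = pick_checkpoint(question)
    # For brevity the model is loaded on every call; a service would cache it.
    processor = BlipProcessor.from_pretrained(checkpoint)
    if task == "vqa":
        model = BlipForQuestionAnswering.from_pretrained(checkpoint)
        inputs = processor(images=image, text=question, return_tensors="pt")
    else:
        model = BlipForConditionalGeneration.from_pretrained(checkpoint)
        inputs = processor(images=image, return_tensors="pt")

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

Calling `describe(image)` would return a caption, while `describe(image, "What color is the car?")` would return a short answer; both calls download the corresponding checkpoint on first use.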

Comparison

Model	Task	Open-vocab	Output
OWL-ViT	Detection	✓	Boxes
Grounding DINO	Grounding	✓	Boxes
BLIP	Caption/VQA	N/A	Text
