BLIP: Bootstrapping Language-Image Pre-training

BLIP (Bootstrapping Language-Image Pre-training) is a unified vision-language model for both understanding and generation tasks. YOLO-Toys uses BLIP for image captioning and visual question answering.

Tasks

Image Captioning

Generate natural language descriptions of images:

Input: [Image]
Output: "A golden retriever playing with a red ball in a sunny park"

Visual Question Answering (VQA)

Answer questions about image content:

Input: [Image] + "What color is the dog's collar?"
Output: "blue"

Architecture

BLIP uses a unified architecture with three components:

┌─────────────────────────────────────────────────────────────┐
│                      BLIP Architecture                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐              ┌─────────────┐              │
│  │   Vision    │              │    Text     │              │
│  │   Encoder   │              │   Encoder   │              │
│  │  (ViT)      │              │  (BERT)     │              │
│  └─────────────┘              └─────────────┘              │
│         │                            │                     │
│         └────────────┬───────────────┘                     │
│                      │                                     │
│                      ▼                                     │
│         ┌─────────────────────────┐                        │
│         │   Multimodal Encoder    │                        │
│         │   (Image + Text)        │                        │
│         └─────────────────────────┘                        │
│                      │                                     │
│         ┌────────────┴────────────┐                        │
│         │                         │                        │
│         ▼                         ▼                        │
│  ┌───────────────┐        ┌─────────────┐                 │
│  │ Understanding │        │  Generation │                 │
│  │ Head          │        │  Head       │                 │
│  └───────────────┘        └─────────────┘                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Key innovations

1. Bootstrap training

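BLIP bootstraps its own training data. A captioning model generates synthetic captions for web images, and a learned filter discards noisy captions, both synthetic and original web text, that do not match their image (the paper calls this pair CapFilt). The model is then trained on the cleaned image-text pairs: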

python

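# Bootstrap loop (pseudocode; the function names are illustrative)
captions = generate_captions(images)          # captioner proposes synthetic captions
filtered = filter_captions(captions, images)  # learned filter drops captions that do not match
train_on(images, filtered)                    # train on the cleaned image-caption pairs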
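
The loop above can be made concrete with a small, self-contained sketch. Everything here is a stand-in chosen for illustration: images are represented as sets of tags, `toy_captioner` is a hypothetical captioner that always emits the same sentence, and `overlap_score` is a hypothetical word-overlap score. In BLIP both the captioner and the filter are learned networks, so this only shows the shape of the generate-then-filter step, not the actual models.

```python
from typing import Callable, FrozenSet, List, Tuple

# Assumption for illustration: an "image" is just the set of tags it contains.
Image = FrozenSet[str]


def bootstrap_captions(
    images: List[Image],
    captioner: Callable[[Image], str],
    match_score: Callable[[Image, str], float],
    threshold: float = 0.5,
) -> List[Tuple[Image, str]]:
    """Generate one synthetic caption per image and keep only well-matched pairs."""
    kept = []
    for image in images:
        caption = captioner(image)
        if match_score(image, caption) >= threshold:
            kept.append((image, caption))
    return kept


def toy_captioner(image: Image) -> str:
    # Hypothetical captioner: always produces the same caption, so it is
    # wrong for any image that does not show a dog and a ball.
    return "a dog with a ball"


def overlap_score(image: Image, caption: str) -> float:
    # Hypothetical filter: fraction of the image's tags mentioned in the caption.
    words = set(caption.lower().split())
    return len(words & image) / max(len(image), 1)


images = [frozenset({"dog", "ball"}), frozenset({"cat", "sofa"})]
kept = bootstrap_captions(images, toy_captioner, overlap_score)
print(kept)
```

Only the first pair survives the filter, because the caption mentions none of the second image's tags. The surviving pairs are what a real pipeline would pass to the training step.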

2. Unified architecture

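The same architecture handles both understanding (VQA) and generation (captioning), although YOLO-Toys loads a separate fine-tuned checkpoint for each task (see Usage in YOLO-Toys):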

Understanding: Encode image + question, predict answer
Generation: Encode image, decode caption token by token
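
"Decode caption token by token" can be illustrated with a greedy decoding loop. This is a minimal sketch under stated assumptions: `toy_next_token` is a hypothetical scripted function standing in for the decoder, which in the real model conditions on the encoded image as well as on the tokens produced so far. The `<bos>` and `<eos>` marker strings are placeholders too, and greedy selection is only one possible decoding strategy.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token: Callable[[Sequence[str]], str],
    max_len: int = 20,
    bos: str = "<bos>",
    eos: str = "<eos>",
) -> List[str]:
    """Repeatedly ask for the next token until end-of-sequence or max_len."""
    tokens = [bos]
    for _ in range(max_len):
        token = next_token(tokens)
        if token == eos:
            break
        tokens.append(token)
    return tokens[1:]  # drop the begin-of-sequence marker


def toy_next_token(prefix: Sequence[str]) -> str:
    # Hypothetical decoder: replays a fixed script instead of scoring a vocabulary.
    script = ["a", "dog", "playing", "<eos>"]
    generated = len(prefix) - 1  # tokens produced so far, excluding <bos>
    return script[generated] if generated < len(script) else "<eos>"


caption = " ".join(greedy_decode(toy_next_token))
print(caption)
```

The `max_len` cap matters in practice: without it, a decoder that never emits the end marker would loop forever.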

3. Noise injection

Add noise to text inputs during training for robustness:

python

noisy_text = add_noise(text, prob=0.1)
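
The source does not say what kind of noise `add_noise` applies, so the following is only one plausible reading: each whitespace-separated word is independently replaced by a mask token with probability `prob`. The `rng` and `mask_token` parameters are additions for this sketch, included so the behaviour can be made reproducible.

```python
import random
from typing import Optional


def add_noise(
    text: str,
    prob: float = 0.1,
    rng: Optional[random.Random] = None,
    mask_token: str = "[MASK]",
) -> str:
    """Replace each word with mask_token independently with probability prob.

    Assumption: word-level masking. Other noise types (dropping or swapping
    words, character edits) would fit the same signature.
    """
    rng = rng or random.Random()
    words = text.split()
    noisy = [mask_token if rng.random() < prob else word for word in words]
    return " ".join(noisy)


# Passing a seeded generator makes the corruption repeatable across runs.
noisy_text = add_noise("a dog playing in the park", prob=0.1, rng=random.Random(0))
print(noisy_text)
```

Because masking replaces words rather than removing them, the noisy text always has the same number of words as the input.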

Model variants

Model	Params	Caption Quality	VQA Accuracy
BLIP-base	223M	Good	75.0%
BLIP-large	582M	Better	78.3%

Usage in YOLO-Toys

Image Captioning

python

# POST /api/v1/inference/blip
{
  "model": "Salesforce/blip-image-captioning-base",
  "image": "<base64_image>"
}

# Response
{
  "caption": "A dog playing in the park"
}

Visual QA

python

# POST /api/v1/inference/blip
{
  "model": "Salesforce/blip-vqa-base",
  "image": "<base64_image>",
  "question": "What is the dog doing?"
}

# Response
{
  "answer": "playing"
}
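
A client for the two requests above might look like the following sketch, which uses only the Python standard library. The endpoint path, model names, and field names come from the examples; the base URL, the JSON content type, and sending the image as plain base64 without a data-URI prefix are assumptions that should be checked against the running server.

```python
import base64
import json
import urllib.request
from typing import Optional

CAPTION_MODEL = "Salesforce/blip-image-captioning-base"
VQA_MODEL = "Salesforce/blip-vqa-base"


def build_blip_payload(image_bytes: bytes, question: Optional[str] = None) -> dict:
    """Build a captioning request, or a VQA request when a question is given."""
    payload = {
        "model": VQA_MODEL if question else CAPTION_MODEL,
        # Assumption: the server expects plain base64, not a data URI.
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }
    if question:
        payload["question"] = question
    return payload


def post_blip(base_url: str, payload: dict, timeout: float = 30.0) -> dict:
    """POST the payload to the BLIP endpoint and return the decoded JSON reply."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/inference/blip",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


caption_payload = build_blip_payload(b"abc")
vqa_payload = build_blip_payload(b"abc", question="What is the dog doing?")
print(json.dumps(vqa_payload, indent=2))
```

With a server running, `post_blip("http://localhost:8000", vqa_payload)` would send the request; the host and port here are placeholders, not values from this document.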

Performance

Image Captioning (COCO test)

Model	BLEU-4	METEOR	CIDEr
Show-Tell	30.3	25.6	94.6
OSCAR	36.5	30.3	116.4
BLIP	39.9	32.1	133.2

VQA (VQAv2 test-dev)

Model	Accuracy
LXMERT	72.5%
ViLT	73.0%
BLIP	78.3%
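
For reading these numbers, it helps to know that VQAv2 accuracy is a soft score rather than an exact match: an answer counts as fully correct once at least three of the human annotators gave it. The sketch below shows the commonly quoted simplified form, min(matches / 3, 1). It is not the official evaluation script, which also normalizes answers and averages over subsets of annotators; the function name and the lowercase comparison are choices made for this example.

```python
from typing import List


def vqa_accuracy(predicted: str, human_answers: List[str]) -> float:
    """Simplified VQA soft accuracy: min(number of matching annotators / 3, 1)."""
    target = predicted.strip().lower()
    matches = sum(1 for answer in human_answers if answer.strip().lower() == target)
    return min(matches / 3.0, 1.0)


# Ten annotators: two said "blue", eight said "navy".
answers = ["blue"] * 2 + ["navy"] * 8
score_blue = vqa_accuracy("blue", answers)
score_navy = vqa_accuracy("navy", answers)
score_red = vqa_accuracy("red", answers)
print(score_blue, score_navy, score_red)
```

A dataset-level figure such as the ones in the table is the mean of this per-question score.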

When to use BLIP

Image Captioning

Generate alt text for accessibility
Create image descriptions for search
Summarize visual content

Visual QA

Answer questions about image content
Extract specific information from images
Interactive image exploration

References

Li, J., et al. "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation." ICML 2022. arXiv:2201.12086
