
BLIP: Bootstrapping Language-Image Pre-training

BLIP (Bootstrapping Language-Image Pre-training) is a unified vision-language model for both understanding and generation tasks. YOLO-Toys uses BLIP for image captioning and visual question answering.

Tasks

Image Captioning

Generate natural language descriptions of images:

Input: [Image]
Output: "A golden retriever playing with a red ball in a sunny park"

Visual Question Answering (VQA)

Answer questions about image content:

Input: [Image] + "What color is the dog's collar?"
Output: "blue"

Architecture

BLIP uses a unified architecture with three main components (a vision encoder, a text encoder, and a multimodal encoder) that feed task-specific heads:

┌─────────────────────────────────────────────────────────────┐
│                      BLIP Architecture                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐              ┌─────────────┐              │
│  │   Vision    │              │    Text     │              │
│  │   Encoder   │              │   Encoder   │              │
│  │  (ViT)      │              │  (BERT)     │              │
│  └─────────────┘              └─────────────┘              │
│         │                            │                     │
│         └────────────┬───────────────┘                     │
│                      │                                     │
│                      ▼                                     │
│         ┌─────────────────────────┐                        │
│         │   Multimodal Encoder    │                        │
│         │   (Image + Text)        │                        │
│         └─────────────────────────┘                        │
│                      │                                     │
│         ┌────────────┴────────────┐                        │
│         │                         │                        │
│         ▼                         ▼                        │
│  ┌───────────────┐        ┌─────────────┐                 │
│  │ Understanding │        │  Generation │                 │
│  │ Head          │        │  Head       │                 │
│  └───────────────┘        └─────────────┘                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Key innovations

1. Bootstrap training

BLIP bootstraps its own training data: a captioning model generates synthetic captions for images, and a learned filter removes noisy captions before the model is trained on the result:

python
# Bootstrap loop
captions = generate_captions(images)
filtered = filter_captions(captions, images)
train_on(images, filtered)
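
The loop above can be sketched with toy stand-ins. This is a minimal illustration, not BLIP's implementation: `bootstrap_dataset`, `toy_captioner`, `toy_score`, and the `threshold` value are all hypothetical names and values chosen for the example, and the "filter" here is a simple string check standing in for a learned image-text matching model.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical signatures: a captioner maps an image to text, and a filter
# returns an image-text match score in [0, 1].
Captioner = Callable[[str], str]
MatchScore = Callable[[str, str], float]


def bootstrap_dataset(
    web_pairs: Iterable[Tuple[str, str]],
    captioner: Captioner,
    match_score: MatchScore,
    threshold: float = 0.5,  # assumed cut-off, not a value from BLIP
) -> List[Tuple[str, str]]:
    """Keep original and synthetic captions that the filter judges to match."""
    kept = []
    for image, web_text in web_pairs:
        synthetic = captioner(image)  # generate a synthetic caption
        for text in (web_text, synthetic):
            if match_score(image, text) >= threshold:  # drop noisy text
                kept.append((image, text))
    return kept


# Toy stand-ins: images are file names, and a caption "matches" if it
# mentions the file's base name.
def toy_captioner(image: str) -> str:
    return f"a photo of a {image.split('.')[0]}"


def toy_score(image: str, text: str) -> float:
    return 1.0 if image.split(".")[0] in text else 0.0


pairs = [("dog.jpg", "buy cheap shoes online"), ("cat.jpg", "a cat on a sofa")]
print(bootstrap_dataset(pairs, toy_captioner, toy_score))
```

In this toy run the irrelevant web text for `dog.jpg` is dropped, while the synthetic caption and the matching web text are kept. The kept pairs are what the `train_on` step would consume.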

2. Unified architecture

The same architecture handles both understanding (VQA) and generation (captioning):

  • Understanding: Encode image + question, predict answer
  • Generation: Encode image, decode caption token by token
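
To make "decode caption token by token" concrete, here is a minimal greedy decoding sketch. It assumes a hypothetical `next_token_scores` callable that stands in for the decoder; the real model may use beam search or sampling rather than greedy selection, and the `<bos>`/`<eos>` token strings are illustrative.

```python
from typing import Callable, Dict, List

BOS, EOS = "<bos>", "<eos>"  # illustrative special tokens

# Hypothetical decoder interface: given image features and the tokens so far,
# return a score for each candidate next token.
Scorer = Callable[[object, List[str]], Dict[str, float]]


def greedy_decode(image_features: object, next_token_scores: Scorer,
                  max_len: int = 20) -> List[str]:
    """Pick the highest-scoring token at each step until EOS or max_len."""
    tokens = [BOS]
    for _ in range(max_len):
        scores = next_token_scores(image_features, tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return tokens[1:]  # strip the BOS marker


def toy_scores(image_features: object, prefix: List[str]) -> Dict[str, float]:
    """Toy scorer that always prefers the next word of a fixed caption."""
    caption = ["a", "dog", "playing", EOS]
    target = caption[len(prefix) - 1]
    return {tok: (1.0 if tok == target else 0.0) for tok in caption}


print(" ".join(greedy_decode(None, toy_scores)))
```

The understanding path differs only in what is scored: instead of looping over positions, the image and question are encoded once and an answer is predicted from that joint representation.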

3. Noise injection

Add noise to text inputs during training for robustness:

python
noisy_text = add_noise(text, prob=0.1)
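
This page does not define `add_noise`, so the following is only one possible reading: each word is independently replaced by a mask token with probability `prob`. The masking strategy, the `mask_token` value, and the `seed` parameter are assumptions made for illustration.

```python
import random
from typing import Optional


def add_noise(text: str, prob: float = 0.1, mask_token: str = "[MASK]",
              seed: Optional[int] = None) -> str:
    """Replace each whitespace-separated word with mask_token with probability prob."""
    rng = random.Random(seed)  # local RNG so a seed gives repeatable noise
    words = text.split()
    noisy = [mask_token if rng.random() < prob else word for word in words]
    return " ".join(noisy)


print(add_noise("a dog playing in the park", prob=0.3, seed=0))
```

Word-level masking keeps the sequence length unchanged, which makes it easy to compare the noisy and clean inputs; dropping or shuffling words would be equally valid interpretations.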

Model variants

Model       | Params | Caption Quality | VQA Accuracy
BLIP-base   | 223M   | Good            | 75.0%
BLIP-large  | 582M   | Better          | 78.3%

Usage in YOLO-Toys

Image Captioning

python
# POST /api/v1/inference/blip
{
  "model": "Salesforce/blip-image-captioning-base",
  "image": "<base64_image>"
}

# Response
{
  "caption": "A dog playing in the park"
}

Visual QA

python
# POST /api/v1/inference/blip
{
  "model": "Salesforce/blip-vqa-base",
  "image": "<base64_image>",
  "question": "What is the dog doing?"
}

# Response
{
  "answer": "playing"
}
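
Both requests above can be built by one small client helper. The sketch below uses only the Python standard library. The endpoint path, model names, and field names come from the examples above; the base URL, the assumption that `image` is plain base64 without a data-URL prefix, and the helper names `build_payload` and `post_blip` are hypothetical.

```python
import base64
import json
import urllib.request
from typing import Optional

ENDPOINT_PATH = "/api/v1/inference/blip"  # path taken from the examples above


def build_payload(image_bytes: bytes, question: Optional[str] = None) -> dict:
    """Build a captioning payload, or a VQA payload when a question is given."""
    payload = {
        "model": ("Salesforce/blip-vqa-base" if question
                  else "Salesforce/blip-image-captioning-base"),
        # Assumption: the server expects plain base64 text for the image.
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }
    if question:
        payload["question"] = question
    return payload


def post_blip(base_url: str, payload: dict, timeout: float = 30.0) -> dict:
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + ENDPOINT_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call might look like `post_blip("http://localhost:8000", build_payload(image_bytes, "What is the dog doing?"))`, where the host and port are placeholders for wherever the service is running.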

Performance

Image Captioning (COCO test)

Model      | BLEU-4 | METEOR | CIDEr
Show-Tell  | 30.3   | 25.6   | 94.6
OSCAR      | 36.5   | 30.3   | 116.4
BLIP       | 39.9   | 32.1   | 133.2

VQA (VQAv2 test-dev)

Model   | Accuracy
LXMERT  | 72.5%
ViLT    | 73.0%
BLIP    | 78.3%
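
For context on how such accuracy figures are scored: VQAv2 collects several human answers per question and gives a prediction full credit when at least three annotators agree with it. The sketch below is a simplified version of that rule; the official evaluation also normalizes answers (punctuation, articles, number words) and averages over annotator subsets, which is omitted here. The function name is illustrative.

```python
from typing import List


def vqa_accuracy(predicted: str, human_answers: List[str]) -> float:
    """Simplified VQA score: min(number of matching human answers / 3, 1)."""
    normalized = predicted.strip().lower()
    matches = sum(1 for answer in human_answers
                  if answer.strip().lower() == normalized)
    return min(matches / 3.0, 1.0)


# Two of ten annotators said "blue", so the prediction earns partial credit.
print(vqa_accuracy("blue", ["blue", "blue"] + ["navy"] * 8))
```

A dataset-level accuracy is then the mean of this per-question score.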

When to use BLIP

Image Captioning

  • Generate alt text for accessibility
  • Create image descriptions for search
  • Summarize visual content

Visual QA

  • Answer questions about image content
  • Extract specific information from images
  • Interactive image exploration

References

  1. Li, J., et al. "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation." ICML 2022. arXiv:2201.12086

Released under the MIT License.