
Performance Benchmarks

These benchmarks establish the baseline performance characteristics of the YOLO-Toys runtime. They are internal operational baselines, not competitive comparisons, and are intended to help operators tune cache sizes, timeout values, and concurrency limits for their deployment hardware.

Key insight
The cold-start penalty is roughly 8–32× the warm-start latency across the models measured here. This makes the ModelCache the single most performance-critical component in the runtime. A cache miss costs from about 0.1 s to several seconds; a cache hit costs between 4 ms and 420 ms, depending on model and device.

Methodology

All measurements are taken under controlled conditions:

| Parameter | Value |
| --- | --- |
| Hardware | Intel Core i7-12700H, 32 GB RAM, NVIDIA RTX 3060 Laptop (6 GB VRAM) |
| Software | Python 3.12, PyTorch 2.3.0, CUDA 12.1 |
| Input | 640×480 BGR numpy array, random noise |
| Warmup | 3 inference runs before measurement to stabilize GPU clocks |
| Metric | Wall-clock time, single-threaded, time.perf_counter() |
| Repetitions | 10 runs per scenario, median reported |
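The measurement procedure in the table can be sketched as a small timing helper. This is a minimal illustration, not the project's actual harness: the function name `median_latency` and the stand-in workload are assumptions made for the example.

```python
import statistics
import time
from typing import Callable


def median_latency(fn: Callable[[], object], warmup: int = 3, repetitions: int = 10) -> float:
    """Return the median wall-clock time of fn() in seconds.

    Hypothetical helper mirroring the methodology table:
    3 warmup runs, 10 measured runs, median reported.
    """
    # Warmup runs are executed but not recorded.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in workload; replace with a real inference call.
    latency = median_latency(lambda: sum(range(100_000)))
    print(f"median latency: {latency * 1000:.2f} ms")
```

When timing GPU inference this way, keep in mind that CUDA kernels launch asynchronously, so the timed callable should wait for the device to finish (for example with `torch.cuda.synchronize()`) before returning.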

Cold-start vs warm-start latency

Figure 1. Latency comparison

Cold-start latency is the cost of loading a model from disk or HuggingFace Hub. Warm-start latency is the inference cost once the model is resident in the ModelCache. The ratio makes cache hit rate the dominant operational variable.

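| Model | Cold (CPU) | Cold (CUDA) | Warm (CPU) | Warm (CUDA) |
| --- | --- | --- | --- | --- |
| yolov8n.pt | 0.45s | 0.12s | 18ms | 4ms |
| yolov8m.pt | 1.82s | 0.38s | 65ms | 12ms |
| facebook/detr-resnet-50 | 4.2s | 1.1s | 380ms | 90ms |
| google/owlvit-base-patch32 | 3.8s | 0.95s | 420ms | 110ms |
| Salesforce/blip-image-captioning-base | 2.1s | 0.55s | 280ms | 70ms |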
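To see the cold-to-warm ratio per model and device, the figures in the table can be divided directly. The sketch below only restates the table's numbers in seconds; the dictionary layout and the function name `cold_warm_ratio` are choices made for this example.

```python
# Latencies in seconds, copied from the table above.
LATENCIES = {
    "yolov8n.pt": {"cold_cpu": 0.45, "cold_cuda": 0.12, "warm_cpu": 0.018, "warm_cuda": 0.004},
    "yolov8m.pt": {"cold_cpu": 1.82, "cold_cuda": 0.38, "warm_cpu": 0.065, "warm_cuda": 0.012},
    "facebook/detr-resnet-50": {"cold_cpu": 4.2, "cold_cuda": 1.1, "warm_cpu": 0.380, "warm_cuda": 0.090},
    "google/owlvit-base-patch32": {"cold_cpu": 3.8, "cold_cuda": 0.95, "warm_cpu": 0.420, "warm_cuda": 0.110},
    "Salesforce/blip-image-captioning-base": {"cold_cpu": 2.1, "cold_cuda": 0.55, "warm_cpu": 0.280, "warm_cuda": 0.070},
}


def cold_warm_ratio(row: dict, device: str) -> float:
    """Return cold-start latency divided by warm-start latency for 'cpu' or 'cuda'."""
    return row[f"cold_{device}"] / row[f"warm_{device}"]


if __name__ == "__main__":
    for model, row in LATENCIES.items():
        cpu = cold_warm_ratio(row, "cpu")
        cuda = cold_warm_ratio(row, "cuda")
        print(f"{model}: CPU {cpu:.1f}x, CUDA {cuda:.1f}x")
```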

Why cold-start matters

YOLO-Toys uses lazy loading: models are loaded on first request, not at server startup. This keeps startup time fast but means the first caller for any model bears the full cold-start cost. In production, operators should pre-warm critical models using the /infer endpoint at startup before routing real traffic.
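A pre-warm step could look roughly like the sketch below, which sends one request per model to /infer and records whether each succeeded. The request body is not documented in this section, so the payload is supplied by the caller through `build_payload`; that parameter, the JSON transport, and the helper names `post_json` and `prewarm` are all assumptions for illustration.

```python
import json
import urllib.request
from typing import Callable, Iterable


def post_json(url: str, payload: dict, timeout: float = 60.0) -> int:
    """POST a JSON body and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def prewarm(
    base_url: str,
    models: Iterable[str],
    build_payload: Callable[[str], dict],
    send: Callable[[str, dict], int] = post_json,
) -> dict[str, bool]:
    """Send one /infer request per model so each one is loaded into the cache.

    build_payload is hypothetical: it must return whatever request body
    the deployed /infer endpoint actually expects for the given model.
    """
    results = {}
    for model in models:
        try:
            status = send(f"{base_url}/infer", build_payload(model))
            results[model] = 200 <= status < 300
        except OSError:
            # urllib raises URLError/HTTPError (both OSError subclasses) on failure.
            results[model] = False
    return results
```

A deployment script might call `prewarm(...)` with its list of critical models and hold back real traffic until every value in the returned dictionary is `True`. The generous default timeout reflects that each of these requests pays the full cold-start cost.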

Throughput under concurrency

Simulated with locust (20 concurrent users, spawn rate 5/s, 2-minute run):

| Scenario | Requests/sec | Avg latency | p95 latency |
| --- | --- | --- | --- |
| Single model (yolov8n.pt), cached | 142 | 120ms | 180ms |
| Two models rotating, both cached | 118 | 145ms | 220ms |
| Cache miss every 3rd request | 38 | 420ms | 1.2s |
| Full GPU memory (OOM pressure) | 12 | 1.8s | 5.2s |

Operational threshold
When cache hit rate drops below ~80%, the system enters a latency cliff where throughput collapses by 3–4×. Monitor /metrics for cache_size and alert when it approaches cache_maxsize.
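The alert condition can be sketched as a small check over the /metrics output. The exposition format is not specified here, so this example assumes Prometheus-style text with unlabelled `cache_size` and `cache_maxsize` samples; the 0.9 alert ratio and the function names are likewise hypothetical.

```python
def parse_gauges(metrics_text: str) -> dict[str, float]:
    """Parse 'name value' lines, skipping comments and blank lines.

    Assumes Prometheus-style text without labels; labelled samples
    are not handled by this sketch.
    """
    gauges = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            gauges[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return gauges


def cache_near_capacity(metrics_text: str, alert_ratio: float = 0.9) -> bool:
    """Return True when cache_size / cache_maxsize reaches alert_ratio."""
    gauges = parse_gauges(metrics_text)
    size = gauges["cache_size"]
    maxsize = gauges["cache_maxsize"]
    return maxsize > 0 and size / maxsize >= alert_ratio


if __name__ == "__main__":
    sample = "# HELP cache_size Models currently cached\ncache_size 3\ncache_maxsize 3\n"
    print(cache_near_capacity(sample))
```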

Memory footprint per model family

| Model | Model size (disk) | Peak VRAM (inference) | Cache overhead |
| --- | --- | --- | --- |
| yolov8n.pt | 6.2 MB | 180 MB | ~2 MB |
| yolov8m.pt | 49.7 MB | 420 MB | ~2 MB |
| facebook/detr-resnet-50 | 159 MB | 680 MB | ~2 MB |
| google/owlvit-base-patch32 | 587 MB | 1.1 GB | ~2 MB |
| Salesforce/blip-image-captioning-base | 990 MB | 1.6 GB | ~2 MB |

Safe cache configurations for a 6 GB GPU

| Configuration | VRAM estimate | Status |
| --- | --- | --- |
| 3× BLIP | ~4.8 GB | OOM risk |
| 2× BLIP + 1× DETR | ~3.9 GB | OOM risk |
| 3× DETR | ~2.0 GB | Safe |
| 1× BLIP + 1× DETR + 1× YOLO nano | ~2.5 GB | Safe |
| 3× YOLO nano | ~0.6 GB | Very safe |

This is why memory_threshold=0.85 is critical: it triggers LRU eviction before the GPU is exhausted.
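The VRAM estimates for a cache configuration come from summing the per-model peak figures in the memory footprint table. A minimal sketch of that arithmetic follows; the function names are assumptions, and the result is only a rough estimate because it ignores the CUDA context and allocator fragmentation.

```python
# Peak inference VRAM in GB, taken from the memory footprint table.
PEAK_VRAM_GB = {
    "yolov8n.pt": 0.18,
    "yolov8m.pt": 0.42,
    "facebook/detr-resnet-50": 0.68,
    "google/owlvit-base-patch32": 1.1,
    "Salesforce/blip-image-captioning-base": 1.6,
}
CACHE_OVERHEAD_GB = 0.002  # ~2 MB per cached model


def estimate_vram_gb(models: list[str]) -> float:
    """Sum peak VRAM plus cache overhead for every cached model."""
    return sum(PEAK_VRAM_GB[name] + CACHE_OVERHEAD_GB for name in models)


def vram_fraction(models: list[str], total_gb: float = 6.0) -> float:
    """Return the estimated share of GPU memory the configuration would use."""
    return estimate_vram_gb(models) / total_gb


if __name__ == "__main__":
    config = ["Salesforce/blip-image-captioning-base"] * 3
    print(f"{estimate_vram_gb(config):.2f} GB, {vram_fraction(config):.0%} of a 6 GB GPU")
```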

Memory-pressure eviction in practice

The eviction sequence ensures the GPU never exceeds safe memory bounds. The LRU policy evicts the least recently used model first, which in practice means rarely used large models (BLIP, OWL-ViT) are evicted before frequently used fast models (YOLOv8n).
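The interaction between LRU ordering and a memory threshold can be illustrated with a toy cache. This is a sketch under stated assumptions, not the actual ModelCache implementation: the class name `LruModelCache`, its methods, and the injected `memory_fraction` callable are hypothetical. On CUDA, such a callable might be derived from `torch.cuda.mem_get_info()`.

```python
from collections import OrderedDict
from typing import Callable, Optional


class LruModelCache:
    """Toy LRU cache that also evicts while memory use exceeds a threshold."""

    def __init__(self, maxsize: int, memory_fraction: Callable[[], float],
                 memory_threshold: float = 0.85) -> None:
        self.maxsize = maxsize
        self.memory_threshold = memory_threshold
        self._memory_fraction = memory_fraction  # returns used/total in [0, 1]
        self._entries: OrderedDict[str, object] = OrderedDict()

    def get(self, key: str) -> Optional[object]:
        if key not in self._entries:
            return None
        # A hit marks the entry as most recently used.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key: str, model: object) -> None:
        self._entries[key] = model
        self._entries.move_to_end(key)
        self._evict()

    def keys(self) -> list[str]:
        """Return keys ordered from least to most recently used."""
        return list(self._entries)

    def _evict(self) -> None:
        # Drop least recently used entries while over the count limit or
        # above the memory threshold, always keeping the newest entry.
        while len(self._entries) > 1 and (
            len(self._entries) > self.maxsize
            or self._memory_fraction() > self.memory_threshold
        ):
            self._entries.popitem(last=False)
```

Injecting the memory reading as a callable keeps the eviction logic testable without a GPU.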

Latency breakdown by request phase

For a warm YOLOv8n.pt request on GPU:

| Phase | Typical duration | Notes |
| --- | --- | --- |
| HTTP parsing + Pydantic validation | 0.5–2ms | Pydantic v2 is fast |
| Cache lookup | < 0.1ms | Dict lookup, O(1) |
| Image decode (JPEG → numpy) | 0.5–3ms | Depends on image size |
| YOLO inference | 2–8ms | Resolution-dependent |
| Result normalization | 0.2–0.5ms | JSON serialization |
| Total | 3–14ms | p50 ≈ 5ms on RTX 3060 |

For a cold-start HuggingFace model (DETR on CPU):

| Phase | Typical duration | Notes |
| --- | --- | --- |
| Registry lookup + handler init | < 1ms | |
| Model download (first run only) | 5–30s | Network-dependent |
| Model load from disk | 2–4s | Transformers deserialization |
| Device placement (CPU) | 0.2s | nn.Module.to() |
| Warmup inference | 0.3s | First call is slower |
| Subsequent requests | 380ms | Warm CPU inference |
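Per-phase figures like those in the two tables above can be collected by wrapping each stage of a request in a timer. The sketch below shows one way to do that with a context manager; `PhaseTimer` and the phase names are hypothetical and not part of the YOLO-Toys codebase.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock duration per named phase, in milliseconds."""

    def __init__(self) -> None:
        self.durations_ms: dict[str, float] = {}

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the phase raised an exception.
            elapsed = (time.perf_counter() - start) * 1000.0
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed

    def total_ms(self) -> float:
        return sum(self.durations_ms.values())


if __name__ == "__main__":
    timer = PhaseTimer()
    with timer.phase("cache_lookup"):
        {"yolov8n.pt": object()}.get("yolov8n.pt")
    with timer.phase("inference"):
        sum(range(100_000))  # stand-in for a real model call
    for name, ms in timer.durations_ms.items():
        print(f"{name}: {ms:.3f} ms")
    print(f"total: {timer.total_ms():.3f} ms")
```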

Running benchmarks locally

YOLO-Toys includes a pytest-benchmark suite:

```bash
# From repository root
cd /path/to/yolo-toys
pytest tests/benchmarks/ --benchmark-only --benchmark-sort=mean
```

The suite covers:

  • Model load time per family (cold-start cost)
  • Inference latency per family (warm-start cost)
  • Cache hit vs miss latency comparison
  • Concurrent request throughput simulation

Released under the MIT License.