YOLO-Toys Whitepaper

YOLO-Toys sits at the intersection of practical serving infrastructure and upstream model research. This page serves two purposes:

Canonical bibliography — citable references for the model families, frameworks, and patterns the runtime depends on
Comparative positioning — situating YOLO-Toys among adjacent open-source systems solving related serving problems

Core bibliography

Area	Primary work	Relevance to YOLO-Toys
Object detection (single-stage)	Redmon et al. (CVPR 2016)	foundational YOLO lineage that dominates the throughput-optimized path
Object detection (anchor-free)	Jocher et al. (Ultralytics, 2023)	the execution model for YOLOv8 anchor-free detection
Detection transformers	Carion et al. (ECCV 2020)	transformer-based detection lineage — DETR handler
Open-vocabulary detection	Minderer et al. (ECCV 2022)	text-conditioned detection support via OWL-ViT handler
Grounded detection	Liu et al. (ECCV 2024)	phrase-grounding support via Grounding DINO handler
Vision-language pretraining	Li et al. (ICML 2022)	captioning and VQA support via BLIP handler
Serving framework	Ramírez (2019); Paszke et al. (NeurIPS 2019)	runtime substrate and model execution environment
Async web framework	Christie (Starlette, 2018)	ASGI foundation powering FastAPI
Configuration patterns	Colucci et al. (2022)	type-safe environment-variable ingestion
Design patterns	Gamma et al. (1994); Fowler (2002)	Strategy, Registry, Adapter patterns in handler topology

Detection algorithms

YOLOv1 — Original YOLO

[1]

Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

The original YOLO paper reframed detection as a single regression problem: divide the image into an S×S grid, predict bounding boxes and class probabilities from each grid cell in one forward pass. This unified formulation traded recall on small objects for dramatically better throughput — a trade-off that remained central to the YOLO lineage for years.
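The grid formulation can be made concrete with a small sketch. It assumes the configuration the original paper reports for PASCAL VOC (S = 7, B = 2 boxes per cell, C = 20 classes); the function names are illustrative and are not part of YOLO-Toys.

```python
def yolo_v1_output_shape(s: int = 7, b: int = 2, c: int = 20) -> tuple[int, int, int]:
    """Shape of the YOLOv1 prediction tensor: S x S x (B * 5 + C).

    Each of the B boxes per cell carries (x, y, w, h, confidence);
    the C class probabilities are shared by the whole cell.
    """
    return (s, s, b * 5 + c)


def responsible_cell(cx: float, cy: float, s: int = 7) -> tuple[int, int]:
    """Grid cell (row, col) containing a box centre given in [0, 1] image coordinates."""
    col = min(int(cx * s), s - 1)  # clamp so cx == 1.0 stays inside the grid
    row = min(int(cy * s), s - 1)
    return (row, col)


print(yolo_v1_output_shape())        # (7, 7, 30)
print(responsible_cell(0.52, 0.10))  # (0, 3)
```

Because every cell shares one set of class probabilities across only B boxes, several small objects whose centres fall in the same cell compete for the same predictions, which is one way to read the recall trade-off described above.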

bibtex

@inproceedings{redmon2016yolo,
  title     = {You Only Look Once: Unified, Real-Time Object Detection},
  author    = {Redmon, Joseph and Divvala, Santosh and Girshick, Ross and Farhadi, Ali},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2016},
  pages     = {779--788}
}

YOLOv8 — Anchor-free YOLO

[2]

Jocher, G., Chaurasia, A., and Qiu, J. Ultralytics YOLOv8. Software release, 2023.

YOLOv8 is the version served by YOLO-Toys. It introduces an anchor-free detection head with decoupled classification and regression branches, C2f backbone blocks (a faster Cross Stage Partial bottleneck built around two convolutions), and native multi-task support for detection, segmentation, and pose in a single architecture family. The YOLOHandler in YOLO-Toys calls ultralytics.YOLO(model_id) directly and delegates all family-specific logic to the Ultralytics library.

bibtex

@software{jocher2023ultralytics,
  title  = {Ultralytics YOLOv8},
  author = {Jocher, Glenn and Chaurasia, Ayush and Qiu, Jing},
  year   = {2023},
  url    = {https://github.com/ultralytics/ultralytics}
}

DETR — End-to-End Object Detection with Transformers

[3]

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. End-to-End Object Detection with Transformers. European Conference on Computer Vision (ECCV), 2020.

DETR eliminates anchors and NMS by treating detection as a set prediction problem. A Transformer encoder-decoder with learned object queries produces a fixed-size set of predictions in one forward pass, matched to ground truth via the Hungarian algorithm during training. DETR's clean formulation influenced the direction of detection research significantly, though its slow convergence (the original paper trains for up to 500 epochs on COCO, against roughly 100 for a default Ultralytics YOLO training run) remains a practical limitation.
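The matching step can be illustrated on a toy example. This sketch substitutes a brute-force search over permutations for the Hungarian algorithm (both return a minimum-cost one-to-one assignment, but brute force is only feasible for tiny sets) and uses an L1 box distance as the only cost term, which simplifies DETR's full matching cost. The boxes are made-up values.

```python
from itertools import permutations


def l1_cost(pred, target):
    """L1 distance between two boxes in (cx, cy, w, h) format."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def match_predictions(preds, targets):
    """Return (pred_index, target_index) pairs minimising the total cost.

    Assumes len(preds) >= len(targets); predictions left unmatched are
    the ones trained towards the "no object" class.
    """
    best_pairs, best_cost = [], float("inf")
    for pred_indices in permutations(range(len(preds)), len(targets)):
        cost = sum(l1_cost(preds[p], targets[t]) for t, p in enumerate(pred_indices))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(pred_indices)]
    return sorted(best_pairs)


preds = [(0.9, 0.9, 0.1, 0.1), (0.2, 0.2, 0.3, 0.3), (0.5, 0.5, 0.2, 0.2)]
targets = [(0.21, 0.19, 0.3, 0.3), (0.88, 0.9, 0.1, 0.12)]
print(match_predictions(preds, targets))  # [(0, 1), (1, 0)]
```

Prediction 2 is left unmatched here, so no duplicate-suppression step such as NMS is needed: each ground-truth box is claimed by exactly one query.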

bibtex

@inproceedings{carion2020detr,
  title     = {End-to-End Object Detection with Transformers},
  author    = {Carion, Nicolas and Massa, Francisco and Synnaeve, Gabriel and Usunier, Nicolas and Kirillov, Alexander and Zagoruyko, Sergey},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2020},
  pages     = {213--229}
}

Vision-language models

OWL-ViT — Open-Vocabulary Detection

[4]

Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., Wang, X., Zhai, X., Kolesnikov, A., and Houlsby, N. Simple Open-Vocabulary Object Detection with Vision Transformers. European Conference on Computer Vision (ECCV), 2022.

OWL-ViT applies contrastive pre-training (CLIP-style) to the detection problem. A Vision Transformer backbone processes the image; text query embeddings are projected into the same space. At inference time, the model produces bounding boxes conditioned on arbitrary text queries — enabling zero-shot detection of novel object categories. YOLO-Toys exposes this through the OWLViTHandler, which accepts text_queries as an InferenceParams field.
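A minimal sketch of text-conditioned labelling follows, assuming region and text-query embeddings already live in a shared space. The three-dimensional vectors, the threshold, and the `label_regions` helper are invented for illustration; the real model uses learned high-dimensional embeddings and its own score calibration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def label_regions(region_embeddings, query_embeddings, text_queries, threshold=0.5):
    """Give each region its best-matching text query, or None below the threshold."""
    labels = []
    for region in region_embeddings:
        scores = [cosine(region, q) for q in query_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(text_queries[best] if scores[best] >= threshold else None)
    return labels


text_queries = ["a cat", "a red bicycle"]
query_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # made-up vectors
region_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]
print(label_regions(region_embeddings, query_embeddings, text_queries))
# ['a cat', 'a red bicycle', None]
```

Because the label set is just the list of query strings, changing what the detector looks for means changing `text_queries`, not retraining a classification head.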

bibtex

@inproceedings{minderer2022owlvit,
  title     = {Simple Open-Vocabulary Object Detection with Vision Transformers},
  author    = {Minderer, Matthias and others},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2022}
}

Grounding DINO — Phrase Grounding

[5]

Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., and Zhang, L. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. European Conference on Computer Vision (ECCV), 2024.

Grounding DINO combines DINO, the DETR-style detector of Zhang et al. (2022) rather than the self-supervised ViT method of the same name, with grounded pre-training to produce an open-set detector that grounds natural language phrases to image regions. Unlike OWL-ViT, it supports phrase-level grounding (detecting objects described by multi-word phrases like "a person wearing a red jacket") rather than class-level conditioning. The GroundingDINOHandler wraps transformers.AutoModelForZeroShotObjectDetection.

bibtex

@inproceedings{liu2023grounding,
  title     = {Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection},
  author    = {Liu, Shilong and others},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2024}
}

BLIP — Bootstrapped Language-Image Pre-Training

[6]

Li, J., Li, D., Xiong, C., and Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. International Conference on Machine Learning (ICML), 2022.

BLIP introduces a bootstrapping framework for vision-language pre-training that uses a captioner and filter to reduce noise in web-crawled image-text pairs. The resulting model supports both understanding (visual question answering) and generation (image captioning) in one architecture. YOLO-Toys implements two BLIP surfaces: BLIPCaptionHandler for generation and BLIPVQAHandler for understanding.

bibtex

@inproceedings{li2022blip,
  title     = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  author    = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2022}
}

Runtime infrastructure

FastAPI

[7]

Ramírez, S. FastAPI: Modern web framework for building APIs with Python. Open source software, 2019.

bibtex

@software{ramirez2019fastapi,
  title  = {FastAPI},
  author = {Ramírez, Sebastián},
  year   = {2019},
  url    = {https://github.com/tiangolo/fastapi}
}

PyTorch

[8]

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., and others. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems (NeurIPS), 2019.

bibtex

@inproceedings{paszke2019pytorch,
  title     = {PyTorch: An Imperative Style, High-Performance Deep Learning Library},
  author    = {Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and others},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2019}
}

Adjacent open-source systems

Project	Design posture	Why compare it	Repository
Triton Inference Server	Maximum throughput, multiple backends	Industry reference for scale-first serving	`triton-inference-server/server`
TorchServe	Worker-per-model, packaging-first	PyTorch's official serving approach	`pytorch/serve`
BentoML	Packaging and deployment ergonomics	MLOps-oriented serving framework	`bentoml/BentoML`
Ultralytics	YOLO-family execution	Upstream library YOLO-Toys depends on	`ultralytics/ultralytics`
vLLM	PagedAttention, continuous batching	Reference for memory-efficient LLM serving	`vllm-project/vllm`
Ray Serve	Actor-based distributed serving	Distributed vision serving reference	`ray-project/ray`
ONNX Runtime	Cross-platform model execution	Potential execution backend for YOLO-Toys	`microsoft/onnxruntime`
TensorRT-LLM	NVIDIA GPU-optimized serving	GPU inference optimization reference	`NVIDIA/TensorRT-LLM`

Comparative reading guide

Triton Inference Server is the industry standard for high-scale model serving: multiple backends (TensorRT, ONNX, PyTorch, TensorFlow), dynamic batching, multi-GPU scheduling, and mature operational tooling. YOLO-Toys is intentionally narrower — it serves only vision models, runs in a single Python process, and optimizes for developer readability over raw throughput. For workloads exceeding hundreds of requests per second, Triton is the right escalation path.

TorchServe uses a worker-per-model architecture with a model-archiver packaging format. Model isolation is stronger but inter-process communication adds overhead. YOLO-Toys favors a single-process shared-cache model where hot models benefit from warm GPU memory without worker handoff.

BentoML is a model packaging and deployment framework with strong MLOps ergonomics. It abstracts artifact management, deployment pipelines, and service definitions into a deployable unit. YOLO-Toys is more opinionated around heterogeneous vision inference — the handler and registry system provides a built-in architecture rather than a blank deployment surface.

vLLM is included despite being LLM-focused because its PagedAttention memory management and continuous batching solve problems structurally similar to YOLO-Toys' cache pressure concerns. The vision serving community lacks an equivalent memory-management innovation; the YOLO-Toys LRU+TTL+memory-pressure cache is a practical approximation.
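The cache policy named above can be sketched as follows. This is not the YOLO-Toys implementation: the `ModelCache` class, its parameters, and the model ids are hypothetical, and a fixed `max_entries` bound stands in for a real memory-pressure signal. The clock is injectable so that expiry can be demonstrated without sleeping.

```python
import time
from collections import OrderedDict


class ModelCache:
    """LRU cache with a per-entry TTL; max_entries is a crude proxy for memory pressure."""

    def __init__(self, max_entries=2, ttl_seconds=300.0, clock=time.monotonic):
        self._entries = OrderedDict()  # model_id -> (last_used, model)
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock

    def get(self, model_id, loader):
        now = self._clock()
        self._evict_expired(now)
        if model_id in self._entries:
            _, model = self._entries.pop(model_id)  # hit: re-inserted below
        else:
            model = loader(model_id)  # miss: load the model
        self._entries[model_id] = (now, model)  # most recently used goes last
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # drop the least recently used
        return model

    def _evict_expired(self, now):
        expired = [k for k, (used, _) in self._entries.items() if now - used > self._ttl]
        for key in expired:
            del self._entries[key]

    def cached_ids(self):
        return list(self._entries)


fake_now = [0.0]
loads = []


def loader(model_id):
    loads.append(model_id)
    return f"model:{model_id}"


cache = ModelCache(max_entries=2, ttl_seconds=10.0, clock=lambda: fake_now[0])
cache.get("yolov8n", loader)
cache.get("detr", loader)
cache.get("yolov8n", loader)  # hit: refreshes recency, no reload
cache.get("owlvit", loader)   # evicts "detr", the least recently used
print(cache.cached_ids())     # ['yolov8n', 'owlvit']
fake_now[0] = 20.0
cache.get("blip", loader)     # both older entries have outlived the TTL
print(cache.cached_ids())     # ['blip']
```

A production cache would also need locking around `get` and a way to release accelerator memory on eviction; both are left out here to keep the policy itself visible.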

RT-DETR (Baidu, 2023) and DINO (Zhang et al., 2022) are notable omissions from the current runtime: RT-DETR targets real-time transformer detection, while DINO is an accuracy-oriented DETR variant. Adding RT-DETR would require a new ModelCategory.HF_RTDETR entry, a handler extending DETRHandler, and registration in the model registry. This is the intended extension path for contributors.
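The three steps can be sketched in isolation. Only the names ModelCategory.HF_RTDETR and DETRHandler come from the text above; the HF_DETR member, the `describe` method, the `RTDETRHandler` class, and the dict-based registry are assumptions standing in for the real YOLO-Toys classes, whose signatures are not shown on this page.

```python
from enum import Enum


class ModelCategory(Enum):
    HF_DETR = "hf_detr"      # assumed name for the existing DETR category
    HF_RTDETR = "hf_rtdetr"  # step 1: the new category entry


class DETRHandler:
    """Stand-in for the existing handler; the real class lives in YOLO-Toys."""

    def describe(self) -> str:
        return "detr"


class RTDETRHandler(DETRHandler):
    """Step 2: a handler extending DETRHandler, overriding only what differs."""

    def describe(self) -> str:
        return "rt-" + super().describe()


# Step 3: registration. A plain dict stands in for the real model registry.
HANDLERS: dict[ModelCategory, type[DETRHandler]] = {
    ModelCategory.HF_DETR: DETRHandler,
    ModelCategory.HF_RTDETR: RTDETRHandler,
}


def handler_for(category: ModelCategory) -> DETRHandler:
    """Look up and instantiate the handler registered for a category."""
    return HANDLERS[category]()


print(handler_for(ModelCategory.HF_RTDETR).describe())  # rt-detr
```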

Design pattern references

The YOLO-Toys architecture draws deliberately on established software engineering patterns:

Pattern	Where it appears in YOLO-Toys	Reference
Strategy / Template Method	`BaseHandler` + family-specific subclasses	Gamma et al., Design Patterns (1994)
Registry	`HandlerRegistry` — category-to-handler mapping	Fowler, Patterns of Enterprise Application Architecture (2002)
Deep Module	`LoadedModel` — hides processor complexity behind `infer()`	Ousterhout, A Philosophy of Software Design (2018)
Adapter	`SettingsModelManagerConfig` — Pydantic settings → protocol	Gamma et al., Design Patterns (1994)
Protocol / Interface Segregation	`ModelManagerConfig`: structural subtyping via PEP 544	Levkivskyi, Lehtosalo, and Langa, PEP 544: Protocols: Structural subtyping (static duck typing) (2017)
Observer / MutationObserver	Dark mode detection in theme system	Browser API; implicit in VitePress theme

The intersection of these patterns is deliberate: YOLO-Toys uses Strategy to localize model-specific behavior, Registry to make dispatch deterministic and introspectable, and Deep Module to present a simple infer() surface regardless of underlying complexity.
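The Adapter and Protocol rows of the table can be illustrated together. In this sketch only the class names ModelManagerConfig and SettingsModelManagerConfig come from the table; the field names, the `Settings` dataclass (standing in for a Pydantic settings object), and the `describe` helper are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@runtime_checkable
class ModelManagerConfig(Protocol):
    """What the model manager needs; the field names here are assumptions."""

    @property
    def max_cached_models(self) -> int: ...

    @property
    def cache_ttl_seconds(self) -> float: ...


@dataclass(frozen=True)
class Settings:
    """Stand-in for a Pydantic settings object populated from the environment."""

    cache_size: int = 3
    cache_ttl: float = 600.0


class SettingsModelManagerConfig:
    """Adapter: exposes Settings through the ModelManagerConfig protocol.

    It does not inherit from the protocol; having matching attributes is enough.
    """

    def __init__(self, settings: Settings) -> None:
        self._settings = settings

    @property
    def max_cached_models(self) -> int:
        return self._settings.cache_size

    @property
    def cache_ttl_seconds(self) -> float:
        return self._settings.cache_ttl


def describe(config: ModelManagerConfig) -> str:
    return f"{config.max_cached_models} models, ttl {config.cache_ttl_seconds:.0f}s"


config = SettingsModelManagerConfig(Settings())
print(isinstance(config, ModelManagerConfig))  # True
print(describe(config))                        # 3 models, ttl 600s
```

The benefit of the structural approach is that tests can pass any object with the two attributes, without importing the settings layer at all.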
