Bibliography and Related Work

YOLO-Toys sits at the intersection of practical serving infrastructure and upstream model research. This page serves two purposes:

  1. Canonical bibliography — citable references for the model families, frameworks, and patterns the runtime depends on
  2. Comparative positioning — situating YOLO-Toys among adjacent open-source systems solving related serving problems

Core bibliography

| Area | Primary work | Relevance to YOLO-Toys |
| --- | --- | --- |
| Object detection (anchor-based) | Redmon et al. (CVPR 2016) | foundational YOLO lineage that dominates the throughput-optimized path |
| Object detection (anchor-free) | Jocher et al. (Ultralytics, 2023) | the execution model for YOLOv8 anchor-free detection |
| Detection transformers | Carion et al. (ECCV 2020) | transformer-based detection lineage behind the DETR handler |
| Open-vocabulary detection | Minderer et al. (ECCV 2022) | text-conditioned detection support via OWL-ViT handler |
| Grounded detection | Liu et al. (ECCV 2024) | phrase-grounding support via Grounding DINO handler |
| Vision-language pretraining | Li et al. (ICML 2022) | captioning and VQA support via BLIP handler |
| Serving framework | Ramírez (2019); Paszke et al. (NeurIPS 2019) | runtime substrate and model execution environment |
| Async web framework | Archer (2018) | ASGI foundation powering FastAPI |
| Configuration patterns | Colucci et al. (2022) | type-safe environment-variable ingestion |
| Design patterns | Gamma et al. (1994) | Strategy, Registry, Adapter patterns in handler topology |

Detection algorithms

YOLOv1 — Original YOLO

[1] Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

The original YOLO paper reframed detection as a single regression problem: divide the image into an S×S grid, predict bounding boxes and class probabilities from each grid cell in one forward pass. This unified formulation traded recall on small objects for dramatically better throughput — a trade-off that remained central to the YOLO lineage for years.
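As a worked example of the grid formulation, the sketch below computes the size of the prediction tensor. The values S=7, B=2, C=20 are the PASCAL VOC settings reported in the paper; the function name is illustrative and not part of YOLO-Toys.

```python
def yolo_v1_output_shape(s: int, b: int, c: int) -> tuple[int, int, int]:
    """Shape of the YOLOv1 prediction tensor.

    Each of the S x S grid cells predicts B boxes (x, y, w, h, confidence)
    plus one shared set of C class probabilities.
    """
    return (s, s, b * 5 + c)


shape = yolo_v1_output_shape(s=7, b=2, c=20)
print(shape)                            # (7, 7, 30)
print(shape[0] * shape[1] * shape[2])   # 1470 values per image
print(shape[0] * shape[1] * 2)          # at most 98 boxes per image
```

The fixed budget of 98 boxes, with one class distribution shared per cell, is one way to see why small or tightly clustered objects were the weak spot of the original design.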

```bibtex
@inproceedings{redmon2016yolo,
  title     = {You Only Look Once: Unified, Real-Time Object Detection},
  author    = {Redmon, Joseph and Divvala, Santosh and Girshick, Ross and Farhadi, Ali},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2016},
  pages     = {779--788}
}
```

YOLOv8 — Anchor-free YOLO

[2] Jocher, G., Chaurasia, A., and Qiu, J. Ultralytics YOLOv8. Software release, 2023.

YOLOv8 is the version served by YOLO-Toys. It introduces an anchor-free detection head with decoupled classification and regression branches, C2f backbone blocks (a faster CSP bottleneck variant built around two convolutions), and native multi-task support for detection, segmentation, and pose in a single architecture family. The YOLOHandler in YOLO-Toys calls ultralytics.YOLO(model_id) directly and delegates all family-specific logic to the Ultralytics library.

```bibtex
@software{jocher2023ultralytics,
  title  = {Ultralytics YOLOv8},
  author = {Jocher, Glenn and Chaurasia, Ayush and Qiu, Jing},
  year   = {2023},
  url    = {https://github.com/ultralytics/ultralytics}
}
```

DETR — End-to-End Object Detection with Transformers

[3] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. End-to-End Object Detection with Transformers. European Conference on Computer Vision (ECCV), 2020.

DETR eliminates anchors and NMS by treating detection as a set prediction problem. A Transformer encoder-decoder with learned object queries produces a fixed-size set of predictions in one forward pass, matched to ground truth via the Hungarian algorithm during training. DETR's clean formulation influenced the direction of detection research significantly, though its slow convergence (the paper's strongest COCO results use a 500-epoch schedule) remains a practical limitation.
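The matching step can be illustrated with a toy example. The sketch below finds the minimum-cost one-to-one assignment by brute force over permutations, which gives the same result as the Hungarian algorithm on small inputs but is not the Hungarian algorithm itself. The cost values are made up; in DETR the cost is built from class probabilities and box distances.

```python
from itertools import permutations


def match_predictions(cost: list[list[float]]) -> list[tuple[int, int]]:
    """Return the min-cost one-to-one (prediction, target) pairs.

    cost[i][j] is the cost of assigning prediction i to ground-truth j.
    Requires at least as many predictions as targets; unmatched predictions
    would be trained towards the "no object" class.
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best_total, best_pairs = float("inf"), []
    # Try every ordered choice of n_gt distinct predictions (brute force).
    for chosen in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(chosen))
        if total < best_total:
            best_total = total
            best_pairs = [(p, g) for g, p in enumerate(chosen)]
    return sorted(best_pairs)


# Three predictions (rows), two ground-truth boxes (columns); made-up costs.
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
print(match_predictions(cost))  # [(0, 1), (1, 0)]
```

Prediction 2 is left unmatched here, which is how a fixed-size prediction set absorbs images with fewer objects than queries.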

```bibtex
@inproceedings{carion2020detr,
  title     = {End-to-End Object Detection with Transformers},
  author    = {Carion, Nicolas and Massa, Francisco and Synnaeve, Gabriel and Usunier, Nicolas and Kirillov, Alexander and Zagoruyko, Sergey},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2020},
  pages     = {213--229}
}
```

Vision-language models

OWL-ViT — Open-Vocabulary Detection

[4] Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., Wang, X., Zhai, X., Kolesnikov, A., and Houlsby, N. Simple Open-Vocabulary Object Detection with Vision Transformers. European Conference on Computer Vision (ECCV), 2022.

OWL-ViT applies contrastive pre-training (CLIP-style) to the detection problem. A Vision Transformer backbone processes the image; text query embeddings are projected into the same space. At inference time, the model produces bounding boxes conditioned on arbitrary text queries — enabling zero-shot detection of novel object categories. YOLO-Toys exposes this through the OWLViTHandler, which accepts text_queries as an InferenceParams field.
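The idea of scoring image regions against text queries in a shared embedding space can be sketched with plain cosine similarity. Everything below is a toy: the three-dimensional vectors, the threshold, and the function names are assumptions for illustration, and the real model learns its embeddings and scoring head rather than using hand-written vectors.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def label_regions(region_embs, query_embs, queries, threshold=0.5):
    """Assign each region its best-matching text query, or None below threshold."""
    labels = []
    for region in region_embs:
        scores = [cosine(region, q) for q in query_embs]
        best = max(range(len(queries)), key=scores.__getitem__)
        labels.append(queries[best] if scores[best] >= threshold else None)
    return labels


queries = ["a cat", "a bicycle"]
query_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
region_embs = [
    [0.9, 0.1, 0.0],   # close to "a cat"
    [0.1, 0.8, 0.1],   # close to "a bicycle"
    [0.0, 0.1, 1.0],   # matches neither query
]
print(label_regions(region_embs, query_embs, queries))
# ['a cat', 'a bicycle', None]
```

Because the label set is just a list of strings embedded at request time, changing the queries changes what the detector looks for without retraining.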

```bibtex
@inproceedings{minderer2022owlvit,
  title     = {Simple Open-Vocabulary Object Detection with Vision Transformers},
  author    = {Minderer, Matthias and others},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2022}
}
```

Grounding DINO — Phrase Grounding

[5] Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., and Zhang, L. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. European Conference on Computer Vision (ECCV), 2024.

Grounding DINO fuses DINO (the DETR-style detector of Zhang et al., 2022, not the self-supervised ViT method of the same name) with grounded pre-training to produce an open-set detector that grounds natural language phrases to regions. Unlike OWL-ViT, it supports phrase-level grounding (detecting objects described by multi-word phrases like "a person wearing a red jacket") rather than class-level conditioning. The GroundingDINOHandler wraps transformers.AutoModelForZeroShotObjectDetection.

```bibtex
@inproceedings{liu2023grounding,
  title     = {Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection},
  author    = {Liu, Shilong and others},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2024}
}
```

BLIP — Bootstrapped Language-Image Pre-Training

[6] Li, J., Li, D., Xiong, C., and Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. International Conference on Machine Learning (ICML), 2022.

BLIP introduces a bootstrapping framework for vision-language pre-training that uses a captioner and filter to reduce noise in web-crawled image-text pairs. The resulting model supports both understanding (visual question answering) and generation (image captioning) in one architecture. YOLO-Toys implements two BLIP surfaces: BLIPCaptionHandler for generation and BLIPVQAHandler for understanding.

```bibtex
@inproceedings{li2022blip,
  title     = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  author    = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2022}
}
```

Runtime infrastructure

FastAPI

[7] Ramírez, S. FastAPI: Modern web framework for building APIs with Python. Open source software, 2019.

```bibtex
@software{ramirez2019fastapi,
  title  = {FastAPI},
  author = {Ramírez, Sebastián},
  year   = {2019},
  url    = {https://github.com/tiangolo/fastapi}
}
```

PyTorch

[8] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., and others. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems (NeurIPS), 2019.

```bibtex
@inproceedings{paszke2019pytorch,
  title     = {PyTorch: An Imperative Style, High-Performance Deep Learning Library},
  author    = {Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and others},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2019}
}
```

Adjacent open-source systems

| Project | Design posture | Why compare it | Repository |
| --- | --- | --- | --- |
| Triton Inference Server | Maximum throughput, multiple backends | Industry reference for scale-first serving | triton-inference-server/server |
| TorchServe | Worker-per-model, packaging-first | PyTorch's official serving approach | pytorch/serve |
| BentoML | Packaging and deployment ergonomics | MLOps-oriented serving framework | bentoml/BentoML |
| Ultralytics | YOLO-family execution | Upstream library YOLO-Toys depends on | ultralytics/ultralytics |
| vLLM | PagedAttention, continuous batching | Reference for memory-efficient LLM serving | vllm-project/vllm |
| Ray Serve | Actor-based distributed serving | Distributed vision serving reference | ray-project/ray |
| ONNX Runtime | Cross-platform model execution | Potential execution backend for YOLO-Toys | microsoft/onnxruntime |
| TensorRT-LLM | NVIDIA GPU-optimized serving | GPU inference optimization reference | NVIDIA/TensorRT-LLM |

Comparative reading guide

Triton Inference Server is the industry standard for high-scale model serving: multiple backends (TensorRT, ONNX, PyTorch, TensorFlow), dynamic batching, multi-GPU scheduling, and mature operational tooling. YOLO-Toys is intentionally narrower — it serves only vision models, runs in a single Python process, and optimizes for developer readability over raw throughput. For workloads exceeding hundreds of requests per second, Triton is the right escalation path.

TorchServe uses a worker-per-model architecture with a model-archiver packaging format. Model isolation is stronger but inter-process communication adds overhead. YOLO-Toys favors a single-process shared-cache model where hot models benefit from warm GPU memory without worker handoff.

BentoML is a model packaging and deployment framework with strong MLOps ergonomics. It abstracts artifact management, deployment pipelines, and service definitions into a deployable unit. YOLO-Toys is more opinionated around heterogeneous vision inference — the handler and registry system provides a built-in architecture rather than a blank deployment surface.

vLLM is included despite being LLM-focused because its PagedAttention memory management and continuous batching solve problems structurally similar to YOLO-Toys' cache pressure concerns. Vision serving has no widely adopted equivalent of that memory-management approach; the YOLO-Toys cache, which combines LRU eviction, TTL expiry, and memory-pressure eviction, is a practical approximation.
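A cache that combines these three eviction rules can be sketched in a few lines. This is a minimal illustration, not the YOLO-Toys implementation: the class name, method names, and the fixed byte budget standing in for real GPU memory pressure are all assumptions, and the clock is injectable only so the example is deterministic.

```python
import time
from collections import OrderedDict


class ModelCache:
    """Toy cache with LRU order, TTL expiry, and a byte budget (hypothetical API)."""

    def __init__(self, max_bytes: int, ttl_seconds: float, clock=time.monotonic):
        self.max_bytes = max_bytes
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: OrderedDict = OrderedDict()  # key -> (model, size, last_used)

    def _used_bytes(self) -> int:
        return sum(size for _, size, _ in self._entries.values())

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        model, size, last_used = entry
        if self.clock() - last_used > self.ttl_seconds:
            del self._entries[key]          # TTL: idle for too long
            return None
        self._entries[key] = (model, size, self.clock())
        self._entries.move_to_end(key)      # LRU: mark as most recently used
        return model

    def put(self, key, model, size: int) -> None:
        self._entries.pop(key, None)
        # Memory pressure: evict least recently used entries until the new one fits.
        while self._entries and self._used_bytes() + size > self.max_bytes:
            self._entries.popitem(last=False)
        self._entries[key] = (model, size, self.clock())

    def keys(self):
        return list(self._entries)


now = [0.0]  # fake clock so the example does not depend on wall time
cache = ModelCache(max_bytes=100, ttl_seconds=60, clock=lambda: now[0])
cache.put("yolov8n", "model-a", size=40)
cache.put("detr", "model-b", size=40)
cache.get("yolov8n")                     # touch: "detr" is now least recently used
cache.put("owlvit", "model-c", size=40)  # over budget, so "detr" is evicted
print(cache.keys())                      # ['yolov8n', 'owlvit']
now[0] = 120.0
print(cache.get("yolov8n"))              # None: idle longer than the 60 s TTL
```

A production cache would measure actual device memory instead of trusting a declared size, and would release model resources explicitly on eviction.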

RT-DETR (Baidu, 2023) and DINO (Zhang et al., 2022) are notable omissions from the current runtime. Both extend the DETR lineage: DINO improves training and accuracy of DETR-style detectors, and RT-DETR targets real-time transformer detection. Adding RT-DETR would require three steps: a new ModelCategory.HF_RTDETR entry, a handler extending DETRHandler, and registration in the model registry. This is the intended extension path for contributors.
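The three extension steps can be sketched as below. Only the names ModelCategory, HF_RTDETR, BaseHandler, DETRHandler, and HandlerRegistry come from this page; the HF_DETR member, the method names, their signatures, and the return values are assumptions made so the example runs on its own, and the real YOLO-Toys classes will differ.

```python
from enum import Enum


class ModelCategory(Enum):
    HF_DETR = "hf_detr"        # assumed existing entry
    HF_RTDETR = "hf_rtdetr"    # step 1: new category entry


class BaseHandler:
    def infer(self, image, params: dict) -> dict:
        raise NotImplementedError


class DETRHandler(BaseHandler):
    model_family = "detr"

    def infer(self, image, params: dict) -> dict:
        # Placeholder for real preprocessing, forward pass, and postprocessing.
        return {"family": self.model_family, "detections": []}


class RTDETRHandler(DETRHandler):
    # Step 2: reuse the DETR flow and override only what differs.
    model_family = "rt-detr"


class HandlerRegistry:
    def __init__(self):
        self._handlers: dict[ModelCategory, type[BaseHandler]] = {}

    def register(self, category: ModelCategory, handler_cls: type[BaseHandler]) -> None:
        self._handlers[category] = handler_cls

    def create(self, category: ModelCategory) -> BaseHandler:
        return self._handlers[category]()

    def categories(self) -> list[str]:
        return sorted(c.name for c in self._handlers)


registry = HandlerRegistry()
registry.register(ModelCategory.HF_DETR, DETRHandler)
registry.register(ModelCategory.HF_RTDETR, RTDETRHandler)  # step 3: registration

handler = registry.create(ModelCategory.HF_RTDETR)
print(handler.infer(image=None, params={}))  # {'family': 'rt-detr', 'detections': []}
print(registry.categories())                 # ['HF_DETR', 'HF_RTDETR']
```

Keeping dispatch in one dictionary means a new model family touches three small places and leaves existing handlers unchanged.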


Design pattern references

The YOLO-Toys architecture draws deliberately on established software engineering patterns:

| Pattern | Where it appears in YOLO-Toys | Reference |
| --- | --- | --- |
| Strategy / Template Method | BaseHandler + family-specific subclasses | Gamma et al., Design Patterns (1994) |
| Registry | HandlerRegistry: category-to-handler mapping | Fowler, Patterns of Enterprise Application Architecture (2002) |
| Deep Module | LoadedModel: hides processor complexity behind infer() | Ousterhout, A Philosophy of Software Design (2018) |
| Adapter | SettingsModelManagerConfig: adapts Pydantic settings to the protocol | Gamma et al., Design Patterns (1994) |
| Protocol / Interface Segregation | ModelManagerConfig: structural subtyping via PEP 544 | Levkivskyi, Lehtosalo, and Langa, PEP 544: Protocols: Structural subtyping (static duck typing) (2017) |
| Observer / MutationObserver | Dark mode detection in theme system | Browser API; implicit in VitePress theme |

The intersection of these patterns is deliberate: YOLO-Toys uses Strategy to localize model-specific behavior, Registry to make dispatch deterministic and introspectable, and Deep Module to present a simple infer() surface regardless of underlying complexity.
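The Adapter and Protocol rows of the table can be illustrated with a small sketch. The class names ModelManagerConfig and SettingsModelManagerConfig come from the table; the attribute names and defaults are assumptions, and a plain dataclass stands in for the Pydantic settings object so the example has no third-party dependencies.

```python
from dataclasses import dataclass
from typing import Protocol


class ModelManagerConfig(Protocol):
    """What the model manager needs to read; member names are assumptions."""

    @property
    def max_cached_models(self) -> int: ...

    @property
    def cache_ttl_seconds(self) -> float: ...


@dataclass
class Settings:
    """Stand-in for a Pydantic settings object populated from the environment."""

    MAX_CACHED_MODELS: int = 3
    CACHE_TTL_SECONDS: float = 600.0


class SettingsModelManagerConfig:
    """Adapter: exposes Settings through the ModelManagerConfig shape."""

    def __init__(self, settings: Settings):
        self._settings = settings

    @property
    def max_cached_models(self) -> int:
        return self._settings.MAX_CACHED_MODELS

    @property
    def cache_ttl_seconds(self) -> float:
        return self._settings.CACHE_TTL_SECONDS


def describe(config: ModelManagerConfig) -> str:
    # Accepts any object with the right members; no inheritance required.
    return f"{config.max_cached_models} models, ttl={config.cache_ttl_seconds}s"


print(describe(SettingsModelManagerConfig(Settings())))  # 3 models, ttl=600.0s
```

Because the adapter never subclasses the protocol, tests can pass any object with the same two members, which keeps the model manager independent of how configuration is loaded.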


Released under the MIT License.