Model Matrix

Complete specifications for all supported model families. Use this reference to select the right model for your task, understand the performance trade-offs, and configure inference parameters correctly.

Selection principle
YOLOv8 for throughput-critical applications; DETR for dense or complex scenes; OWL-ViT / Grounding DINO for open-vocabulary tasks; BLIP for language-grounded understanding.

Quick comparison

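| Family | Paradigm | GPU warm latency | COCO mAP | Open-vocab | Notes |
|---|---|---|---|---|---|
| YOLOv8n | Anchor-free | 4 ms | 37.3 | No | Best throughput |
| YOLOv8s | Anchor-free | 6 ms | 44.9 | No | Balanced |
| YOLOv8m | Anchor-free | 12 ms | 50.2 | No | High accuracy |
| YOLOv8l | Anchor-free | 18 ms | 52.9 | No | Maximum accuracy |
| DETR (ResNet-50) | Transformer | 90 ms | ~42 | No | Dense scenes |
| OWL-ViT (base-patch32) | VLM | 110 ms | | Yes | Novel classes |
| Grounding DINO | VLM | 130 ms | | Yes | Phrase grounding |
| BLIP-Caption | VLM | 70 ms | | Yes | Image captioning |
| BLIP-VQA | VLM | 70 ms | | Yes | Visual QA |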
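The selection principle can be turned into a small lookup over the table. The sketch below is illustrative only: `ModelSpec`, `SPECS`, and `pick_model` are hypothetical names, not part of YOLO-Toys, and the figures are copied from the table above (DETR's ~42 is taken as 42.0). The BLIP rows are left out because they caption or answer questions rather than detect objects.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelSpec:
    name: str
    gpu_warm_ms: int
    coco_map: Optional[float]  # None where the table lists no mAP
    open_vocab: bool


# Detection rows from the quick-comparison table.
SPECS: List[ModelSpec] = [
    ModelSpec("YOLOv8n", 4, 37.3, False),
    ModelSpec("YOLOv8s", 6, 44.9, False),
    ModelSpec("YOLOv8m", 12, 50.2, False),
    ModelSpec("YOLOv8l", 18, 52.9, False),
    ModelSpec("DETR (ResNet-50)", 90, 42.0, False),
    ModelSpec("OWL-ViT (base-patch32)", 110, None, True),
    ModelSpec("Grounding DINO", 130, None, True),
]


def pick_model(latency_budget_ms: int, need_open_vocab: bool = False) -> Optional[ModelSpec]:
    """Return a model that fits the latency budget, or None if nothing fits."""
    candidates = [
        s for s in SPECS
        if s.gpu_warm_ms <= latency_budget_ms and s.open_vocab == need_open_vocab
    ]
    if not candidates:
        return None
    if need_open_vocab:
        # No mAP is listed for the open-vocabulary rows, so prefer the fastest.
        return min(candidates, key=lambda s: s.gpu_warm_ms)
    # Closed-vocabulary detectors: prefer the highest listed COCO mAP.
    return max(candidates, key=lambda s: s.coco_map or 0.0)
```

This sketch ranks only on latency and mAP; scene density, which is the stated reason to prefer DETR, is not modelled.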

YOLO family

YOLO-Toys serves the YOLOv8 family via the YOLOHandler. Model files must be in Ultralytics .pt format. The handler automatically infers the task type (detection, segmentation, or pose) from the model weights.

Detection models

bash
yolov8n.pt   # Nano   - 3.2M params, fastest
yolov8s.pt   # Small  - 11.2M params, balanced
yolov8m.pt   # Medium - 25.9M params, higher accuracy
yolov8l.pt   # Large  - 43.7M params, maximum accuracy
yolov8x.pt   # XLarge - 68.2M params, research-grade
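
| Model | Parameters | Disk size | COCO mAP (val2017) | GPU warm (p50) |
|---|---|---|---|---|
| yolov8n | 3.2M | 6.2 MB | 37.3 | 4 ms |
| yolov8s | 11.2M | 21.5 MB | 44.9 | 6 ms |
| yolov8m | 25.9M | 49.7 MB | 50.2 | 12 ms |
| yolov8l | 43.7M | 83.7 MB | 52.9 | 18 ms |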

Trained on COCO 2017, 80 classes: person, bicycle, car, motorcycle, airplane, bus, train, truck, boat, traffic light, fire hydrant, stop sign, parking meter, bench, bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe, backpack, umbrella, handbag, tie, suitcase, frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket, bottle, wine glass, cup, fork, knife, spoon, bowl, banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake, chair, couch, potted plant, bed, dining table, toilet, tv, laptop, mouse, remote, keyboard, cell phone, microwave, oven, toaster, sink, refrigerator, book, clock, vase, scissors, teddy bear, hair drier, toothbrush.

Segmentation models

bash
yolov8n-seg.pt   # Nano segmentation
yolov8s-seg.pt   # Small segmentation
yolov8m-seg.pt   # Medium segmentation

Returns: bounding boxes + pixel-level segmentation masks. The task field in the response is "segment".

Pose models

bash
yolov8n-pose.pt   # Nano pose
yolov8s-pose.pt   # Small pose

Returns: bounding boxes + 17 COCO keypoints (nose, left/right eye, left/right ear, left/right shoulder, left/right elbow, left/right wrist, left/right hip, left/right knee, left/right ankle). The task field is "pose".
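To make the keypoint order concrete, the following sketch attaches names to a 17-point result. It assumes the keypoints arrive as 17 `(x, y)` pairs in the standard COCO order; the actual response schema is not specified here, and `name_keypoints` is a hypothetical helper.

```python
from typing import Dict, Sequence, Tuple

# Standard COCO keypoint order (index 0 to 16).
COCO_KEYPOINT_NAMES = (
    "nose",
    "left_eye", "right_eye",
    "left_ear", "right_ear",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
)


def name_keypoints(points: Sequence[Tuple[float, float]]) -> Dict[str, Tuple[float, float]]:
    """Map 17 (x, y) pairs to their COCO keypoint names."""
    if len(points) != len(COCO_KEYPOINT_NAMES):
        raise ValueError(
            f"expected {len(COCO_KEYPOINT_NAMES)} keypoints, got {len(points)}"
        )
    return {
        name: (float(x), float(y))
        for name, (x, y) in zip(COCO_KEYPOINT_NAMES, points)
    }
```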

Default inference parameters

| Parameter | Default | Range | Notes |
|---|---|---|---|
| conf | 0.25 | [0.0, 1.0] | Minimum confidence threshold |
| iou | 0.45 | [0.0, 1.0] | NMS IoU threshold |
| max_det | 300 | [1, 1000] | Maximum detections per image |
| imgsz | model default | int | Input image size override |
| half | false | bool | FP16 inference (CUDA only) |
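A client can check these ranges before sending a request. The sketch below is one possible way to do that; `InferenceParams` is a hypothetical class, not the project's own validation code, and treating `imgsz=None` as "use the model default" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InferenceParams:
    conf: float = 0.25            # minimum confidence threshold
    iou: float = 0.45             # NMS IoU threshold
    max_det: int = 300            # maximum detections per image
    imgsz: Optional[int] = None   # None means "use the model default" (assumption)
    half: bool = False            # FP16 inference, CUDA only

    def __post_init__(self) -> None:
        if not 0.0 <= self.conf <= 1.0:
            raise ValueError(f"conf must be in [0.0, 1.0], got {self.conf}")
        if not 0.0 <= self.iou <= 1.0:
            raise ValueError(f"iou must be in [0.0, 1.0], got {self.iou}")
        if not 1 <= self.max_det <= 1000:
            raise ValueError(f"max_det must be in [1, 1000], got {self.max_det}")
        if self.imgsz is not None and self.imgsz <= 0:
            raise ValueError(f"imgsz must be a positive int, got {self.imgsz}")
```

Constructing `InferenceParams()` with no arguments yields the defaults from the table; out-of-range values raise `ValueError` at construction time.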

HuggingFace models

These models are loaded from the HuggingFace Hub on first use. Loading requires network access and takes 2–10 seconds on first request. Subsequent requests use the warm cached model.
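The cold-then-warm behaviour can be illustrated with a memoized loader. This is a minimal sketch, not the project's actual cache: `make_cached_loader` and `fake_load` are hypothetical, and `fake_load` stands in for whatever call really downloads and builds the model.

```python
from functools import lru_cache
from typing import Callable, List


def make_cached_loader(load_fn: Callable[[str], object]) -> Callable[[str], object]:
    """Wrap a slow loader so each model ID is loaded at most once per process."""

    @lru_cache(maxsize=None)
    def get_model(model_id: str) -> object:
        return load_fn(model_id)

    return get_model


# Stand-in loader that records calls instead of contacting the Hub.
load_calls: List[str] = []


def fake_load(model_id: str) -> dict:
    load_calls.append(model_id)
    return {"id": model_id}


get_model = make_cached_loader(fake_load)
first = get_model("facebook/detr-resnet-50")   # cold: runs fake_load
second = get_model("facebook/detr-resnet-50")  # warm: served from the cache
```

After both calls, `load_calls` holds a single entry and `first` and `second` are the same object, which is the property the warm path relies on.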

DETR — Detection Transformer

bash
facebook/detr-resnet-50         # Standard DETR, ResNet-50 backbone
facebook/detr-resnet-101        # DETR with ResNet-101 backbone (higher accuracy)
facebook/detr-resnet-50-panoptic  # Panoptic segmentation variant

DETR uses a transformer encoder-decoder with learned object queries, so it needs neither anchors nor NMS. It is particularly strong on dense scenes and unusual aspect ratios. It converges slowly during training, but its inference semantics are clean.

| Property | Value |
|---|---|
| Handler | DETRHandler |
| Category | ModelCategory.HF_DETR |
| GPU warm latency | ~90 ms |
| CPU warm latency | ~380 ms |
| Input preprocessing | PIL image, ImageProcessor normalization |

json
{ "conf": 0.5, "max_det": 100 }

OWL-ViT — Open-Vocabulary Detection

bash
google/owlvit-base-patch32      # Base model, patch size 32
google/owlvit-large-patch14     # Large model, patch size 14 (higher accuracy)

Text-conditioned detection using contrastive pre-training. Provide text_queries in your request to detect custom object categories without retraining.

| Property | Value |
|---|---|
| Handler | OWLViTHandler |
| Category | ModelCategory.HF_OWLVIT |
| GPU warm latency | ~110 ms |
| Required parameter | text_queries: ["a cat", "a dog"] |
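Because `text_queries` is required, it can help to build and check the request body in one place. The sketch below assumes a JSON body with a `model` field alongside `text_queries`; the `model` field name and the `build_owlvit_request` helper are illustrative assumptions, not a documented request schema.

```python
import json
from typing import List


def build_owlvit_request(model_id: str, text_queries: List[str]) -> str:
    """Serialize an open-vocabulary detection request, rejecting empty queries."""
    queries = [q.strip() for q in text_queries if q.strip()]
    if not queries:
        raise ValueError("text_queries must contain at least one non-empty string")
    # "model" is a hypothetical field name; "text_queries" comes from the table above.
    return json.dumps({"model": model_id, "text_queries": queries})


body = build_owlvit_request("google/owlvit-base-patch32", ["a cat", "a dog"])
```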

Grounding DINO — Phrase Grounding

bash
IDEA-Research/grounding-dino-tiny    # Tiny variant
IDEA-Research/grounding-dino-base    # Base variant

Open-set detection with natural language phrase grounding. More expressive than OWL-ViT for complex descriptions.

| Property | Value |
|---|---|
| Handler | GroundingDINOHandler |
| Category | ModelCategory.HF_GROUNDING |
| GPU warm latency | ~130 ms |
| Required parameter | text_queries: ["person wearing a red jacket"] |
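The Hugging Face Transformers documentation for Grounding DINO notes that text prompts should be lowercased and that each phrase should end with a period. Whether GroundingDINOHandler applies this normalization internally is not stated here, so the helper below is only a sketch of what such a step could look like; `to_grounding_prompt` is a hypothetical name.

```python
from typing import List


def to_grounding_prompt(text_queries: List[str]) -> str:
    """Join phrases into one lowercase prompt where every phrase ends with a period."""
    # Lowercase each phrase and drop trailing periods or spaces before re-adding one.
    phrases = [q.strip().lower().rstrip(". ") for q in text_queries]
    phrases = [p for p in phrases if p]
    if not phrases:
        raise ValueError("text_queries must contain at least one non-empty phrase")
    return " ".join(f"{p}." for p in phrases)


prompt = to_grounding_prompt(["Person wearing a red jacket", "a Dog."])
```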

BLIP — Image Captioning and VQA

bash
Salesforce/blip-image-captioning-base    # Image captioning
Salesforce/blip-image-captioning-large   # Larger captioning model
Salesforce/blip-vqa-base                 # Visual question answering

BLIP is a unified vision-language model that supports both generation (captioning) and understanding (VQA). The route determines the behavior: /caption uses BLIPCaptionHandler, and /vqa uses BLIPVQAHandler.

| Property | Value |
|---|---|
| Caption handler | BLIPCaptionHandler |
| VQA handler | BLIPVQAHandler |
| GPU warm latency | ~70 ms |
| VQA parameter | question: "What is in the image?" |
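The route-to-handler mapping can be sketched as a plain dictionary lookup. The handler names and routes come from the text above; `resolve_blip_handler` is hypothetical, and treating `question` as mandatory on /vqa is an assumption rather than documented behaviour.

```python
from typing import Dict, Mapping

# Route to handler name, as described for the BLIP models.
ROUTE_TO_HANDLER: Dict[str, str] = {
    "/caption": "BLIPCaptionHandler",
    "/vqa": "BLIPVQAHandler",
}


def resolve_blip_handler(route: str, params: Mapping[str, object]) -> str:
    """Return the handler name for a BLIP route, checking the VQA question."""
    handler = ROUTE_TO_HANDLER.get(route)
    if handler is None:
        raise ValueError(f"unsupported BLIP route: {route!r}")
    # Assumption: a VQA request without a question is rejected.
    if route == "/vqa" and not str(params.get("question", "")).strip():
        raise ValueError("the /vqa route needs a non-empty 'question'")
    return handler
```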

Model ID inference rules

YOLO-Toys infers the correct handler from the model ID through a cascading resolution strategy:

  1. Exact registry match: if the model ID appears in MODEL_REGISTRY, use the registered category
  2. File extension heuristic: .pt files → ModelCategory.YOLO_* (with seg/pose sub-variants from filename)
  3. Keyword matching: detr → HF_DETR, owlvit → HF_OWLVIT, blip-image-captioning → HF_BLIP_CAPTION, blip-vqa → HF_BLIP_VQA, grounding or dino → HF_GROUNDING
  4. HuggingFace path fallback: any ID containing / not matched above → HF_DETR

This means common models work without explicit registration. Novel architectures require extending ModelCategory and _CATEGORY_HANDLER_MAP.
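The four-step cascade can be sketched as a single function. This is an illustration of the rules as listed, not the project's source: the `HF_*` member names come from the text, while `YOLO_DETECT`, `YOLO_SEGMENT`, `YOLO_POSE`, the `-seg`/`-pose` filename checks, case-insensitive matching, and the `ValueError` for unmatched IDs are all assumptions.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class ModelCategory(Enum):
    YOLO_DETECT = "yolo_detect"    # hypothetical sub-variant name
    YOLO_SEGMENT = "yolo_segment"  # hypothetical sub-variant name
    YOLO_POSE = "yolo_pose"        # hypothetical sub-variant name
    HF_DETR = "hf_detr"
    HF_OWLVIT = "hf_owlvit"
    HF_GROUNDING = "hf_grounding"
    HF_BLIP_CAPTION = "hf_blip_caption"
    HF_BLIP_VQA = "hf_blip_vqa"


# Step 3 rules, checked in the order given in the list above.
KEYWORD_RULES: Tuple[Tuple[Tuple[str, ...], ModelCategory], ...] = (
    (("detr",), ModelCategory.HF_DETR),
    (("owlvit",), ModelCategory.HF_OWLVIT),
    (("blip-image-captioning",), ModelCategory.HF_BLIP_CAPTION),
    (("blip-vqa",), ModelCategory.HF_BLIP_VQA),
    (("grounding", "dino"), ModelCategory.HF_GROUNDING),
)


def resolve_category(
    model_id: str,
    registry: Optional[Dict[str, ModelCategory]] = None,
) -> ModelCategory:
    """Resolve a model ID to a category using the cascading rules."""
    # 1. Exact registry match.
    if registry and model_id in registry:
        return registry[model_id]

    lowered = model_id.lower()

    # 2. File extension heuristic, with seg/pose taken from the filename.
    if lowered.endswith(".pt"):
        if "-seg" in lowered:
            return ModelCategory.YOLO_SEGMENT
        if "-pose" in lowered:
            return ModelCategory.YOLO_POSE
        return ModelCategory.YOLO_DETECT

    # 3. Keyword matching.
    for keywords, category in KEYWORD_RULES:
        if any(keyword in lowered for keyword in keywords):
            return category

    # 4. HuggingFace path fallback.
    if "/" in model_id:
        return ModelCategory.HF_DETR

    raise ValueError(f"cannot infer a handler category for {model_id!r}")
```

Note that step 4 routes any unrecognized `org/name` ID to `HF_DETR`, so an unsupported architecture would fail at load time rather than at resolution time.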


Released under the MIT License.