Model Loading Flow
The model loading subsystem is the heart of YOLO-Toys' operational behavior. It combines lazy loading, TTL-based freshness, and LRU eviction under memory pressure to provide predictable resource usage while maximizing warm-model reuse.
The Core Problem
Serving multiple model families creates a fundamental tension:
- Loading models is expensive — YOLOv8x takes 2-4 seconds to load on a modern GPU
- Memory is finite — A single YOLOv8x can consume 200MB+ of GPU memory
- Workloads are bursty — Users don't request models uniformly
The solution must balance:
- Latency: Minimize cold-start penalties
- Memory: Stay within hardware limits
- Freshness: Respect model update cycles
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│                        ModelManager                         │
│  ┌─────────────────┐ ┌─────────────────┐ ┌───────────────┐  │
│  │ Security Check  │→│  Cache Lookup   │→│ Handler Select│  │
│  └─────────────────┘ └─────────────────┘ └───────────────┘  │
│                               ↓                             │
│                      ┌─────────────────┐                    │
│                      │   ModelCache    │                    │
│                      │   (TTL + LRU)   │                    │
│                      └─────────────────┘                    │
└─────────────────────────────────────────────────────────────┘

Cache Strategy: TTL + LRU Hybrid
Why Both?
| Strategy | What it Guarantees | What it Misses |
|---|---|---|
| TTL only | Freshness (models expire) | No memory pressure handling |
| LRU only | Memory bounds | Stale models persist forever |
| TTL + LRU | Both freshness and bounds | Complexity |
The hybrid approach bounds both staleness (TTL) and cache size (LRU), at the cost of extra bookkeeping.
TTL (Time-To-Live)
# Default: 3600 seconds (1 hour)
MODEL_CACHE_TTL = int(os.getenv("MODEL_CACHE_TTL", "3600"))

When a model's TTL expires:
- The model is marked as stale
- Next access triggers a reload
- The old model is evicted from cache
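The expiry steps above can be sketched as follows. This is a minimal illustration and not the project's `ModelCache`: the `TTLCache` class, its `loader` callback, and the injectable `clock` are hypothetical names introduced for the example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Hypothetical TTL-only cache: entries older than `ttl` seconds are reloaded."""

    def __init__(
        self,
        ttl: float,
        loader: Callable[[str], Any],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl
        self._loader = loader
        self._clock = clock
        # key -> (model, time at which it was loaded)
        self._entries: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None:
            model, loaded_at = entry
            if now - loaded_at < self._ttl:
                return model  # still fresh: reuse the warm model
            del self._entries[key]  # stale: evict before reloading
        model = self._loader(key)
        self._entries[key] = (model, now)
        return model
```

Passing the clock in as a parameter keeps the expiry logic testable without sleeping.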
LRU (Least Recently Used)
# Default: 10 models max
MODEL_CACHE_MAXSIZE = int(os.getenv("MODEL_CACHE_MAXSIZE", "10"))

When the cache is full and a new model is requested:
- Find the least recently accessed model
- Evict it (freeing GPU memory)
- Load the new model
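A minimal sketch of this eviction order, assuming an `OrderedDict` tracks recency. The `LRUCache` class and its `loader` callback are hypothetical; a real implementation would also release GPU memory when an entry is dropped.

```python
from collections import OrderedDict
from typing import Any, Callable, List


class LRUCache:
    """Hypothetical size-bounded cache; assumes maxsize >= 1."""

    def __init__(self, maxsize: int, loader: Callable[[str], Any]) -> None:
        self._maxsize = maxsize
        self._loader = loader
        # Ordered from least recently used (front) to most recently used (back)
        self._entries: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str) -> Any:
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        if len(self._entries) >= self._maxsize:
            self._entries.popitem(last=False)  # drop the least recently used entry
        model = self._loader(key)
        self._entries[key] = model
        return model

    def keys(self) -> List[str]:
        """Cached keys, least recently used first."""
        return list(self._entries)
```

With `maxsize=2`, requesting `a`, `b`, `a`, `c` in that order evicts `b`, because the second request for `a` refreshed its position.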
Memory-Aware Eviction
# Default: 85% of system memory
MODEL_MEMORY_THRESHOLD = float(os.getenv("MODEL_MEMORY_THRESHOLD", "0.85"))

Even when the cache isn't full, system memory usage above the threshold causes the cache to:
- Trigger proactive LRU eviction
- Clear CUDA cache to reclaim GPU memory
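A minimal sketch of how the threshold could drive eviction. `evict_until_below` and its two callbacks are hypothetical names, and whether the real cache evicts a single model or keeps going until usage drops is an assumption here (the sketch loops). In practice `memory_fraction` could be backed by something like `psutil.virtual_memory().percent / 100`.

```python
from typing import Callable


def evict_until_below(
    threshold: float,
    memory_fraction: Callable[[], float],
    evict_one: Callable[[], bool],
) -> int:
    """Evict entries while memory usage stays above `threshold`.

    `memory_fraction` returns current usage in [0, 1]; `evict_one` evicts one
    entry and returns False when nothing is left to evict. Both are
    hypothetical hooks. Returns the number of evictions performed.
    """
    evicted = 0
    while memory_fraction() > threshold:
        if not evict_one():
            break  # cache is empty; nothing more to free
        evicted += 1
    return evicted
```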
def _evict_lru_unsafe(self) -> None:
    """Evict least recently used model and clear CUDA cache."""
    lru_key = min(self._access_times, key=self._access_times.get)
    del self[lru_key]
    torch.cuda.empty_cache()  # Critical for GPU memory

Security Boundary
Before any model load, the request passes through security validation:
# Path traversal protection
# decoded_id is assumed to hold the URL-decoded form of model_id
forbidden_patterns = ["../", "..\\", "/", "\\", "\x00"]
for pattern in forbidden_patterns:
    if pattern in model_id or pattern in decoded_id:
        raise ValueError("Invalid model ID: forbidden pattern")

This prevents:
- Directory traversal attacks (`../../../etc/passwd`)
- URL-encoded bypasses (`%2e%2e%2f`)
- Null byte injection (`model.pt%00.exe`)
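A self-contained sketch of the check above, assuming the decoded form comes from `urllib.parse.unquote`. The function name `validate_model_id` is hypothetical; the pattern list mirrors the one shown.

```python
from urllib.parse import unquote

FORBIDDEN_PATTERNS = ["../", "..\\", "/", "\\", "\x00"]


def validate_model_id(model_id: str) -> str:
    """Return model_id unchanged, or raise if it contains a forbidden pattern."""
    # Assumption: a single pass of URL decoding, as the bullet list implies
    decoded_id = unquote(model_id)
    for pattern in FORBIDDEN_PATTERNS:
        if pattern in model_id or pattern in decoded_id:
            raise ValueError("Invalid model ID: forbidden pattern")
    return model_id
```

A single `unquote` pass does not catch double-encoded input such as `%252e%252e%252f`, so a stricter validator might decode repeatedly until the string stops changing.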
Handler Selection
The registry maps model categories to handlers:
from enum import Enum, auto

class ModelCategory(Enum):
    YOLO_DETECT = auto()
    YOLO_SEGMENT = auto()
    YOLO_POSE = auto()
    HF_DETR = auto()
    HF_OWLVIT = auto()
    HF_GROUNDING_DINO = auto()
    MULTIMODAL_CAPTION = auto()
    MULTIMODAL_VQA = auto()

Category resolution follows a priority chain:
- Exact registry match: `MODEL_REGISTRY.get(model_id)`
- File extension heuristic: `.pt` → YOLO
- HuggingFace path inference: `detr`, `owlvit`, `grounding`, `blip` keywords
- Fallback: any path with `/` → DETR (HuggingFace)
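The priority chain might look like the sketch below. Several details are assumptions, not confirmed by the text above: the registry contents, the keyword-to-category mapping (in particular `blip` to `MULTIMODAL_CAPTION`), the choice of `YOLO_DETECT` for any `.pt` file, and raising `ValueError` when nothing matches. Only a subset of the categories is reproduced.

```python
from enum import Enum, auto
from typing import Dict


class ModelCategory(Enum):
    YOLO_DETECT = auto()
    HF_DETR = auto()
    HF_OWLVIT = auto()
    HF_GROUNDING_DINO = auto()
    MULTIMODAL_CAPTION = auto()


# Hypothetical registry contents, for illustration only
MODEL_REGISTRY: Dict[str, ModelCategory] = {"yolov8n.pt": ModelCategory.YOLO_DETECT}

# Assumed keyword-to-category mapping; the text lists only the keywords
HF_KEYWORDS: Dict[str, ModelCategory] = {
    "detr": ModelCategory.HF_DETR,
    "owlvit": ModelCategory.HF_OWLVIT,
    "grounding": ModelCategory.HF_GROUNDING_DINO,
    "blip": ModelCategory.MULTIMODAL_CAPTION,
}


def resolve_category(model_id: str) -> ModelCategory:
    exact = MODEL_REGISTRY.get(model_id)
    if exact is not None:
        return exact  # exact registry match wins
    lowered = model_id.lower()
    if lowered.endswith(".pt"):
        return ModelCategory.YOLO_DETECT  # extension heuristic (assumed default)
    for keyword, category in HF_KEYWORDS.items():
        if keyword in lowered:
            return category  # HuggingFace keyword inference
    if "/" in model_id:
        return ModelCategory.HF_DETR  # fallback for repo-style paths
    raise ValueError(f"Cannot resolve category for {model_id!r}")
```

Because each step returns as soon as it matches, a registry entry always overrides the heuristics that follow it.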
Thread Safety
Model loading is protected by a reentrant lock:
self._lock = threading.RLock()

def get_model(self, model_id: str) -> LoadedModel:
    with self._lock:
        # Thread-safe cache access
        if model_id in self._cache:
            return self._cache[model_id]
        # Load and cache
        model = self._load_model(model_id)
        self._cache[model_id] = model
        return model

This ensures:
- No race conditions when multiple requests hit a cold cache
- Atomic check-then-load operations
- Safe concurrent access from async handlers
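To illustrate the first guarantee, the sketch below sends several threads at the same cold model and records how often the loader runs. `LockedLoader` and `hammer` are hypothetical stand-ins, and the sleep simulates an expensive load.

```python
import threading
import time
from typing import Dict, List


class LockedLoader:
    """Hypothetical stand-in for the manager: one RLock guards check-then-load."""

    def __init__(self) -> None:
        self._lock = threading.RLock()
        self._cache: Dict[str, str] = {}
        self.load_calls: List[str] = []

    def _load_model(self, model_id: str) -> str:
        time.sleep(0.05)  # simulate an expensive load
        self.load_calls.append(model_id)
        return f"loaded:{model_id}"

    def get_model(self, model_id: str) -> str:
        with self._lock:
            # Check and load happen under the same lock, so only one thread loads
            if model_id not in self._cache:
                self._cache[model_id] = self._load_model(model_id)
            return self._cache[model_id]


def hammer(loader: LockedLoader, model_id: str, workers: int = 8) -> List[str]:
    """Request the same cold model from several threads at once."""
    results: List[str] = []

    def work() -> None:
        results.append(loader.get_model(model_id))

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

One trade-off of a single lock held for the whole load: requests for models that are already cached also wait while another model is loading.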
Configuration Reference
| Environment Variable | Default | Description |
|---|---|---|
| `MODEL_CACHE_MAXSIZE` | 10 | Maximum cached models |
| `MODEL_CACHE_TTL` | 3600 | Cache TTL in seconds |
| `MODEL_MEMORY_THRESHOLD` | 0.85 | Memory threshold (0-1) |
| `MAX_CONCURRENCY` | 4 | Max concurrent inferences |
| `SKIP_WARMUP` | false | Skip model warmup on startup |
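One way to read these variables in a single place is sketched below. The `CacheSettings` dataclass and `load_settings` function are hypothetical, and the accepted spellings for `SKIP_WARMUP` are an assumption, since the table gives only the default.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CacheSettings:
    cache_maxsize: int
    cache_ttl: int
    memory_threshold: float
    max_concurrency: int
    skip_warmup: bool


def load_settings(env: Mapping[str, str] = os.environ) -> CacheSettings:
    """Read the documented variables, falling back to the documented defaults."""
    threshold = float(env.get("MODEL_MEMORY_THRESHOLD", "0.85"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("MODEL_MEMORY_THRESHOLD must be between 0 and 1")
    return CacheSettings(
        cache_maxsize=int(env.get("MODEL_CACHE_MAXSIZE", "10")),
        cache_ttl=int(env.get("MODEL_CACHE_TTL", "3600")),
        memory_threshold=threshold,
        max_concurrency=int(env.get("MAX_CONCURRENCY", "4")),
        # Assumed boolean parsing; the accepted spellings are not documented
        skip_warmup=env.get("SKIP_WARMUP", "false").strip().lower() in {"1", "true", "yes"},
    )
```

Accepting any mapping lets tests pass a plain dict and leave the process environment untouched.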