Triton Fused Ops

Domain language for the repository’s user-facing kernel families and its support tooling. This exists to keep architecture discussions anchored to the same concepts used in the README and docs.

Language

Kernel family: A user-facing fused operation provided by the repository, such as fused_rmsnorm_rope, fused_gated_mlp, or fp8_gemm. Avoid: operator, primitive, workload

Benchmarking: The repository tooling that verifies correctness, measures latency, and reports comparative results for a Kernel family. Avoid: tuning, profiling harness

Auto-Tuning: The repository tooling that searches configuration spaces and caches the lowest-latency configuration for a Kernel family or other Triton callable. Avoid: benchmarking, runtime optimizer

Performance metrics: Derived throughput and bandwidth numbers computed from latency plus problem-shape context for a Kernel family. Avoid: tuning result, raw timing

Relationships

  • A Kernel family can be exercised by both Benchmarking and Auto-Tuning
  • Auto-Tuning selects configurations from latency measurements
  • Benchmarking reports Performance metrics
  • Performance metrics require problem-shape context in addition to latency

Example dialogue

Dev: “Should Auto-Tuning keep owning the throughput formulas for a Kernel family?” Domain expert: “No — Auto-Tuning owns latency-driven configuration search, while Performance metrics are shared support data that Benchmarking can report when it has shape context.”

Flagged ambiguities

  • “metrics” was being used for both raw latency from Auto-Tuning and derived Performance metrics — resolved: raw latency is part of tuning results, while throughput/bandwidth are Performance metrics that need shape context.