Performance Tuning
This page explains how to measure the shipped kernels correctly and how to reason about tuning work around them.
Start with the right question
The repository contains three different performance stories:
fused_rmsnorm_rope: primarily a memory-traffic reduction story,fused_gated_mlp: a fusion and launch-overhead reduction story,fp8_gemm: a quantization plus matrix-multiplication throughput story.
Treat them differently when benchmarking.
Correct timing pattern
import time
import torch
from triton_ops import fused_rmsnorm_rope
x = torch.randn(8, 2048, 4096, device="cuda", dtype=torch.float16)
weight = torch.ones(4096, device="cuda", dtype=torch.float16)
cos = torch.randn(2048, 64, device="cuda", dtype=torch.float16)
sin = torch.randn(2048, 64, device="cuda", dtype=torch.float16)
for _ in range(10):
_ = fused_rmsnorm_rope(x, weight, cos, sin)
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
_ = fused_rmsnorm_rope(x, weight, cos, sin)
torch.cuda.synchronize()
end = time.perf_counter()
print((end - start) / 100 * 1000)
Always include:
- warmup runs,
- explicit synchronization before and after timing,
- representative shapes from your target model.
Use the built-in benchmark layer when possible
BenchmarkSuite already wraps warmup, repeated execution, correctness verification, and report generation.
Use it when you want comparable outputs across multiple experiments.
Interpreting metrics
The repository provides two metric helpers:
compute_gemm_metricscompute_elementwise_metrics
Use them to distinguish between:
- computational throughput for GEMM-like work,
- effective bandwidth for elementwise or reduction-heavy kernels.
Tuning custom kernels
TritonAutoTuner is useful when you own a custom kernel wrapper and want to search configuration space over:
- block sizes,
- warp counts,
- other keyword-parameterized launch choices.
The shipped kernel entry points do not automatically search config space during normal calls.
Practical bottleneck checklist
For fused_rmsnorm_rope
- check that RoPE cache shapes are correct and contiguous,
- keep hidden dimensions aligned with the head layout,
- treat memory traffic as the primary optimization target.
For fused_gated_mlp
- benchmark on realistic intermediate dimensions,
- evaluate activation choice explicitly,
- remember that a full FFN includes additional surrounding work not measured by this kernel alone.
For fp8_gemm
- compare auto-quantization against explicit pre-quantization,
- validate the numerical error against an FP16 baseline,
- benchmark representative matrix aspect ratios rather than only square matrices.
What not to trust blindly
- A single GPU architecture result does not generalize to every deployment target.
- A benchmark without synchronization is not meaningful.
- A latency improvement on isolated kernels does not automatically translate into identical end-to-end model speedup.