Performance Casebook
Architecture-aware SGEMM tuning notes from Volta to Hopper
One map from architecture to tuning focus
Practical case patterns
Case A: Tensor Core slower than expected on Volta/Turing
Signal: WMMA end-to-end is close to, or below, FP32 kernels.
Likely causes
- Dimensions are frequently non-16-aligned, forcing fallback behavior.
- Conversion and wrapper overhead outweigh the compute gain from Tensor Cores.
Actions
- Benchmark one aligned shape and one irregular shape side by side.
- Compare WMMA compute-only and WMMA end-to-end explicitly.
- Keep the fallback path unchanged while tuning conversion/staging boundaries.
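The alignment fallback in the causes and actions above can be sketched as host-side dispatch logic. `run_wmma_path` and `run_fp32_path` are hypothetical stand-ins for the real kernel launchers; the 16-element requirement follows the 16x16x16 WMMA fragment shape.

```cpp
#include <cstdio>

// Hypothetical launchers standing in for the real WMMA and FP32 kernels.
void run_wmma_path(int M, int N, int K) { /* launch Tensor Core kernel */ }
void run_fp32_path(int M, int N, int K) { /* launch plain FP32 kernel  */ }

// WMMA fragments cover 16x16x16 tiles, so every GEMM dimension must be a
// multiple of 16 to stay on the fast path without padding or fallback.
bool wmma_eligible(int M, int N, int K) {
    return M % 16 == 0 && N % 16 == 0 && K % 16 == 0;
}

void dispatch(int M, int N, int K) {
    if (wmma_eligible(M, N, K))
        run_wmma_path(M, N, K);
    else
        run_fp32_path(M, N, K);  // fallback path stays unchanged while tuning
}
```

Benchmarking one aligned shape (e.g. 4096^3) and one irregular shape side by side then amounts to calling `dispatch` with each and timing both branches separately.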
Case B: Ampere/Ada gains stall after tiled kernel
Signal: Tiled improves clearly, but Double Buffer and Tensor Core gains are weak.
Likely causes
- Stage overlap is incomplete.
- Register pressure reduces active warps.
Actions
- Try a smaller block/tile configuration to recover occupancy.
- Check whether additional stages increase total time instead of reducing it.
- Validate that correctness still matches cuBLAS after each launch change.
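The register-pressure cause above can be estimated on the host before profiling. The limits below (65,536 registers and 48 resident warps per SM) are illustrative Ampere-class values, not queried from the device; real code would use `cudaDeviceGetAttribute`.

```cpp
#include <algorithm>

// Rough occupancy estimate: how many warps stay resident on one SM when each
// thread holds `regs_per_thread` registers. Limits are illustrative
// Ampere-class values, assumed here rather than queried from the device.
int active_warps(int block_threads, int regs_per_thread,
                 int regs_per_sm = 65536, int max_warps = 48) {
    int warps_per_block = (block_threads + 31) / 32;
    int regs_per_block  = block_threads * regs_per_thread;
    int blocks_by_regs  = regs_per_sm / regs_per_block;  // register limit
    int blocks_by_warps = max_warps / warps_per_block;   // warp-slot limit
    return std::min(blocks_by_regs, blocks_by_warps) * warps_per_block;
}
```

With 256-thread blocks, dropping from 128 to 64 registers per thread doubles the resident warps in this model, which is the mechanism behind the "smaller block/tile configuration" action.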
Case C: Hopper compute-only looks strong but end-to-end remains flat
Signal: WMMA compute-only scales, while full-pipeline speedup is limited.
Likely causes
- Data movement or conversion flow dominates.
- Benchmark setup underestimates pipeline warmup effects.
Actions
- Increase warmup and benchmark iterations for stable timing windows.
- Profile conversion and launch overhead as a separate segment.
- Tune overlap strategy before touching micro-level compute code.
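The warmup action above can be sketched as a host-side timing harness. The kernel under test is represented by any callable, and taking the median of the timed iterations is one reasonable choice of stable statistic, an assumption rather than the casebook's prescribed estimator.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <vector>

// Time `fn` with explicit warmup so caching, JIT, and clock-ramp effects do
// not leak into the measured window; return the median iteration time in ms.
double bench_ms(const std::function<void()>& fn,
                int warmup = 10, int iters = 100) {
    for (int i = 0; i < warmup; ++i) fn();   // discarded warmup runs
    std::vector<double> t;
    t.reserve(iters);
    for (int i = 0; i < iters; ++i) {
        auto s = std::chrono::steady_clock::now();
        fn();
        auto e = std::chrono::steady_clock::now();
        t.push_back(std::chrono::duration<double, std::milli>(e - s).count());
    }
    std::sort(t.begin(), t.end());
    return t[t.size() / 2];                  // median is robust to outliers
}
```

Profiling conversion and launch overhead as a separate segment then means wrapping only that segment in its own `bench_ms` call, rather than subtracting numbers measured in one combined window.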
Reporting rules for trustworthy comparisons
- Always report GPU model, CUDA version, and whether numbers are end-to-end or compute-only.
- Never compare aligned-only numbers to mixed-shape baselines without labeling scope.
- Keep cuBLAS verification and tolerance policy unchanged while tuning performance.
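The last rule can be sketched as a simple element-wise relative-error check against a reference result (e.g. cuBLAS output copied back to the host). The `1e-3` tolerance below is an illustrative value, not the casebook's fixed policy.

```cpp
#include <cmath>
#include <cstddef>

// Element-wise relative-error check against a reference result. Keep this
// check (and its tolerance) fixed while tuning so that apparent speedups
// cannot quietly hide numerical drift.
bool matches_reference(const float* out, const float* ref, size_t n,
                       float rel_tol = 1e-3f) {
    for (size_t i = 0; i < n; ++i) {
        float denom = std::fabs(ref[i]) > 1.0f ? std::fabs(ref[i]) : 1.0f;
        if (std::fabs(out[i] - ref[i]) / denom > rel_tol) return false;
    }
    return true;
}
```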