Diagnosis Loop

A practical SGEMM tuning loop must separate observation, hypothesis, and validation.

End-to-end optimization loop

Use one loop per hypothesis. The loop is deliberately small because the goal is learning, not motion.

Signal	Likely bottleneck	First place to look
Naive to tiled jumps hard, later gains flatten	Memory movement is still dominant	Shared-memory reuse and global access patterns
Tiled improves, bank-free improves again	Shared-memory conflicts are real	Shared-memory layout and bank mapping
Double buffering underperforms expectations	Overlap is incomplete or occupancy fell	Register pressure, stage count, launch geometry
WMMA compute-only looks good, end-to-end does not	Conversion, staging, or fallback overhead dominates	FP32→FP16 staging and fast-path guards
Irregular shapes regress sharply	Alignment assumptions are too strong	Fallback path and shape-sensitive guards
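
The first three rows of the table can be turned into a small mechanical check over benchmark timings. The sketch below is illustrative only: the kernel names (`naive`, `tiled`, `bank_free`, `double_buffer`) and the `FLAT` and `HARD` thresholds are assumptions rather than values taken from this project, and the WMMA and irregular-shape rows are left out because they need compute-only and per-shape measurements.

```python
# Sketch only: kernel names and thresholds are illustrative assumptions.
FLAT = 1.10  # a step speedup below this counts as "flat" (assumed)
HARD = 2.00  # a step speedup at or above this counts as a "hard jump" (assumed)


def speedup(before_ms, after_ms):
    """Return how many times faster `after_ms` is than `before_ms`."""
    return before_ms / after_ms


def diagnose(ms):
    """Return (likely bottleneck, first place to look) for each matching signal.

    `ms` maps a hypothetical kernel name to its measured time in milliseconds.
    More than one row can match; each match is a hypothesis to test, not a verdict.
    """
    tiled = speedup(ms["naive"], ms["tiled"])
    bank_free = speedup(ms["tiled"], ms["bank_free"])
    double_buf = speedup(ms["bank_free"], ms["double_buffer"])

    findings = []
    # Row 1: naive to tiled jumps hard, later gains flatten.
    if tiled >= HARD and bank_free < FLAT and double_buf < FLAT:
        findings.append(("Memory movement is still dominant",
                         "Shared-memory reuse and global access patterns"))
    # Row 2: tiled improves, bank-free improves again.
    if tiled >= FLAT and bank_free >= FLAT:
        findings.append(("Shared-memory conflicts are real",
                         "Shared-memory layout and bank mapping"))
    # Row 3: double buffering adds little over the previous step.
    if double_buf < FLAT:
        findings.append(("Overlap is incomplete or occupancy fell",
                         "Register pressure, stage count, launch geometry"))
    return findings


if __name__ == "__main__":
    timings = {"naive": 40.0, "tiled": 10.0, "bank_free": 8.0, "double_buffer": 7.8}
    for bottleneck, first_look in diagnose(timings):
        print(f"{bottleneck} -> {first_look}")
```

Returning every matching row, instead of a single winner, keeps the output aligned with the one-loop-per-hypothesis rule: each entry becomes its own loop.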

Signal
WMMA end-to-end performance is close to, or below, that of the FP32 kernels.

Likely causes

Actions

Signal
Tiled improves clearly, but Double Buffer and Tensor Core gains stay weak.

Likely causes

Actions

Signal
WMMA compute-only performance grows, while full-pipeline performance barely moves.

Likely causes

Actions
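
One hedged way to put a number on this signal is to compare the compute-only time with the end-to-end time and ask how much the full pipeline could improve at all. The sketch below assumes the WMMA compute kernel and everything around it (conversion, staging, fallback) run back to back with no overlap; the function names are hypothetical and the timings are made up for illustration.

```python
def outside_share(compute_only_ms, end_to_end_ms):
    """Fraction of end-to-end time spent outside the WMMA compute kernel.

    Assumes compute and the surrounding work are serialized (no overlap).
    """
    if end_to_end_ms <= 0 or not (0 <= compute_only_ms <= end_to_end_ms):
        raise ValueError("expected 0 <= compute_only_ms <= end_to_end_ms, end_to_end_ms > 0")
    return 1.0 - compute_only_ms / end_to_end_ms


def end_to_end_speedup(share_outside, compute_speedup):
    """Amdahl-style estimate: only the compute share gets faster."""
    return 1.0 / (share_outside + (1.0 - share_outside) / compute_speedup)


if __name__ == "__main__":
    share = outside_share(compute_only_ms=2.0, end_to_end_ms=8.0)
    print(f"time outside compute: {share:.0%}")
    print(f"end-to-end gain if compute gets 2x faster: {end_to_end_speedup(share, 2.0):.2f}x")
```

With these illustrative numbers, three quarters of the time sits outside the compute kernel, so doubling compute speed moves the full pipeline by only about 1.14x. A result like that points at the staging path before any further kernel tuning.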

Stop the loop and hand the claim to Validation when:

If any of those fail, the right move is usually rollback, not explanation.