# Learning Path

Follow the optimization ladder in the order the repository was designed to teach it.

## Recommended order
| Step | Kernel | Why it comes here |
|---|---|---|
| 1 | Naive | Establish the baseline cost model |
| 2 | Tiled | Introduce shared-memory reuse |
| 3 | Bank-Free | Show why shared-memory layout still matters |
| 4 | Double Buffer | Add staging and overlap concepts |
| 5 | Tensor Core | Move to WMMA and mixed-precision hardware |
## What each stage teaches
### Naive -> Tiled
- Thread/block mapping
- Memory coalescing
- Shared-memory reuse
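The ideas behind this first jump can be sketched as follows. This is an illustrative kernel, not the repository's code; the name `TILE` and the row-major `A (M x K)`, `B (K x N)`, `C (M x N)` layout are assumptions made for the example.

```cuda
#define TILE 32

__global__ void tiled_matmul(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    // Thread/block mapping: each thread owns one element of C.
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Coalesced loads: adjacent threadIdx.x values touch adjacent addresses.
        As[threadIdx.y][threadIdx.x] =
            (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < K && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        // Reuse: each value loaded once from global memory is read TILE times.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N) C[row * N + col] = acc;
}
```

The naive kernel issues one global load per multiply; the tiled version cuts global traffic by roughly a factor of `TILE` per operand, which is the cost-model shift this stage is meant to demonstrate.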
### Tiled -> Bank-Free
- 32-bank shared-memory behavior
- Why padding the shared tile (e.g. `[32][33]` instead of `[32][32]`) matters
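The padding trick can be seen in a small sketch (illustrative only, not the repository's kernel): a 32x32 shared-memory transpose. Shared memory is split into 32 four-byte banks, so a `float` at index `a` (in float units) lives in bank `a % 32`.

```cuda
__global__ void transpose32(const float* in, float* out) {
    __shared__ float tile[32][33];   // padded row: 33 floats, not 32

    int x = threadIdx.x, y = threadIdx.y;   // launch with dim3(32, 32)
    tile[y][x] = in[y * 32 + x];            // row-wise write: conflict-free either way
    __syncthreads();

    // Column-wise read: with tile[32][32], a warp (fixed y, x = 0..31) reads
    // addresses x*32 + y, which all land in bank y -> a 32-way conflict.
    // With tile[32][33] the addresses are x*33 + y, i.e. bank (x + y) % 32,
    // so the 32 threads of the warp hit 32 distinct banks.
    out[x * 32 + y] = tile[x][y];
}
```

The extra column wastes 32 floats of shared memory per tile but turns a serialized 32-transaction column access into a single conflict-free one.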
### Bank-Free -> Double Buffer
- Pipeline thinking
- Tile staging and latency hiding
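The pipeline idea can be sketched like this. It is a simplified illustration under assumed names, not the repository's kernel; `load_tile` is a hypothetical helper introduced only for this example.

```cuda
#define TILE 32

// Hypothetical helper (not from the repository): each thread stages one
// element of the A and B tiles for step t into shared memory, zero-padded
// at the matrix edges.
__device__ void load_tile(float dstA[TILE][TILE], float dstB[TILE][TILE],
                          const float* A, const float* B, int t,
                          int row, int col, int M, int N, int K) {
    dstA[threadIdx.y][threadIdx.x] =
        (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
    dstB[threadIdx.y][threadIdx.x] =
        (t + threadIdx.y < K && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
}

__global__ void double_buffered_matmul(const float* A, const float* B, float* C,
                                       int M, int N, int K) {
    __shared__ float As[2][TILE][TILE];   // two stages: compute one,
    __shared__ float Bs[2][TILE][TILE];   // prefetch into the other

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    int cur = 0;
    load_tile(As[cur], Bs[cur], A, B, 0, row, col, M, N, K);
    __syncthreads();

    for (int t = TILE; t < K; t += TILE) {
        int nxt = cur ^ 1;
        // Prefetch the next tile while the math below consumes the current
        // one: global-memory latency overlaps with computation.
        load_tile(As[nxt], Bs[nxt], A, B, t, row, col, M, N, K);
        for (int k = 0; k < TILE; ++k)
            acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];
        __syncthreads();   // next stage is now fully staged
        cur = nxt;
    }
    for (int k = 0; k < TILE; ++k)   // drain the last staged tile
        acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];
    if (row < M && col < N) C[row * N + col] = acc;
}
```

The cost is doubled shared-memory usage per block; the payoff is that loads for tile `t+1` are in flight while tile `t` is being multiplied.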
### Double Buffer -> Tensor Core
- WMMA fragments
- Mixed precision
- Safe fallback behavior for unsupported shapes
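A minimal WMMA sketch, assuming the common 16x16x16 FP16 shape, `M`, `N`, `K` all multiples of 16, row-major operands, and one 32-thread warp per block; the repository's tensor-core kernel and its fallback path may differ.

```cuda
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_gemm(const __half* A, const __half* B, float* C,
                          int M, int N, int K) {
    int tileM = blockIdx.y;   // which 16-row band of C this warp owns
    int tileN = blockIdx.x;   // which 16-column band

    // Fragments are opaque per-warp register tiles; all 32 threads
    // cooperatively hold one slice of each.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a, A + tileM * 16 * K + k, K);
        wmma::load_matrix_sync(b, B + k * N + tileN * 16, N);
        wmma::mma_sync(acc, a, b, acc);   // mixed precision: FP16 in, FP32 accumulate
    }
    wmma::store_matrix_sync(C + tileM * 16 * N + tileN * 16, acc, N,
                            wmma::mem_row_major);
}
```

Launched with a `(N/16, M/16)` grid of single-warp blocks, this sketch only handles WMMA-friendly shapes; dimensions that are not multiples of 16 would need padding or a dispatch back to a plain CUDA kernel, which is the "safe fallback behavior" this stage covers.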
## Suggested reading rhythm
- Build and run the project first
- Read the kernel page for one stage
- Run the benchmark again
- Compare the code with the previous stage
- Move to the next optimization only after the current one is clear
## Before you start
- Make sure your environment follows Getting Started
- Use the Architecture page if you want the repository-level map first
- Keep the Specifications Index nearby if you want the normative requirements