# Learning Path

Follow the optimization ladder in the order the repository was designed to teach it.

## Recommended order
| Step | Kernel | Why it comes here |
|---|---|---|
| 1 | Naive | Establish the baseline cost model |
| 2 | Tiled | Introduce shared-memory reuse |
| 3 | Bank-Free | Show why shared-memory layout still matters |
| 4 | Double Buffer | Add staging and overlap concepts |
| 5 | Tensor Core | Move to WMMA and mixed-precision hardware |
## What each stage teaches
### Naive -> Tiled
- Thread/block mapping
- Memory coalescing
- Shared-memory reuse
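The ideas behind this first jump can be sketched as follows. This is an illustrative kernel, not the repository's code; the name `TILE` and the row-major `A (M x K)`, `B (K x N)`, `C (M x N)` layout are assumptions made for the example.

```cuda
#define TILE 32

__global__ void tiled_matmul(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    // Thread/block mapping: each thread owns one element of C.
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Coalesced loads: adjacent threadIdx.x values touch adjacent addresses.
        As[threadIdx.y][threadIdx.x] =
            (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < K && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        // Reuse: each value loaded once from global memory is read TILE times.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N) C[row * N + col] = acc;
}
```

The naive kernel issues one global load per multiply; the tiled version cuts global traffic by roughly a factor of `TILE` per operand, which is the cost-model shift this stage is meant to demonstrate.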
### Tiled -> Bank-Free
- 32-bank shared-memory behavior
- Why padding the shared tile (e.g. `[32][33]` instead of `[32][32]`) matters
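The padding trick can be seen in a small sketch (illustrative only, not the repository's kernel): a 32x32 shared-memory transpose. Shared memory is split into 32 four-byte banks, so a `float` at index `a` (in float units) lives in bank `a % 32`.

```cuda
__global__ void transpose32(const float* in, float* out) {
    __shared__ float tile[32][33];   // padded row: 33 floats, not 32

    int x = threadIdx.x, y = threadIdx.y;   // launch with dim3(32, 32)
    tile[y][x] = in[y * 32 + x];            // row-wise write: conflict-free either way
    __syncthreads();

    // Column-wise read: with tile[32][32], a warp (fixed y, x = 0..31) reads
    // addresses x*32 + y, which all land in bank y -> a 32-way conflict.
    // With tile[32][33] the addresses are x*33 + y, i.e. bank (x + y) % 32,
    // so the 32 threads of the warp hit 32 distinct banks.
    out[x * 32 + y] = tile[x][y];
}
```

The extra column wastes 32 floats of shared memory per tile but turns a serialized 32-transaction column access into a single conflict-free one.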
### Bank-Free -> Double Buffer
- Pipeline thinking
- Tile staging and latency hiding
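The pipeline idea can be sketched like this. It is a simplified illustration under assumed names, not the repository's kernel; `load_tile` is a hypothetical helper introduced only for this example.

```cuda
#define TILE 32

// Hypothetical helper (not from the repository): each thread stages one
// element of the A and B tiles for step t into shared memory, zero-padded
// at the matrix edges.
__device__ void load_tile(float dstA[TILE][TILE], float dstB[TILE][TILE],
                          const float* A, const float* B, int t,
                          int row, int col, int M, int N, int K) {
    dstA[threadIdx.y][threadIdx.x] =
        (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
    dstB[threadIdx.y][threadIdx.x] =
        (t + threadIdx.y < K && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
}

__global__ void double_buffered_matmul(const float* A, const float* B, float* C,
                                       int M, int N, int K) {
    __shared__ float As[2][TILE][TILE];   // two stages: compute one,
    __shared__ float Bs[2][TILE][TILE];   // prefetch into the other

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    int cur = 0;
    load_tile(As[cur], Bs[cur], A, B, 0, row, col, M, N, K);
    __syncthreads();

    for (int t = TILE; t < K; t += TILE) {
        int nxt = cur ^ 1;
        // Prefetch the next tile while the math below consumes the current
        // one: global-memory latency overlaps with computation.
        load_tile(As[nxt], Bs[nxt], A, B, t, row, col, M, N, K);
        for (int k = 0; k < TILE; ++k)
            acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];
        __syncthreads();   // next stage is now fully staged
        cur = nxt;
    }
    for (int k = 0; k < TILE; ++k)   // drain the last staged tile
        acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];
    if (row < M && col < N) C[row * N + col] = acc;
}
```

The cost is doubled shared-memory usage per block; the payoff is that loads for tile `t+1` are in flight while tile `t` is being multiplied.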
### Double Buffer -> Tensor Core
- WMMA fragments
- Mixed precision
- Safe fallback behavior for unsupported shapes
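A minimal WMMA sketch, assuming the common 16x16x16 FP16 shape, `M`, `N`, `K` all multiples of 16, row-major operands, and one 32-thread warp per block; the repository's tensor-core kernel and its fallback path may differ.

```cuda
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_gemm(const __half* A, const __half* B, float* C,
                          int M, int N, int K) {
    int tileM = blockIdx.y;   // which 16-row band of C this warp owns
    int tileN = blockIdx.x;   // which 16-column band

    // Fragments are opaque per-warp register tiles; all 32 threads
    // cooperatively hold one slice of each.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a, A + tileM * 16 * K + k, K);
        wmma::load_matrix_sync(b, B + k * N + tileN * 16, N);
        wmma::mma_sync(acc, a, b, acc);   // mixed precision: FP16 in, FP32 accumulate
    }
    wmma::store_matrix_sync(C + tileM * 16 * N + tileN * 16, acc, N,
                            wmma::mem_row_major);
}
```

Launched with a `(N/16, M/16)` grid of single-warp blocks, this sketch only handles WMMA-friendly shapes; dimensions that are not multiples of 16 would need padding or a dispatch back to a plain CUDA kernel, which is the "safe fallback behavior" this stage covers.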
## Suggested reading rhythm
- Build and run the project first
- Read the kernel page for one stage
- Run the benchmark again
- Compare the code with the previous stage
- Move to the next optimization only after the current one is clear
## Before you start
- Make sure your environment follows Getting Started
- Use the Architecture page if you want the repository-level map first
- Keep the Specifications Index nearby if you want the normative requirements