Learning Path
Follow the optimization ladder in the order the repository was designed to teach it
What each stage teaches
Naive -> Tiled
- Thread/block mapping
- Memory coalescing
- Shared-memory reuse
Tiled -> Bank-Free
- 32-bank shared-memory behavior
- Why
[32][33]matters
Bank-Free -> Double Buffer
- Pipeline thinking
- Tile staging and latency hiding
Double Buffer -> Tensor Core
- WMMA fragments
- Mixed precision
- Safe fallback behavior for unsupported shapes
Before you start
- Make sure your environment follows Getting Started
- Use the Architecture page if you want the repository-level map first
- Keep the Specifications Index nearby if you want the normative requirements