Academy
The academy is the ordered learning surface of this repository. Architecture gives the system map. The academy gives the teaching sequence — the order in which each kernel stage is explained, and why that order is non-negotiable.
The structuring principle
Read kernels as a progression of bottleneck shifts, not as a list of tricks:
| Stage | Bottleneck exposed | Structural change introduced |
|---|---|---|
| Naïve FP32 | Unlimited DRAM traffic | Establishes the cost model |
| Tiled FP32 | Redundant global reads | Shared-memory staging |
| Bank-Free FP32 | Shared-memory bank conflicts | Tile padding |
| Double Buffer | Memory latency in critical path | Overlap staging and compute |
| Tensor Core WMMA | FP32 throughput ceiling | Hardware fragment accumulation |
Each later page assumes the previous page already explained why its extra complexity is justified. Reading out of order makes the causal chain invisible.
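To make one row of the table concrete, the Bank-Free FP32 step can be modeled on the host without a GPU. The sketch below is illustrative only: it assumes 32 shared-memory banks of 4-byte words, a row-major 32x32 FP32 tile, and a warp whose lanes each read one element of the same column. The tile shape and the helper names `conflict_degree` and `column_access` are hypothetical and are not taken from this repository's kernels.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 shared-memory banks, each one 4-byte word wide


def conflict_degree(word_indices):
    """Largest number of distinct words that one warp access sends to a single bank."""
    per_bank = Counter()
    # Lanes reading the same word are served by a broadcast, so count unique words only.
    for word in set(word_indices):
        per_bank[word % NUM_BANKS] += 1
    return max(per_bank.values())


def column_access(row_stride, column=0, warp_size=32):
    """Word indices touched when lane i reads tile[i][column] of a row-major tile."""
    return [lane * row_stride + column for lane in range(warp_size)]


# float tile[32][32]: every lane lands in the same bank, so the access serializes.
unpadded = conflict_degree(column_access(row_stride=32))

# float tile[32][33]: one padding column shifts each row by one bank.
padded = conflict_degree(column_access(row_stride=33))

print(unpadded, padded)  # prints "32 1"
```

Under these assumptions the unpadded column read is a 32-way conflict and the padded one is conflict-free. The padding only earns its extra shared memory once the tiled stage has already removed the redundant global reads, which is the dependency the reading order is meant to preserve.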
Academy map
| Track | Purpose | Start here |
|---|---|---|
| Orientation | Learn the route through the ladder before opening any kernel page | Learning Path |
| Experiment discipline | Avoid drawing conclusions from sloppy measurements | Benchmark Discipline |
| Bottleneck reasoning | Turn symptoms into the next defensible architectural change | Diagnosis Loop |
| Kernel deep dives | Inspect the actual optimization stages in sequence | Naive Kernel |
| Retention aids | Refresh memory hierarchy and tuning heuristics quickly | CUDA Memory Cheat Sheet |
Recommended reading order
- Learning Path — orientation before any kernel
- Naive Kernel — cost model baseline
- Tiled Kernel — shared-memory reuse
- Bank Conflict Free — stability under conflict shapes
- Double Buffer — latency hiding
- Tensor Core WMMA — guarded throughput ceiling
- Diagnosis Loop — turn measurements into decisions
- Optimization Playbook — structured tuning process
Interview-ready framing
When defending any kernel stage under review, use this four-part structure:
- Name the current bottleneck — what resource is saturated or wastefully used?
- Name the specific structural change — what does this kernel do differently at the hardware level?
- State the evidence requirement — what measurement would confirm the change helped?
- State the constraint — what assumption or shape condition limits this improvement?
That sequence keeps the discussion at the level of engineering reasoning rather than benchmark screenshots. The academy is designed to give you a defensible answer for each of the five stages.
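One way to keep the four parts honest is to write them down as a record and check that none is left blank. This is a minimal sketch, not part of the repository: the `StageDefense` class and its `missing_parts` method are hypothetical names, and the example fills in only the two fields that the stage table above states for the Tiled FP32 stage.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class StageDefense:
    """Hypothetical record holding the four-part answer for one kernel stage."""

    bottleneck: str            # what resource is saturated or wastefully used
    structural_change: str     # what the kernel does differently at the hardware level
    evidence_requirement: str  # what measurement would confirm the change helped
    constraint: str            # what assumption or shape condition limits the gain

    def missing_parts(self):
        """Return the names of the parts that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Only the first two parts are filled in, so the check reports the other two.
tiled = StageDefense(
    bottleneck="Redundant global reads",
    structural_change="Shared-memory staging",
    evidence_requirement="",
    constraint="",
)

print(tiled.missing_parts())  # prints ['evidence_requirement', 'constraint']
```

An answer that names the bottleneck and the change but leaves the evidence and the constraint empty is incomplete in the sense described above, and the check makes that gap explicit.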
What the academy is not
The academy is not a reference manual for CUDA programming. For reference, use the CUDA C++ Programming Guide and the CUDA Memory Cheat Sheet in this section.
The academy is not a substitute for reading the source code. Each kernel page explains the architectural reasoning; the code itself contains the implementation. Both are necessary to give a complete account of any stage.
Related resources
- Architecture Overview — the system map that contextualizes the ladder
- Validation Overview — the trust boundary for any number produced during academy study
- Performance Model — analytical cost model behind each ladder stage