Academy

The academy is the ordered learning surface of this repository. Architecture gives the system map. The academy gives the teaching sequence — the order in which each kernel stage is explained, and why that order is non-negotiable.

The structuring principle

Read kernels as a progression of bottleneck shifts, not as a list of tricks:

Stage	Bottleneck exposed	Structural change introduced
Naïve FP32	Unlimited DRAM traffic	Establishes the cost model
Tiled FP32	Redundant global reads	Shared-memory staging
Bank-Free FP32	Shared-memory bank conflicts	Tile padding
Double Buffer	Memory latency in critical path	Overlap staging and compute
Tensor Core WMMA	FP32 throughput ceiling	Hardware fragment accumulation

Each later page assumes the previous page already explained why its extra complexity is justified. Reading out of order makes the causal chain invisible.
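
The first three rows of the table can be made concrete with a small back-of-envelope model. The sketch below is illustrative only and is not taken from the repository: the function names are hypothetical, caches are ignored, matrix dimensions are assumed to be multiples of the tile size, and shared memory is assumed to have 32 banks of 4-byte words.

```python
# Simplified, assumed cost model for the first three ladder stages.
# All names here are hypothetical; none come from the repository.

WARP_SIZE = 32  # threads per warp
NUM_BANKS = 32  # assumed shared-memory bank count, 4-byte words


def naive_global_loads(m: int, n: int, k: int) -> int:
    """Element loads from global memory when each of the m*n outputs
    reads its own k-long row of A and k-long column of B."""
    return 2 * m * n * k


def tiled_global_loads(m: int, n: int, k: int, tile: int) -> int:
    """Element loads when each tile x tile output block stages its
    operands through shared memory once per k-step.

    Assumes m, n and k are multiples of tile.
    """
    blocks = (m // tile) * (n // tile)
    loads_per_block = (k // tile) * 2 * tile * tile
    return blocks * loads_per_block


def worst_bank_conflict(row_stride: int) -> int:
    """Largest number of warp lanes mapped to one bank when lane i
    reads word 0 of row i (a column access into a shared tile).

    row_stride is the row length in 4-byte words, padding included.
    """
    hits = [0] * NUM_BANKS
    for lane in range(WARP_SIZE):
        hits[(lane * row_stride) % NUM_BANKS] += 1
    return max(hits)


m = n = k = 1024
# Tiling cuts modeled global loads by a factor equal to the tile width.
print(naive_global_loads(m, n, k) // tiled_global_loads(m, n, k, tile=32))  # 32
# A 32-word row stride sends every lane of a column read to the same bank.
print(worst_bank_conflict(32))  # 32
# One word of padding per row spreads the same read across all banks.
print(worst_bank_conflict(33))  # 1
```

Under these assumptions the model shows why the order matters: tiling removes redundant global reads, which makes shared memory the hot path, and only then does the padding in the next stage have a conflict to remove.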

Academy map

Track	Purpose	Start here
Orientation	Learn the route through the ladder before opening any kernel page	Learning Path
Experiment discipline	Avoid drawing conclusions from sloppy measurements	Benchmark Discipline
Bottleneck reasoning	Turn symptoms into the next defendable architectural change	Diagnosis Loop
Kernel deep dives	Inspect the actual optimization stages in sequence	Naive Kernel
Retention aids	Refresh memory hierarchy and tuning heuristics quickly	CUDA Memory Cheat Sheet

Interview-ready framing

When defending any kernel stage under review, use this four-part structure:

1. Name the current bottleneck: what resource is saturated or wastefully used?
2. Name the specific structural change: what does this kernel do differently at the hardware level?
3. State the evidence requirement: what measurement would confirm the change helped?
4. State the constraint: what assumption or shape condition limits this improvement?

That sequence keeps the discussion at the level of engineering reasoning rather than benchmark screenshots. The academy is designed to give you a defensible answer for each of the five stages.
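
One way to rehearse the four-part structure is to write each stage down as a small record. The sketch below is a hypothetical study aid, not part of the repository: the `StageDefense` class and its field names are assumptions, and the evidence and constraint strings in the example are illustrative placeholders rather than measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageDefense:
    """One kernel stage argued in the four-part order (hypothetical helper)."""

    stage: str
    bottleneck: str         # resource that is saturated or wastefully used
    structural_change: str  # what the kernel does differently in hardware terms
    evidence: str           # measurement that would confirm the change helped
    constraint: str         # assumption or shape condition limiting the gain

    def render(self) -> str:
        """Return the stage name followed by the four numbered parts."""
        parts = [
            ("Bottleneck", self.bottleneck),
            ("Structural change", self.structural_change),
            ("Evidence required", self.evidence),
            ("Constraint", self.constraint),
        ]
        lines = [self.stage]
        for number, (label, text) in enumerate(parts, start=1):
            lines.append(f"{number}. {label}: {text}")
        return "\n".join(lines)


# Bottleneck and structural change come from the ladder table;
# evidence and constraint are placeholders to be replaced with real notes.
tiled = StageDefense(
    stage="Tiled FP32",
    bottleneck="Redundant global reads",
    structural_change="Shared-memory staging",
    evidence="Placeholder: global load count falls relative to the naive kernel",
    constraint="Placeholder: matrix dimensions are multiples of the tile size",
)
print(tiled.render())
```

Filling in one record per stage makes gaps visible: a stage with no stated evidence or constraint is not yet defensible.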

What the academy is not

The academy is not a reference manual for CUDA programming. For reference, use the CUDA C++ Programming Guide and the CUDA Memory Cheat Sheet in this section.

The academy is not a substitute for reading the source code. Each kernel page explains the architectural reasoning; the code itself contains the implementation. Both are necessary to give a complete account of any stage.

Related resources

Architecture Overview: the system map that contextualizes the ladder
Validation Overview: the trust boundary for any number produced during academy study
Performance Model: the analytical cost model behind each ladder stage
