Academy

The academy is the ordered learning surface of this repository. Architecture gives the system map. The academy gives the teaching sequence — the order in which each kernel stage is explained, and why that order is non-negotiable.

The structuring principle

Read kernels as a progression of bottleneck shifts, not as a list of tricks:

| Stage | Bottleneck exposed | Structural change introduced |
| --- | --- | --- |
| Naïve FP32 | Unlimited DRAM traffic | Establishes the cost model |
| Tiled FP32 | Redundant global reads | Shared-memory staging |
| Bank-Free FP32 | Shared-memory bank conflicts | Tile padding |
| Double Buffer | Memory latency in critical path | Overlap staging and compute |
| Tensor Core WMMA | FP32 throughput ceiling | Hardware fragment accumulation |

Each later page assumes the previous page already explained why its extra complexity is justified. Reading out of order makes the causal chain invisible.
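
The first bottleneck shift in the table can be made concrete with a back-of-envelope traffic count. The sketch below is a simplified model, not the repository's own cost model: it assumes a square n x n matrix multiply, ignores caches entirely, and assumes n is a multiple of the tile width. The function names are hypothetical and chosen only for illustration.

```python
def naive_global_loads(n: int) -> int:
    """Element loads from global memory for a naive n x n matmul.

    Assumption: every output element re-reads a full row of A and a full
    column of B, and no cache absorbs any of the repeats.
    """
    return 2 * n ** 3


def tiled_global_loads(n: int, tile: int) -> int:
    """Element loads from global memory when each block stages tile x tile
    sub-matrices of A and B in shared memory before computing.

    Assumption: n is a multiple of tile, so there are no partial tiles.
    """
    if tile <= 0 or n % tile != 0:
        raise ValueError("this model assumes n is a positive multiple of tile")
    blocks = (n // tile) ** 2              # one block per output tile
    steps_per_block = n // tile            # tiles walked along the K dimension
    loads_per_step = 2 * tile * tile       # one tile of A plus one tile of B
    return blocks * steps_per_block * loads_per_step


if __name__ == "__main__":
    n, tile = 1024, 32
    naive = naive_global_loads(n)
    tiled = tiled_global_loads(n, tile)
    print(f"naive loads: {naive}")
    print(f"tiled loads: {tiled}")
    print(f"reduction factor: {naive // tiled}")
```

Under these assumptions the tiled count is the naive count divided by the tile width, so a 32-wide tile cuts modeled global loads by a factor of 32. Real kernels will deviate from this because of caching, coalescing, and edge tiles, which is why the later stages exist.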

Academy map

| Track | Purpose | Start here |
| --- | --- | --- |
| Orientation | Learn the route through the ladder before opening any kernel page | Learning Path |
| Experiment discipline | Avoid drawing conclusions from sloppy measurements | Benchmark Discipline |
| Bottleneck reasoning | Turn symptoms into the next defendable architectural change | Diagnosis Loop |
| Kernel deep dives | Inspect the actual optimization stages in sequence | Naive Kernel |
| Retention aids | Refresh memory hierarchy and tuning heuristics quickly | CUDA Memory Cheat Sheet |

  1. Learning Path — orientation before any kernel
  2. Naive Kernel — cost model baseline
  3. Tiled Kernel — shared-memory reuse
  4. Bank Conflict Free — stability under conflict shapes
  5. Double Buffer — latency hiding
  6. Tensor Core WMMA — guarded throughput ceiling
  7. Diagnosis Loop — turn measurements into decisions
  8. Optimization Playbook — structured tuning process
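
The Bank Conflict Free step in the list above rests on a small piece of address arithmetic that is easy to check on the host. The sketch below models it under stated assumptions: shared memory is split into 32 banks of 4-byte words, a warp has 32 threads, and each thread reads one float from the same column of a row-major tile. The one-float padding shown is a common choice, not necessarily the padding this repository uses, and the function name is hypothetical.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, each one 4-byte word wide
WARP_SIZE = 32  # assumption: 32 threads per warp


def column_access_conflict_degree(row_stride: int, col: int = 0) -> int:
    """Worst-case number of threads in one warp that hit the same bank.

    Thread t reads tile[t][col] from a row-major tile whose rows are
    row_stride floats apart. A result of 1 means conflict-free; a result
    of 32 means the access is fully serialized.
    """
    banks = Counter(
        (t * row_stride + col) % NUM_BANKS for t in range(WARP_SIZE)
    )
    return max(banks.values())


if __name__ == "__main__":
    print("stride 32:", column_access_conflict_degree(32))  # unpadded tile
    print("stride 33:", column_access_conflict_degree(33))  # padded by one float
```

With a row stride of 32 floats every thread lands in the same bank, while a stride of 33 shifts each row by one bank and spreads the warp across all 32. This only models column reads of distinct addresses; other access shapes need their own check.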

Interview-ready framing

When defending any kernel stage under review, use this four-part structure:

  1. Name the current bottleneck — what resource is saturated or wastefully used?
  2. Name the specific structural change — what does this kernel do differently at the hardware level?
  3. State the evidence requirement — what measurement would confirm the change helped?
  4. State the constraint — what assumption or shape condition limits this improvement?

That sequence keeps the discussion at the level of engineering reasoning rather than benchmark screenshots. The academy is designed to give you a defensible answer for each of the five stages.
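
Part 3 of the structure, the evidence requirement, is easier to defend when the measurement is summarized the same way every time. The sketch below is one possible summary, assuming timings have already been recorded in milliseconds and that the first few runs are warmup to be discarded. The function names, the default warmup count, and the sample numbers are placeholders for illustration, not measurements from this repository.

```python
import statistics


def steady_state_median(samples_ms: list[float], warmup: int = 3) -> float:
    """Median of the timings that remain after dropping warmup runs."""
    if warmup < 0 or len(samples_ms) <= warmup:
        raise ValueError("need at least one sample after discarding warmup")
    return statistics.median(samples_ms[warmup:])


def speedup(baseline_ms: list[float], candidate_ms: list[float],
            warmup: int = 3) -> float:
    """Ratio of steady-state medians; values above 1.0 favor the candidate."""
    return (steady_state_median(baseline_ms, warmup)
            / steady_state_median(candidate_ms, warmup))


if __name__ == "__main__":
    # Placeholder numbers only, chosen to show the effect of warmup runs.
    baseline = [50.0, 12.0, 10.0, 8.0, 8.2, 7.9, 8.1, 8.0]
    candidate = [30.0, 6.0, 5.0, 4.0, 4.1, 3.9, 4.0, 4.0]
    print(f"speedup: {speedup(baseline, candidate):.2f}x")
```

The median is used rather than the mean so that a single slow run does not dominate the summary. A ratio on its own still needs the constraint from part 4 attached, for example the matrix shape it was measured on.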

What the academy is not

The academy is not a reference manual for CUDA programming. For reference, use the CUDA C++ Programming Guide and the CUDA Memory Cheat Sheet in this section.

The academy is not a substitute for reading the source code. Each kernel page explains the architectural reasoning; the code itself contains the implementation. Both are necessary to give a complete account of any stage.

MIT Licensed