Kernel Ladder
The kernel ladder is the repository's main reasoning chain. Each rung exists because the previous one exposed a bottleneck that the new rung addresses.
The ladder at a glance
| Stage | Main bottleneck exposed | Architectural move | Trade-off | Deep dive |
|---|---|---|---|---|
| Naïve | Repeated global-memory traffic, poor reuse | One thread computes one output element | Readable but memory-bound | Naïve Kernel |
| Tiled | Global-memory bandwidth pressure | Shared-memory tiling and block-level reuse | Requires barriers and tile-size choices | Tiled Kernel |
| Bank-Free | Shared-memory contention | Pad tiles to remove bank-conflict patterns | Slightly higher shared-memory footprint | Bank Conflict Free |
| Double Buffer | Load/compute serialization | Ping-pong shared-memory buffers | More shared memory and more scheduling complexity | Double Buffer |
| Tensor Core | CUDA-core throughput ceiling | Warp-level WMMA on Tensor Cores | FP16 staging, shape guards, and fallback policy | Tensor Core WMMA |
Why the ladder matters
Many SGEMM demos present multiple kernels as separate tricks. This repository keeps them in one narrative so the reader can answer four questions at every step:
- What bottleneck was visible?
- What architectural mechanism addresses it?
- What new cost or constraint appears?
- How should that change be validated?
That makes the ladder useful for both technical study and interview storytelling.
Rung-by-rung reasoning chain
1. Naïve: make the bottleneck obvious
The naïve kernel is intentionally simple. It gives every thread one output element and performs a straightforward inner product. That simplicity makes the memory problem visible:
- rows of A are reread many times
- columns of B are accessed with poor global-memory locality
- arithmetic work is easy to describe, but data reuse is weak
Use it as the baseline for explaining why SGEMM is often limited by data movement before clever compute scheduling even begins.
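A minimal sketch of this one-thread-per-output pattern, assuming row-major `float` matrices and the usual `M`, `N`, `K` naming (the kernel and parameter names here are illustrative, not the repository's exact signatures):

```cuda
#include <cuda_runtime.h>

// One thread computes one element of C = A * B.
// A is M x K, B is K x N, C is M x N, all row-major.
__global__ void sgemm_naive(int M, int N, int K,
                            const float* A, const float* B, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
        // Every thread walks a full row of A and a full column of B in
        // global memory; nothing is staged, nothing is reused across threads.
        acc += A[row * K + k] * B[k * N + col];
    }
    C[row * N + col] = acc;
}
```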
2. Tiled: buy reuse with cooperation
Tiling changes the story from “each thread fetches what it needs” to “the block stages a reusable working set in shared memory.”
The architectural effect is larger than the code change suggests:
- global-memory accesses become more coalesced
- the same staged tile is reused across many multiply-accumulate steps
- synchronization becomes a real correctness requirement
Tiling is the first step where the system starts looking like GPU architecture rather than plain matrix math.
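A sketch of the cooperative staging idea, reusing the naïve sketch's naming; the tile size and the exact load pattern are illustrative choices, not necessarily the repository's:

```cuda
#define TILE 32

// Each block stages TILE x TILE tiles of A and B in shared memory, then every
// thread accumulates its output element from the staged tiles.
__global__ void sgemm_tiled(int M, int N, int K,
                            const float* A, const float* B, float* C) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Cooperative, coalesced loads; out-of-range elements become zero.
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();   // the tile must be fully staged before anyone uses it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();   // the tile must be fully consumed before it is overwritten
    }
    if (row < M && col < N) C[row * N + col] = acc;
}
```

Each staged element of `As` and `Bs` now feeds TILE multiply-accumulate steps instead of one, which is where the bandwidth saving comes from.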
3. Bank-Free: fix the shared-memory shape, not just the algorithm
Once shared memory becomes central, its layout matters. The bank-free kernel exists because a tiled kernel can still waste cycles when many threads in a warp hit the same shared-memory bank.
Padding the shared tile changes the address stride so that the common access pattern no longer collapses into avoidable conflicts. Even when the sample benchmark does not show a dramatic speedup over the tiled kernel, the reasoning still matters: the layout is more robust and easier to defend as a system design choice.
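The code change is deliberately small; a sketch of the layout change only, assuming the same `TILE` and tile names as the tiled sketch above:

```cuda
#define TILE 32

// Only the shared-tile declarations change; the load / __syncthreads() /
// multiply-accumulate structure stays as in the tiled kernel.
__global__ void sgemm_bank_free(int M, int N, int K,
                                const float* A, const float* B, float* C) {
    // The row stride becomes TILE + 1 floats instead of TILE, so stride-TILE
    // (column-wise) accesses no longer map many threads of a warp onto the
    // same shared-memory bank.
    __shared__ float As[TILE][TILE + 1];   // was As[TILE][TILE]
    __shared__ float Bs[TILE][TILE + 1];   // was Bs[TILE][TILE]

    // ... same cooperative loads, barriers, and inner product as above ...
}
```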
4. Double Buffer: turn the tile loop into a schedule
After shared-memory reuse and bank layout are under control, the next question is timing: can the next tile be staged while the current one is still being consumed?
Double buffering answers that by splitting the tile storage into two buffers with alternating roles:
- current buffer feeds compute
- next buffer is loaded for the following iteration
This is the point where the ladder becomes a scheduling story, not only a memory-layout story.
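A sketch of that schedule under the same illustrative naming, with the padded tiles kept from the bank-free rung; one barrier per iteration separates "next buffer is ready" from "next buffer becomes current":

```cuda
#define TILE 32

// Two shared tiles per operand alternate roles: one feeds the inner product
// while the other is being filled with the next K-tile.
__global__ void sgemm_double_buffer(int M, int N, int K,
                                    const float* A, const float* B, float* C) {
    __shared__ float As[2][TILE][TILE + 1];
    __shared__ float Bs[2][TILE][TILE + 1];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int numTiles = (K + TILE - 1) / TILE;
    float acc = 0.0f;

    // Prologue: stage tile 0 into buffer 0.
    As[0][threadIdx.y][threadIdx.x] = (row < M && threadIdx.x < K) ? A[row * K + threadIdx.x] : 0.0f;
    Bs[0][threadIdx.y][threadIdx.x] = (threadIdx.y < K && col < N) ? B[threadIdx.y * N + col] : 0.0f;
    __syncthreads();

    for (int t = 0; t < numTiles; ++t) {
        int cur = t & 1, nxt = 1 - cur;

        // Issue the loads for tile t+1 into the other buffer before computing.
        if (t + 1 < numTiles) {
            int aCol = (t + 1) * TILE + threadIdx.x;
            int bRow = (t + 1) * TILE + threadIdx.y;
            As[nxt][threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
            Bs[nxt][threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        }

        // The loads above are independent of this compute, so the scheduler
        // can overlap them with the inner product over the current buffer.
        for (int k = 0; k < TILE; ++k)
            acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];

        __syncthreads();   // next buffer must be complete before it becomes current
    }
    if (row < M && col < N) C[row * N + col] = acc;
}
```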
5. Tensor Core: add a guarded fast path
Tensor Core acceleration is intentionally the last rung because it depends on the earlier reasoning:
- readers already understand why data movement matters
- the project can compare WMMA results against strong FP32 baselines
- the implementation can explain why a faster path must still be guarded
The repository treats WMMA as a constrained acceleration path, not a universal replacement. Shapes that satisfy the WMMA constraints can use compute-only WMMA benchmarking; unsupported shapes must stay honest and fall back.
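As a sketch of what that guarded fast path can look like, here is a minimal warp-per-tile WMMA kernel; the name, launch mapping, and direct FP16 inputs are illustrative (the repository's FP16 staging and fallback logic sit outside this fragment), and it assumes M, N, K are multiples of 16 on a Tensor-Core-capable GPU (compute capability 7.0+):

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of C from FP16 inputs with an FP32
// accumulator. Launch so that warps along x cover M/16 tiles and threads
// along y cover N/16 tiles, e.g. blockDim = (64, 4).
__global__ void sgemm_wmma(int M, int N, int K,
                           const half* A, const half* B, float* C) {
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;  // tile row
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;               // tile col
    if (warpM * 16 >= M || warpN * 16 >= N) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // Load 16x16 sub-tiles of A and B and run one Tensor Core MMA step.
        wmma::load_matrix_sync(aFrag, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(bFrag, B + k * N + warpN * 16, N);
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);
    }
    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, cFrag, N, wmma::mem_row_major);
}
```

The shape guard is the important part of the story: when the check fails, the honest move is to skip this path and report the FP32 kernels' numbers instead.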
What the ladder is trying to teach
| Lesson | Where it appears |
|---|---|
| Global-memory access pattern dominates first-order behavior | Naïve → Tiled |
| Shared memory is useful only when its layout is also correct | Tiled → Bank-Free |
| Scheduling and overlap matter after locality is improved | Bank-Free → Double Buffer |
| Higher-throughput hardware introduces API and shape constraints | Double Buffer → Tensor Core |
| Correctness, measurement scope, and fallback rules belong in the design narrative | Whole ladder |
How to talk about performance without oversimplifying
The ladder is a reasoning chain, not a promise of perfectly monotonic benchmark numbers on every machine. For example:
- tiled and bank-free may trade places depending on occupancy and access details
- double buffering can bring only modest gains on GPUs that already hide latency well
- Tensor Core numbers must be split into end-to-end and compute-only views
The right question is not “which page has the biggest number?” but “which bottleneck does this step address, and under what assumptions?”