
Further Reading Routes

This page is intentionally opinionated. It exists to answer "what should I study next, and why?" so that you do not have to decode a random list of links.

GEMM tiling and hierarchy thinking

Study this route when tiled SGEMM makes sense mechanically, but the bigger design logic still feels fuzzy.

Questions to keep in mind:

  • Which memory level is each tile protecting?
  • Which parts of the design reduce bandwidth pressure, and which only reduce launch overhead?
  • What changed between the teaching kernel and a production template stack?
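
One way to make the first two questions concrete is to count global-memory loads. The sketch below is a back-of-the-envelope model, not a measurement: it assumes square tiles, matrix dimensions divisible by the tile size, and no help from hardware caches. The function names are made up for illustration.

```python
def global_loads_naive(m: int, n: int, k: int) -> int:
    """Element loads when every output element re-reads its full row of A
    and column of B from global memory (no reuse, caches ignored)."""
    return 2 * m * n * k


def global_loads_tiled(m: int, n: int, k: int, tile: int) -> int:
    """Element loads when each block stages one tile x tile piece of A and
    one of B in shared memory per step along K.

    Assumption: m, n and k are all divisible by `tile`.
    """
    if any(d % tile for d in (m, n, k)):
        raise ValueError("this simple model assumes dimensions divisible by tile")
    blocks = (m // tile) * (n // tile)   # one block per output tile
    steps = k // tile                    # tiles walked along K
    return blocks * steps * 2 * tile * tile


if __name__ == "__main__":
    m = n = k = 1024
    naive = global_loads_naive(m, n, k)
    tiled = global_loads_tiled(m, n, k, tile=32)
    print(f"naive: {naive}, tiled: {tiled}, reduction: {naive / tiled:.0f}x")
```

In this model the reduction factor equals the tile edge, which is the bandwidth-pressure argument for shared-memory tiling. The model says nothing about launch overhead, so it cannot answer the second question on its own.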

Occupancy as a constraint, not a vanity metric

Study this route when you keep hearing "occupancy" but cannot tell whether it is the cause of a slowdown or just a correlated number.

Questions to keep in mind:

  • Did occupancy drop because the kernel got worse, or because it now does more useful work per block?
  • Which resource is binding first: registers, shared memory, or block size?
  • What profiler metric would falsify your current story?
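
The "which resource is binding first" question can be rehearsed on paper before opening a profiler. The sketch below is a simplified estimate under loud assumptions: the per-SM limits are illustrative placeholders rather than the numbers for any specific GPU, and register and shared-memory allocation granularity is ignored. For real answers, query the device or use the CUDA runtime's `cudaOccupancyMaxActiveBlocksPerMultiprocessor`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmLimits:
    # Illustrative placeholder values; look up the real ones for your GPU.
    max_threads: int = 2048
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 98304


def binding_resource(threads_per_block: int, regs_per_thread: int,
                     smem_per_block: int, sm: SmLimits = SmLimits()):
    """Return (limiting resource, resident blocks per SM, estimated occupancy)."""
    limits = {
        "block_size": sm.max_threads // threads_per_block,
        "registers": sm.registers // (regs_per_thread * threads_per_block),
        # A kernel that uses no shared memory is not limited by it.
        "shared_memory": (sm.shared_mem_bytes // smem_per_block
                          if smem_per_block else sm.max_blocks),
        "block_slots": sm.max_blocks,
    }
    name = min(limits, key=limits.get)   # first resource to run out
    blocks = limits[name]
    occupancy = blocks * threads_per_block / sm.max_threads
    return name, blocks, occupancy


if __name__ == "__main__":
    print(binding_resource(256, 64, 16 * 1024))   # register-heavy kernel
    print(binding_resource(256, 32, 32 * 1024))   # shared-memory-heavy kernel
```

Under these placeholder limits, the first call reports registers as the binding resource and the second reports shared memory. A lower number here is not automatically worse: a kernel that holds more of its tile in registers can do more useful work per block at lower occupancy.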

Roofline thinking for SGEMM

Study this route when you want a better language for "memory-bound" versus "compute-bound" than intuition alone.

Questions to keep in mind:

  • Did the optimization raise arithmetic intensity, reduce latency, or only move work around?
  • Is the kernel limited by memory traffic, instruction mix, or launch geometry?
  • Which evidence would justify saying the next optimization should target Tensor Cores instead of memory movement?
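
The roofline vocabulary reduces to two small formulas: arithmetic intensity is FLOPs divided by bytes moved, and the attainable rate is the smaller of peak compute and intensity times bandwidth. The sketch below applies them to SGEMM. The peak and bandwidth figures in the demo are hypothetical, and the default traffic estimate assumes each FP32 matrix crosses the memory bus exactly once, which real kernels only approach.

```python
def sgemm_arithmetic_intensity(m: int, n: int, k: int, bytes_moved=None) -> float:
    """FLOPs per byte for C = A * B in FP32.

    If bytes_moved is omitted, assume the best case: A and B are read once
    and C is written once (4 bytes per element).
    """
    flops = 2 * m * n * k                      # one multiply and one add per term
    if bytes_moved is None:
        bytes_moved = 4 * (m * k + k * n + m * n)
    return flops / bytes_moved


def roofline_bound(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Attainable FLOP/s under the basic roofline model."""
    return min(peak_flops, intensity * bandwidth)


if __name__ == "__main__":
    peak, bw = 10e12, 500e9                    # hypothetical device: 10 TFLOP/s, 500 GB/s
    m = n = k = 1024
    ideal = sgemm_arithmetic_intensity(m, n, k)
    # Worst case: every multiply-add re-reads both operands from memory.
    naive = sgemm_arithmetic_intensity(m, n, k, bytes_moved=4 * 2 * m * n * k)
    for label, ai in (("ideal", ideal), ("naive", naive)):
        bound = roofline_bound(ai, peak, bw)
        regime = "compute-bound" if bound == peak else "memory-bound"
        print(f"{label}: {ai:.2f} FLOP/byte, bound {bound / 1e9:.0f} GFLOP/s, {regime}")
```

An optimization that raises arithmetic intensity moves the kernel to the right along the roofline; one that only reduces latency or reorders work leaves the intensity unchanged. To place a real kernel on the plot, replace the modeled byte count with measured memory traffic from the profiler.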

Tensor Core constraints and fallback design

Study this route when WMMA looks fast in a chart but fragile in real workloads.

Questions to keep in mind:

  • Which input shapes are "Tensor Core friendly" and which are not?
  • What part of the timing is actual matrix multiply work versus conversion or wrapper overhead?
  • When is the FP32 fallback the more honest engineering choice?
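
A shape check makes the "friendly or not" question less abstract. The sketch below assumes a 16 x 16 x 16 fragment shape, which is one of the shapes WMMA supports for half-precision inputs. The path names and helper functions are hypothetical, and the dispatch rule shown (fall back whenever any dimension is not a multiple of 16) is only one possible policy; padding is the usual alternative.

```python
WMMA_TILE = 16  # assumed fragment edge (m16n16k16)


def choose_path(m: int, n: int, k: int, tile: int = WMMA_TILE) -> str:
    """Pick the Tensor Core path only when every dimension tiles cleanly."""
    if all(d > 0 and d % tile == 0 for d in (m, n, k)):
        return "wmma"
    return "fp32_fallback"


def padded_shape(m: int, n: int, k: int, tile: int = WMMA_TILE):
    """Round each dimension up to the next multiple of the tile edge."""
    return tuple(-(-d // tile) * tile for d in (m, n, k))


def padding_flop_overhead(m: int, n: int, k: int, tile: int = WMMA_TILE) -> float:
    """Ratio of multiply-add work after padding to the useful work."""
    pm, pn, pk = padded_shape(m, n, k, tile)
    return (pm * pn * pk) / (m * n * k)


if __name__ == "__main__":
    for shape in ((1024, 1024, 1024), (1000, 1000, 1000), (17, 17, 17)):
        print(shape, choose_path(*shape), padded_shape(*shape),
              f"{padding_flop_overhead(*shape):.2f}x work if padded")
```

Large, slightly misaligned shapes pay little for padding, while small odd shapes can more than double their work. That ratio, together with the measured cost of FP32-to-FP16 conversion, is the kind of evidence that decides when the FP32 fallback is the more honest choice.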

Profiling from symptoms to evidence

Study this route when you know a result changed but do not yet know why.

Questions to keep in mind:

  • Is the symptom on the timeline, inside one kernel, or only in aggregate benchmark output?
  • Which metric will tell you whether the bottleneck is bandwidth, occupancy, latency hiding, or invalid assumptions?
  • What would you need to capture before making a public performance claim?
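
For the last question, a minimal habit is to report more than one number per benchmark. The sketch below summarizes repeated kernel timings for an SGEMM of a given shape. The function name and the sample timings are invented for illustration, and it assumes the timings were collected after warm-up runs with proper device synchronization.

```python
import statistics


def summarize_runs(times_ms, m: int, n: int, k: int) -> dict:
    """Summarize repeated timings (milliseconds) of one SGEMM shape."""
    if not times_ms:
        raise ValueError("need at least one timing sample")
    median = statistics.median(times_ms)
    best = min(times_ms)
    return {
        "samples": len(times_ms),
        "median_ms": median,
        "min_ms": best,
        # Spread relative to the median; a large value means the single
        # headline number hides run-to-run noise.
        "relative_spread": (max(times_ms) - best) / median,
        "gflops_at_median": 2 * m * n * k / (median * 1e-3) / 1e9,
    }


if __name__ == "__main__":
    samples = [2.0, 2.1, 2.0, 2.4, 2.0]   # made-up timings
    print(summarize_runs(samples, 1024, 1024, 1024))
```

A summary like this belongs next to the GPU model, driver and toolkit versions, matrix shapes, and the profiler report that backs the explanation. Without those, a changed number is a symptom, not evidence.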

Pick a route by current goal

| Goal | Best route |
| --- | --- |
| Build stronger intuition for shared-memory tiling | GEMM tiling and hierarchy thinking |
| Learn to talk about occupancy without cargo-culting it | Occupancy as a constraint, not a vanity metric |
| Explain performance limits with a better model | Roofline thinking for SGEMM |
| Understand when Tensor Cores help and when they complicate the story | Tensor Core constraints and fallback design |
| Turn profiler output into a debugging plan | Profiling from symptoms to evidence |

MIT Licensed