Resources Hub

This is the curated handoff for readers who want to keep learning after the architecture and methodology sections.

Use this section when you need more than a flat bibliography: each route below explains what to open next, why it matters, and which project question it helps answer.

Start with the question you are trying to answer

| If your question sounds like this | Start here | Why this route is useful |
| --- | --- | --- |
| "Why does this kernel spend so much effort on memory layout?" | CUDA Memory Cheat Sheet | Fast refresher on coalescing, shared-memory layout, Tensor Core alignment, and the checks worth doing before you trust a timing result. |
| "Which official docs justify the constraints used in this whitepaper?" | Curated References | Jumps straight to the CUDA, WMMA, and runtime references behind the execution-model and memory-model claims. |
| "What do strong SGEMM implementations look like in the wild?" | Curated References | Points to mature repositories and sample implementations so you can compare this project's teaching ladder with production-style code. |
| "What should I study after finishing this site?" | Further Reading Routes | Organizes adjacent topics into deliberate routes instead of expecting you to guess which external paper or tool matters next. |
| "How do I turn a benchmark symptom into evidence?" | Diagnosis Loop + Profiler and tooling references | Connects the site's internal workflow with the external tools that help confirm occupancy, memory, and scheduling hypotheses. |

Curated shelves

Official docs for constraints and terminology

Start here when you need the authoritative wording behind a claim in the whitepaper.

Papers and mental models for reasoning, not just citation

Use these when you want the design logic behind the kernel ladder, not just the API surface.

Exemplary codebases when you want to compare styles

These links help you see where this repository is intentionally simplified for teaching and where production code adds more abstraction.

Tooling when performance numbers stop being self-explanatory

Use these when "the benchmark changed" is not enough and you need evidence.

Suggested study routes

Route 1: Understand memory before touching another optimization

  1. Read Memory Flow.
  2. Use the CUDA Memory Cheat Sheet to sanity-check what one warp is loading, where reuse appears, and how shared memory changes the access pattern.
  3. Continue to Further Reading: GEMM tiling when you want stronger mental models.

Route 2: Validate Tensor Core claims without hand-waving

  1. Read Tensor Core Path and Tensor Core WMMA.
  2. Open the WMMA and mixed-precision references to confirm shape, alignment, and fallback constraints.
  3. Continue to Further Reading: Tensor Core constraints for the adjacent topics that usually get skipped.

Route 3: Move from benchmark curiosity to profiler-led diagnosis

  1. Start with Diagnosis Loop.
  2. Pair it with Nsight and occupancy references.
  3. Continue to Further Reading: Profiling workflow when you want a next-step checklist.

MIT Licensed