Reference Map
This page is a structured index of the external sources that back the claims in this whitepaper. Each entry is classified by type and linked to the section it supports most directly.
Primary technical references
CUDA and GPU architecture
| Source | What it establishes | Relevant section |
|---|---|---|
| CUDA C++ Programming Guide | Memory hierarchy, warp execution model, shared memory layout | Architecture, Academy |
| CUDA Best Practices Guide | Memory coalescing, occupancy, bank conflict avoidance | Academy (kernel pages) |
| PTX ISA Reference | WMMA instruction semantics, matrix fragment layout | Tensor Core path |
cuBLAS
| Source | What it establishes | Relevant section |
|---|---|---|
| cuBLAS Developer Guide | GEMM API, precision modes, leading-dimension conventions | Validation (oracle definition) |
Tensor Core / WMMA
| Source | What it establishes | Relevant section |
|---|---|---|
| WMMA API documentation | Fragment types, load/store/compute API | Academy (kernel-tensor-core), Architecture (tensor-core-path) |
| Volta architecture whitepaper | First-generation Tensor Core throughput model | Research (evolution), Performance model |
Foundational papers
| Paper | Contribution | Primary support for |
|---|---|---|
| Goto & van de Geijn (2008) — Anatomy of High-Performance Matrix Multiplication | Hierarchical blocking theory for GEMM on CPUs | Tiled kernel design, shared-memory staging rationale |
| Lai & Seznec (2013) — Performance Upper Bound Analysis and Optimization of SGEMM on Fermi and Kepler GPUs | GPU SGEMM tiling and occupancy analysis | Tiled kernel, double-buffer motivation |
| Whaley & Dongarra (1998) — ATLAS | Automated tuning of block sizes | Historical context for tile-size sensitivity |
| Markidis et al. (2018) — NVIDIA Tensor Core Programmability, Performance & Precision | WMMA programming model and mixed-precision behavior | Tensor Core path design |
Related open-source implementations
| Repository | Relationship | Notes |
|---|---|---|
| CUTLASS | Authoritative production GEMM kernel library | Performance ceiling; this project does not claim to compete with it |
| tinygrad / BEAM SGEMM | Community SGEMM exploration | Different educational framing; useful for contrast |
| siboehm/CUDA-GEMM-Optimization | Step-by-step SGEMM tutorial | Most directly comparable educational structure |
| wangzyon/NVIDIA_SGEMM_PRACTICE | Chinese-language SGEMM practice repository | Bilingual contrast; different kernel progression |
How to use this map
This reference map is not an end-of-paper bibliography. It is a live index that connects each claim in the whitepaper to its supporting source.
If you want to challenge a claim:
- Find the section in the whitepaper that makes the claim.
- Find the supporting source in the tables above.
- Open the source and check whether the claim is appropriately scoped.
If no entry in the tables supports the claim, it is either derived from the implementation itself (verifiable by reading the code) or an open question explicitly labeled as such in the text.
Related pages
- Curated References — full annotated reference list with reading notes
- Papers — focused academic reading list
- Related Projects — comparative context for the project scope
- Evolution Notes — how the external sources shaped the current design
- Performance Casebook — how to interpret measured results against external benchmarks