Reference Map

This page is a structured index of the external sources that back the claims in this whitepaper. Each entry is classified by type and linked to the section it supports most directly.

Primary technical references

CUDA and GPU architecture

Source	What it establishes	Relevant section
CUDA C++ Programming Guide	Memory hierarchy, warp execution model, shared memory layout	Architecture, Academy
CUDA Best Practices Guide	Memory coalescing, occupancy, bank conflict avoidance	Academy (kernel pages)
PTX ISA Reference	WMMA instruction semantics, matrix fragment layout	Tensor Core path

cuBLAS

Source	What it establishes	Relevant section
cuBLAS Developer Guide	GEMM API, precision modes, leading-dimension conventions	Validation (oracle definition)
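
The leading-dimension convention named in the row above can be illustrated with a small host-side C++ sketch. cuBLAS follows the BLAS column-major convention, in which consecutive columns of a matrix sit `ld` elements apart and `ld` may exceed the row count when the matrix is a view into a larger allocation. The helper names `col_major_index` and `view_at` are hypothetical and introduced only for illustration; they are not part of cuBLAS or of this project's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (row, col) in a column-major matrix whose consecutive
// columns are `ld` elements apart (BLAS-style leading dimension).
// Hypothetical helper, not a cuBLAS function.
inline std::size_t col_major_index(std::size_t row, std::size_t col, std::size_t ld) {
    assert(row < ld);  // a row index can never reach the leading dimension
    return col * ld + row;
}

// Reads element (i, j) of a submatrix that starts at (row0, col0) inside a
// larger column-major parent. The view keeps the parent's leading dimension,
// which is why ld and the view's own row count are separate quantities.
inline float view_at(const std::vector<float>& parent, std::size_t ld,
                     std::size_t row0, std::size_t col0,
                     std::size_t i, std::size_t j) {
    const std::size_t base = col_major_index(row0, col0, ld);
    const std::size_t offset = col_major_index(i, j, ld);
    assert(base + offset < parent.size());
    return parent[base + offset];
}
```

For a 4 x 3 parent stored with `ld = 4`, a 2 x 2 view starting at row 1, column 1 is addressed with the same `ld = 4` rather than with its own row count of 2. Passing the view's row count as the leading dimension is a common source of mismatches when comparing a kernel against a cuBLAS oracle.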

Tensor Core / WMMA

Source	What it establishes	Relevant section
WMMA API documentation	Fragment types, load/store/compute API	Academy (kernel-tensor-core), Architecture (tensor-core-path)
Volta architecture whitepaper	First-generation Tensor Core throughput model	Research (evolution), Performance model

Foundational papers

Paper	Contribution	Primary support for
Goto & van de Geijn (2008) — Anatomy of High-Performance Matrix Multiplication	Hierarchical blocking theory for GEMM on CPUs	Tiled kernel design, shared-memory staging rationale
Lai & Seznec (2013) — Performance Upper Bound Analysis and Optimization of SGEMM on Fermi and Kepler GPUs	GPU SGEMM tiling and occupancy analysis	Tiled kernel, double-buffer motivation
Whaley & Dongarra (1998) — ATLAS	Automated tuning of block sizes	Historical context for tile-size sensitivity
Markidis et al. (2018) — NVIDIA Tensor Core Programmability, Performance & Precision	WMMA programming model and mixed-precision behavior	Tensor Core path design
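
As a rough illustration of the blocking idea that the Goto & van de Geijn and Lai & Seznec entries support, the following host-side C++ sketch compares a plain triple loop with a single-level tiled loop nest. It is a simplification under stated assumptions: row-major, densely packed matrices, one level of blocking, and a caller-chosen `tile` parameter. It does not reproduce the full multi-level hierarchy described in the papers, and it is not this project's GPU kernel. The function names `gemm_naive` and `gemm_blocked` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference triple loop: C (M x N) += A (M x K) * B (K x N).
// All matrices are row-major and densely packed (an assumption of this sketch).
void gemm_naive(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C,
                std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < K; ++k)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}

// Single-level blocked variant: the same arithmetic, but the loops walk over
// tile x tile blocks so that each block of A and B is reused while it is
// still resident in fast memory. Edge tiles are clipped with std::min.
void gemm_blocked(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C,
                  std::size_t M, std::size_t N, std::size_t K,
                  std::size_t tile) {
    assert(tile > 0);
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
    for (std::size_t i0 = 0; i0 < M; i0 += tile) {
        const std::size_t i1 = std::min(i0 + tile, M);
        for (std::size_t k0 = 0; k0 < K; k0 += tile) {
            const std::size_t k1 = std::min(k0 + tile, K);
            for (std::size_t j0 = 0; j0 < N; j0 += tile) {
                const std::size_t j1 = std::min(j0 + tile, N);
                // Multiply one block of A by one block of B into a block of C.
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float a = A[i * K + k];
                        for (std::size_t j = j0; j < j1; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
                }
            }
        }
    }
}
```

Both functions accumulate into `C`, so the caller zero-initializes it first. On a GPU the analogous step stages each block in shared memory, which is the rationale the table attributes to these papers.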

Related open-source implementations

Repository	Relationship	Notes
CUTLASS	Authoritative production GEMM kernel library	The performance ceiling; this project does not claim to match or exceed it
tinygrad / BEAM SGEMM	Community SGEMM exploration	Different educational framing; useful for contrast
siboehm/CUDA-GEMM-Optimization	Step-by-step SGEMM tutorial	Most directly comparable educational structure
wangzyon/NVIDIA_SGEMM_PRACTICE	Chinese-language SGEMM practice repository	Bilingual contrast; different kernel progression

How to use this map

This reference map is not a bibliography to be cited at the end of a paper. It is a live index that connects each claim in the whitepaper to its supporting source.

If you want to challenge a claim:

1. Find the section in the whitepaper that makes the claim.
2. Find the supporting source in the tables above.
3. Open the source and check whether the claim is appropriately scoped.

If the claim is not in the table, it is either derived from the implementation itself (verifiable by reading the code) or it is an open question explicitly labeled as such in the text.

Related pages

Curated References: full annotated reference list with reading notes
Papers: focused academic reading list
Related Projects: comparative context for the project scope
Evolution Notes: how the external sources shaped the current design
Performance Casebook: how to interpret measured results against external benchmarks
