References
This list maps project decisions to authoritative technical sources.
CUDA and GPU fundamentals
Why this group matters:
- Defines execution model assumptions used by all kernel stages.
- Anchors memory and synchronization discussions in official terminology.
Tensor Core and WMMA
- NVIDIA WMMA API Reference
- NVIDIA Developer Blog: Programming Tensor Cores in CUDA 9
- NVIDIA Mixed-Precision Training Guide
Why this group matters:
- Supports the discussion of WMMA fragments, alignment requirements, and mixed-precision behavior.
- Explains why fallback policies are necessary for shapes that do not fit WMMA tile dimensions.
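
As a rough illustration of the fallback idea, here is a minimal host-side sketch. It assumes the common half-precision 16x16x16 WMMA fragment shape; the names `GemmPath`, `kWmmaTile`, and `select_gemm_path` are hypothetical and not taken from this project, and padding the inputs is an equally valid alternative to falling back.

```cpp
#include <cstddef>

// Hypothetical kernel paths; names are illustrative only.
enum class GemmPath { Wmma, Fallback };

// Assumed tile edge for a 16x16x16 half-precision WMMA fragment.
constexpr std::size_t kWmmaTile = 16;

// Select the WMMA path only when every dimension is a positive multiple
// of the tile edge; otherwise route to a non-Tensor-Core kernel.
inline GemmPath select_gemm_path(std::size_t m, std::size_t n, std::size_t k) {
    const bool aligned = m > 0 && n > 0 && k > 0 &&
                         m % kWmmaTile == 0 &&
                         n % kWmmaTile == 0 &&
                         k % kWmmaTile == 0;
    return aligned ? GemmPath::Wmma : GemmPath::Fallback;
}
```

Under these assumptions, a 1024x1024x1024 problem would take the WMMA path, while a problem with M = 1000 would fall back because 1000 is not a multiple of 16.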
GEMM optimization research and methodology
- Anatomy of High-Performance Matrix Multiplication (GotoBLAS paper)
- CUTLASS: Fast Linear Algebra in CUDA C++
- BLIS Framework
Why this group matters:
- Connects this project's staged optimization mindset to broader GEMM methodology.
- Provides production-grade references for interview follow-up discussions.
Profiling and performance analysis
Why this group matters:
- Supports diagnosis that goes beyond a single GFLOPS figure.
- Enables metric-driven explanations of bottlenecks and trade-offs.
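
One way to look past a single throughput number is to pair it with arithmetic intensity. The sketch below is an illustration, not this project's tooling: `GemmMetrics` and `gemm_metrics` are hypothetical names, and the byte count assumes the idealized case where A, B, and C each cross the memory bus exactly once.

```cpp
#include <cstddef>

// Hypothetical result type: achieved throughput and idealized arithmetic intensity.
struct GemmMetrics {
    double gflops;          // achieved GFLOP/s for the measured run
    double flops_per_byte;  // FLOPs per byte under the minimum-traffic assumption
};

// Derive both metrics for an M x N x K GEMM that took `seconds` to run.
// Assumption: each of A (MxK), B (KxN), and C (MxN) is moved once.
inline GemmMetrics gemm_metrics(std::size_t m, std::size_t n, std::size_t k,
                                double seconds, std::size_t bytes_per_element) {
    const double dm = static_cast<double>(m);
    const double dn = static_cast<double>(n);
    const double dk = static_cast<double>(k);
    const double flops = 2.0 * dm * dn * dk;  // one multiply and one add per term
    const double bytes = (dm * dk + dk * dn + dm * dn) *
                         static_cast<double>(bytes_per_element);
    return GemmMetrics{flops / seconds / 1e9, flops / bytes};
}
```

A low achieved GFLOP/s combined with a high idealized intensity suggests the kernel is moving more data than necessary, which is the kind of question profiler metrics help answer.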
Engineering process and validation discipline
Why this group matters:
- Grounds the repository's correctness and workflow claims in established tooling.
- Reinforces the boundary between local-GPU validation and hosted-CI validation.