Related Papers & Research
This page traces the design decisions in this project back to academic foundations. Each citation connects a kernel optimization or architectural choice to its theoretical or empirical source.
Memory Hierarchy Optimization
These papers explain why blocking and tiling are fundamental to matrix multiplication performance.
Goto and van de Geijn, "Anatomy of High-Performance Matrix Multiplication" (2008). The foundational paper for understanding why matrix multiplication performance is dominated by the memory hierarchy. This project's kernel ladder follows the same blocking philosophy, adapted to CUDA's shared memory and register file hierarchy.
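Hong, Kim, and Park, "GPU Performance Optimization: A Case Study with Matrix Multiplication" (2012). A GPU-specific treatment of GEMM optimization. Useful for understanding how CUDA's execution model changes the blocking strategy compared to CPU BLAS.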
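The effect of blocking can be illustrated without a GPU. The following is a minimal Python sketch, not this project's kernel code: `tiled_matmul` and its load counter are hypothetical names, and the Python slices merely stand in for a fast on-chip memory such as CUDA shared memory. It stages one tile of each operand at a time and counts how many elements are fetched from "slow" memory.

```python
def tiled_matmul(a, b, tile):
    """Multiply square matrices a and b tile by tile.

    Returns the product and the number of elements fetched from "slow" memory.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Stage one tile of A and one tile of B in "fast" memory.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                loads += sum(len(r) for r in a_tile) + sum(len(r) for r in b_tile)
                # Each staged element is reused up to `tile` times from fast memory.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0
                        for k, a_val in enumerate(a_row):
                            acc += a_val * b_tile[k][j]
                        c[i0 + i][j0 + j] += acc
    return c, loads
```

For 8x8 inputs, a tile size of 1 fetches 1024 elements, while a tile size of 4 fetches 256: the arithmetic is identical, but each staged element is reused four times before it is discarded.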
Bank Conflict Avoidance
These sources explain the shared memory bank conflict problem and the padding solution used in this project.
Practical treatment of bank conflict avoidance in shared memory. The padding strategy in the Bank-Free Kernel follows this approach.
Official documentation for shared memory bank conflicts, access patterns, and the 32-bank architecture on modern GPUs.
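To see how padding removes conflicts, the bank mapping can be modeled in a few lines. This sketch assumes 4-byte elements and that consecutive 4-byte words map to consecutive banks, wrapping at 32. `max_bank_conflict` is a hypothetical helper, not part of this project or of CUDA, and the 32x32 and 32x33 tile shapes are illustrative assumptions.

```python
NUM_BANKS = 32  # assumed: 32 banks, each one 4-byte word wide


def max_bank_conflict(row_stride_words, num_threads=32):
    """Return the largest number of threads that land in the same bank
    when thread t reads element [t][0] of a row-major 2D tile."""
    hits = [0] * NUM_BANKS
    for t in range(num_threads):
        word_index = t * row_stride_words  # offset of tile[t][0] in 4-byte words
        hits[word_index % NUM_BANKS] += 1
    return max(hits)


print(max_bank_conflict(32))  # unpadded 32x32 tile: prints 32 (32-way conflict)
print(max_bank_conflict(33))  # padded 32x33 tile: prints 1 (conflict-free)
```

With a row stride of 32 words, every thread in the column access hits bank 0. One extra word per row shifts each row's starting bank by one, so the 32 accesses spread across all 32 banks.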
Double Buffering and Latency Hiding
These sources explain the overlap strategy used in the double-buffer kernel.
While focused on reduction, this whitepaper introduces the double-buffer concept for overlapping memory transfers with computation. The Double Buffer Kernel applies this to GEMM's tile load-compute cycle.
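A rough timing model shows what the overlap buys. The sketch below is an idealized assumption, not a measurement of this project's kernels: it treats every tile as having a fixed load time and a fixed compute time, and assumes a load and a compute step can overlap perfectly. The function names are hypothetical.

```python
def serial_time(num_tiles, load, compute):
    """Each tile is loaded, then computed, with no overlap."""
    return num_tiles * (load + compute)


def double_buffered_time(num_tiles, load, compute):
    """Tile i+1 loads into the spare buffer while tile i is computed."""
    if num_tiles == 0:
        return 0
    # The first load and the last compute are exposed; every step in
    # between costs whichever of the two overlapping phases is slower.
    return load + (num_tiles - 1) * max(load, compute) + compute


print(serial_time(16, 3, 5))           # prints 128
print(double_buffered_time(16, 3, 5))  # prints 83
```

In this model the benefit is largest when load and compute times are similar. If one phase dominates, the total approaches the cost of that phase alone and the second buffer hides little.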
Tensor Core and Mixed Precision
These sources explain the WMMA API and mixed-precision performance characteristics.
Microbenchmarking study of Volta Tensor Cores. Useful for understanding the actual throughput and latency characteristics behind the Tensor Core WMMA Kernel.
Official introduction to the WMMA API. This is the primary reference for fragment types, shape constraints, and the mixed-precision semantics used in this project.
Performance Modeling
These sources provide the theoretical framework for interpreting benchmark results.
Williams, Waterman, and Patterson, "Roofline" (2009). The original roofline model paper. Provides the vocabulary for discussing arithmetic intensity, memory bandwidth limits, and compute ceilings that underpins Benchmark Discipline.
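The roofline bound itself is a one-line formula: attainable throughput is the smaller of the compute ceiling and arithmetic intensity times memory bandwidth. The Python sketch below applies it to GEMM under stated assumptions: the device peaks are made-up placeholder numbers, the function names are hypothetical, and the traffic estimate assumes A and B are each read once and C is written once.

```python
def attainable_gflops(intensity, peak_gflops, peak_bandwidth_gbs):
    """Roofline bound: min of the compute ceiling and the bandwidth slope."""
    return min(peak_gflops, intensity * peak_bandwidth_gbs)


def gemm_intensity(m, n, k, bytes_per_element=4):
    """FLOPs per byte for C = A @ B, assuming A and B are each read once
    and C is written once (an optimistic lower bound on traffic)."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# Hypothetical device: 10,000 GFLOP/s peak compute, 500 GB/s memory bandwidth.
PEAK_GFLOPS, PEAK_BW = 10_000.0, 500.0
for size in (64, 4096):
    ai = gemm_intensity(size, size, size)
    print(size, round(ai, 2), attainable_gflops(ai, PEAK_GFLOPS, PEAK_BW))
```

For this hypothetical device the ridge point sits at 20 FLOP/byte. The 64x64x64 case falls below it and is bandwidth-bound, while the 4096x4096x4096 case lies above it and is limited by the compute ceiling.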
How to Use This Page
- Before reading a kernel: Open the corresponding citation to understand the optimization principle.
- After reading a kernel: Use the citation to check whether your mental model matches the published explanation.
- For interviews: These citations provide the academic grounding for explaining why each optimization works.
BibTeX Export
For LaTeX documents or academic writing, you can copy the following BibTeX entries:
@article{Goto2008,
author = {Kazushige Goto and Robert A. van de Geijn},
title = {Anatomy of High-Performance Matrix Multiplication},
journal = {ACM Transactions on Mathematical Software},
year = {2008},
volume = {34},
number = {3},
doi = {10.1145/1356052.1356053}
}
@article{Hong2012,
author = {Taesoo Hong and Hyesoon Kim and Sang-Woo Park},
title = {GPU Performance Optimization: A Case Study with Matrix Multiplication},
journal = {IEEE Transactions on Parallel and Distributed Systems},
year = {2012},
volume = {23},
number = {6},
doi = {10.1109/TPDS.2012.279}
}
@incollection{Ruetsch2009,
author = {Gregory Ruetsch and Massimiliano Fatica},
title = {Optimizing Matrix Multiply on GPUs},
booktitle = {GPU Computing Gems},
year = {2009},
publisher = {Morgan Kaufmann}
}
@article{Williams2009,
author = {Samuel Williams and Andrew Waterman and David Patterson},
title = {Roofline: An Insightful Visual Performance Model for Multicore Architectures},
journal = {Communications of the ACM},
year = {2009},
volume = {52},
number = {4},
doi = {10.1145/1498765.1498785}
}
Next Steps
- Curated References — Full catalog of documentation, tools, and codebases
- Further Reading Routes — Opinionated study paths
- Resources Hub — Scenario-based entry points