Related Papers & Research

This page traces the design decisions in this project back to academic foundations. Each citation connects a kernel optimization or architectural choice to its theoretical or empirical source.

Memory Hierarchy Optimization

These papers explain why blocking and tiling are fundamental to matrix multiplication performance.

[Goto2008] Anatomy of High-Performance Matrix Multiplication. Kazushige Goto, Robert A. van de Geijn (2008), ACM TOMS

DOI: 10.1145/1356052.1356053

The foundational paper for understanding why matrix multiplication performance is dominated by memory hierarchy. This project's kernel ladder follows the same blocking philosophy, adapted to CUDA's shared memory and register file hierarchy.
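As a rough illustration of the blocking idea, the sketch below compares a plain triple loop with a tiled version in pure Python. The function names and the `tile` parameter are hypothetical and are not taken from this project's kernels; the tile loop only mirrors the role that shared memory tiles play on a GPU.

```python
def matmul_naive(a, b):
    """Reference triple loop: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_blocked(a, b, tile=4):
    """Same arithmetic, visited tile by tile so that a small working set
    of A and B is reused before moving on (hypothetical tile size)."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Work confined to one (i0, j0, p0) tile triple.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

Both functions accumulate each output element in the same order of `p`, so they return identical results; only the traversal order, and therefore the data reuse pattern, differs.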

[Hong2012] GPU Performance Optimization: A Case Study with Matrix Multiplication. Taesoo Hong, Hyesoon Kim, Sang-Woo Park (2012), IEEE TPDS

DOI: 10.1109/TPDS.2012.279

A GPU-specific treatment of GEMM optimization. Useful for understanding how CUDA's execution model changes the blocking strategy compared to CPU BLAS.

Bank Conflict Avoidance

These sources explain the shared memory bank conflict problem and the padding solution used in this project.

[Ruetsch2009] Optimizing Matrix Multiply on GPUs. Gregory Ruetsch, Massimiliano Fatica (2009), GPU Computing Gems

Practical treatment of bank conflict avoidance in shared memory. The padding strategy in Bank-Free Kernel follows this approach.

[Nvidia2007] CUDA Programming Guide: Shared Memory. NVIDIA Corporation (2007)

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory

Official documentation for shared memory bank conflicts, access patterns, and the 32-bank architecture on modern GPUs.
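The effect of padding can be checked with a small host-side model. The sketch below assumes 32 banks of 4-byte words and a row-major float tile in which thread `t` of a warp reads element `[t][col]`; the function name and parameters are hypothetical and only model the address-to-bank mapping, not real hardware timing.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, each one 4-byte word wide


def max_conflict_degree(row_stride, col=0, warp_size=32):
    """Largest number of threads in one warp that map to the same bank
    when thread t reads word index t * row_stride + col."""
    banks = Counter((t * row_stride + col) % NUM_BANKS for t in range(warp_size))
    return max(banks.values())


print(max_conflict_degree(32))  # unpadded tile[32][32], prints 32
print(max_conflict_degree(33))  # padded tile[32][33], prints 1
```

With a stride of 32 every thread in the column access lands in the same bank, while one extra padding column shifts each row by one bank and spreads the warp across all 32 banks.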

Double Buffering and Latency Hiding

These sources explain the overlap strategy used in the double-buffer kernel.

[Harris2007] Optimizing Parallel Reduction in CUDA. Mark Harris (2007), NVIDIA Developer Technology

https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

While focused on reduction, this material works through latency-hiding optimizations step by step, for example performing the first add during the global load so that memory access overlaps with useful work. The Double Buffer Kernel applies the same overlap principle to GEMM's tile load-compute cycle.
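The ping-pong structure behind double buffering can be sketched on the host as follows. This is a sequential model with hypothetical names: in a real kernel the staging load would be issued asynchronously so it overlaps with the compute step, whereas here it is just a list copy that shows how the two buffer indices alternate.

```python
def process_tiles_double_buffered(tiles, compute):
    """Consume tiles through two buffers: while buffer `cur` is used for
    compute, the next tile is staged into buffer `nxt`."""
    results = []
    if not tiles:
        return results
    buffers = [None, None]
    buffers[0] = list(tiles[0])  # prologue: stage the first tile
    for i in range(len(tiles)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(tiles):
            buffers[nxt] = list(tiles[i + 1])  # prefetch the next tile
        results.append(compute(buffers[cur]))  # consume the current tile
    return results
```

For example, `process_tiles_double_buffered([[1, 2], [3, 4], [5, 6]], sum)` returns `[3, 7, 11]`, the same values a single-buffer loop would produce; the benefit on a GPU comes from the prefetch running concurrently with the compute.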

Tensor Core and Mixed Precision

These sources explain the WMMA API and mixed-precision performance characteristics.

[Jia2018] Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. Zhe Jia, Marco Maggioni, Benjamin Staiger, Daniele P. Scarpazza (2018), arXiv

DOI: 10.48550/arXiv.1804.06826

Microbenchmarking study of Volta Tensor Cores. Useful for understanding the actual throughput and latency characteristics behind the Tensor Core WMMA Kernel.

[Nvidia2017] Programming Tensor Cores in CUDA 9. NVIDIA Corporation (2017)

https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/

Official introduction to the WMMA API. This is the primary reference for fragment types, shape constraints, and the mixed-precision semantics used in this project.
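To see why mixed precision keeps the accumulator wider than the inputs, the sketch below models a dot product with FP16-rounded inputs and either an FP16 or an FP32 accumulator. It is a numeric model only, using Python's `struct` module for rounding; the helper names are hypothetical and it does not reproduce Tensor Core hardware behavior.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE binary16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot(a, b, round_acc):
    """Dot product with FP16-rounded inputs; `round_acc` rounds the
    running sum to the chosen accumulator precision after every add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_acc(acc + to_fp16(x) * to_fp16(y))
    return acc


ones = [1.0] * 4096
print(dot(ones, ones, to_fp16))  # FP16 accumulator, prints 2048.0
print(dot(ones, ones, to_fp32))  # FP32 accumulator, prints 4096.0
```

The FP16 accumulator stalls at 2048 because the spacing between representable FP16 values there is 2, so adding 1 rounds back to 2048. The FP32 accumulator reaches the exact sum.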

Performance Modeling

These sources provide the theoretical framework for interpreting benchmark results.

[Williams2009] Roofline: An Insightful Visual Performance Model for Multicore Architectures. Samuel Williams, Andrew Waterman, David Patterson (2009), CACM

DOI: 10.1145/1498765.1498785

The original roofline model paper. Provides the vocabulary for discussing arithmetic intensity, memory bandwidth limits, and compute ceilings that underpins Benchmark Discipline.
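The roofline bound itself is a one-line formula: attainable throughput is the smaller of the compute ceiling and bandwidth times arithmetic intensity. The sketch below applies it to GEMM under an idealized assumption that each of A, B, and C crosses the memory bus exactly once; the function names and the hardware numbers in the usage note are hypothetical placeholders, not measurements from this project.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: min(compute ceiling, bandwidth * FLOPs per byte)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def gemm_intensity(n, bytes_per_element=4):
    """FLOPs per byte for an n x n x n GEMM, assuming A, B, and C are each
    moved once (an idealized lower bound on memory traffic)."""
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * bytes_per_element
    return flops / bytes_moved
```

With a made-up device of 10000 GFLOP/s peak and 500 GB/s bandwidth, `attainable_gflops(10000, 500, gemm_intensity(64))` is limited by bandwidth, while `gemm_intensity(1024)` is high enough that the bound becomes the compute ceiling of 10000.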

How to Use This Page

Before reading a kernel: Open the corresponding citation to understand the optimization principle.
After reading a kernel: Use the citation to check whether your mental model matches the published explanation.
For interviews: These citations provide the academic grounding for explaining why each optimization works.

BibTeX Export

For LaTeX documents or academic writing, you can copy the following BibTeX entries:

@article{Goto2008,
  author    = {Kazushige Goto and Robert A. van de Geijn},
  title     = {Anatomy of High-Performance Matrix Multiplication},
  journal   = {ACM Transactions on Mathematical Software},
  year      = {2008},
  volume    = {34},
  number    = {3},
  doi       = {10.1145/1356052.1356053}
}

@article{Hong2012,
  author    = {Taesoo Hong and Hyesoon Kim and Sang-Woo Park},
  title     = {GPU Performance Optimization: A Case Study with Matrix Multiplication},
  journal   = {IEEE Transactions on Parallel and Distributed Systems},
  year      = {2012},
  volume    = {23},
  number    = {6},
  doi       = {10.1109/TPDS.2012.279}
}

@incollection{Ruetsch2009,
  author    = {Gregory Ruetsch and Massimiliano Fatica},
  title     = {Optimizing Matrix Multiply on GPUs},
  booktitle = {GPU Computing Gems},
  year      = {2009},
  publisher = {Morgan Kaufmann}
}

@article{Williams2009,
  author    = {Samuel Williams and Andrew Waterman and David Patterson},
  title     = {Roofline: An Insightful Visual Performance Model for Multicore Architectures},
  journal   = {Communications of the ACM},
  year      = {2009},
  volume    = {52},
  number    = {4},
  doi       = {10.1145/1498765.1498785}
}

Next Steps

Curated References — Full catalog of documentation, tools, and codebases
Further Reading Routes — Opinionated study paths
Resources Hub — Scenario-based entry points
