# SGEMM Optimization

From readable baseline code to Tensor Core WMMA
## Why this project is useful
This repository is designed to be a compact CUDA GEMM learning and reference project:
- Progressive: five kernel variants show what each optimization step changes
- Verifiable: every kernel is checked against cuBLAS
- Practical: benchmark and test entry points are already wired up
- Maintainable: repository rules, workflow, and validation are documented through OpenSpec
## Optimization ladder
| Stage | Kernel | What you learn |
|---|---|---|
| 1 | Naive | Thread-to-output mapping and baseline cost |
| 2 | Tiled | Shared-memory blocking and data reuse |
| 3 | Bank-Free | Padding away 32-way bank conflicts |
| 4 | Double Buffer | Latency hiding through staged tiles |
| 5 | Tensor Core | WMMA usage with a guarded fallback path |
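Stages 2 and 3 of the ladder can be sketched together: a minimal shared-memory tiled SGEMM whose tile arrays are padded by one column so that column accesses land in different shared-memory banks. The kernel name, tile size, and bounds handling below are illustrative assumptions, not the repository's actual code.

```cuda
#include <cuda_runtime.h>

// Illustrative tile size; the repository's kernels may use different values.
#define TILE 32

// Minimal tiled SGEMM sketch: C = A * B for square N x N row-major matrices.
// The "+ 1" padding on the second dimension shifts each row of the tile into
// a different shared-memory bank, removing the 32-way bank conflicts that
// column-wise reads of an unpadded [TILE][TILE] array would cause (stage 3).
__global__ void sgemm_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE + 1];
    __shared__ float Bs[TILE][TILE + 1];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N; t += TILE) {
        // Stage one tile of A and one tile of B into shared memory, so each
        // element loaded from global memory is reused TILE times (stage 2).
        As[threadIdx.y][threadIdx.x] =
            (row < N && t + threadIdx.x < N) ? A[row * N + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < N && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = acc;
}
```

Stage 4 extends this loop by prefetching tile `t + TILE` into a second buffer while the current tile is being consumed, overlapping global-memory latency with computation.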
## What is inside the repository
| Surface | Purpose |
|---|---|
| `src/` | CUDA kernels, benchmark entry point, and utilities |
| `tests/` | Google Test verification against cuBLAS |
| `docs/` | Learning-oriented technical documentation |
| `openspec/` | Stable requirements, workflow, and change history |
## How to use it
1. Build and run
   ```sh
   cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
   cmake --build build -j$(nproc)
   ./build/bin/sgemm_benchmark -a
   ctest --test-dir build
   ```
2. Follow the learning route
3. Inspect the project rules
## Validation boundary
- Local GPU machine: runtime verification and benchmarking
- Hosted CI: formatting, compilation, OpenSpec/repository checks, and Pages
That split keeps the repository honest without pretending GitHub-hosted runners can replace a real CUDA runtime environment.
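Compile-time guarding is also what makes the ladder's final stage safe to build on machines without Tensor Cores: WMMA fragments require sm_70 or newer, so the Tensor Core path can be compiled only for architectures that support it. The kernel name and single-tile shape below are illustrative assumptions, not the repository's actual API.

```cuda
#include <mma.h>
using namespace nvcuda;

// Sketch of a guarded WMMA path: computes one 16x16x16 tile of C += A * B.
// 16x16x16 is the canonical WMMA tile shape for half-precision inputs with
// a float accumulator. The body compiles only for sm_70 and newer.
__global__ void gemm_wmma_tile(const half* A, const half* B, float* C,
                               int lda, int ldb, int ldc) {
#if __CUDA_ARCH__ >= 700
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, A, lda);
    wmma::load_matrix_sync(b, B, ldb);
    wmma::mma_sync(c, a, b, c);
    wmma::store_matrix_sync(C, c, ldc, wmma::mem_row_major);
#endif
    // On older architectures a host-side dispatcher would launch one of the
    // earlier non-WMMA kernels instead: the guarded fallback path of stage 5.
}
```

Hosted CI can still compile this kernel for a Tensor Core target; only the runtime verification against cuBLAS needs the local GPU machine.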
## Explore next
- 📘 Benchmark results
- 🏗️ Architecture overview
- ⭐ View on GitHub