SGEMM Optimization

From readable baseline code to Tensor Core WMMA



Why this project is useful

This repository is designed to be a compact CUDA GEMM learning and reference project:

  • Progressive: five kernel variants show what each optimization step changes
  • Verifiable: every kernel is checked against cuBLAS
  • Practical: benchmark and test entry points are already wired up
  • Maintainable: repository rules, workflow, and validation are documented through OpenSpec
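The cuBLAS check mentioned above can be sketched as follows. This is an illustrative sketch, not the repository's actual test code: the function name, tolerance, and pointer naming are assumptions.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hedged sketch: compute a cuBLAS reference result on the same device
// inputs, then compare a candidate kernel's output element-wise.
// Names here are illustrative, not the repo's API.
bool matches_cublas(const float* dA, const float* dB, const float* dC_kernel,
                    int M, int N, int K, float tol = 1e-3f) {
    float* dC_ref;
    cudaMalloc(&dC_ref, sizeof(float) * M * N);
    const float alpha = 1.0f, beta = 0.0f;

    cublasHandle_t handle;
    cublasCreate(&handle);
    // cuBLAS is column-major: requesting B*A with swapped dimensions yields
    // C^T in column-major order, i.e. row-major C, with no explicit transpose.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, M, K, &alpha, dB, N, dA, K, &beta, dC_ref, N);
    cublasDestroy(handle);

    std::vector<float> h_kernel(M * N), h_ref(M * N);
    cudaMemcpy(h_kernel.data(), dC_kernel, sizeof(float) * M * N,
               cudaMemcpyDeviceToHost);
    cudaMemcpy(h_ref.data(), dC_ref, sizeof(float) * M * N,
               cudaMemcpyDeviceToHost);
    cudaFree(dC_ref);

    // Relative comparison, guarded against tiny reference values.
    for (int i = 0; i < M * N; ++i) {
        float denom = fmaxf(fabsf(h_ref[i]), 1.0f);
        if (fabsf(h_kernel[i] - h_ref[i]) / denom > tol) return false;
    }
    return true;
}
```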

Optimization ladder

| Stage | Kernel | What you learn |
|-------|--------|----------------|
| 1 | Naive | Thread-to-output mapping and baseline cost |
| 2 | Tiled | Shared-memory blocking and data reuse |
| 3 | Bank-Free | Padding away 32-way bank conflicts |
| 4 | Double Buffer | Latency hiding through staged tiles |
| 5 | Tensor Core | WMMA usage with a guarded fallback path |
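Stage 1's thread-to-output mapping can be sketched in a few lines. This is a generic naive SGEMM, not necessarily the repository's exact kernel; the row-major layout and parameter names are assumptions.

```cuda
// Hedged sketch of a stage-1 "Naive" kernel: each thread computes exactly
// one element of C, reading a full row of A and column of B from global
// memory with no reuse. This is the baseline the later stages improve on.
__global__ void sgemm_naive(int M, int N, int K,
                            float alpha, float beta,
                            const float* A,   // M x K, row-major (assumed)
                            const float* B,   // K x N, row-major (assumed)
                            float* C) {       // M x N, row-major (assumed)
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}
```

Each later stage changes how this loop feeds data: stage 2 stages tiles of A and B through shared memory, and stage 3 pads the shared tile (e.g. `__shared__ float tile[TS][TS + 1];`) so column accesses no longer hit the same bank.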

What is inside the repository

| Surface | Purpose |
|---------|---------|
| src/ | CUDA kernels, benchmark entry point, and utilities |
| tests/ | Google Test verification against cuBLAS |
| docs/ | Learning-oriented technical documentation |
| openspec/ | Stable requirements, workflow, and change history |

How to use it

1. Build and run

```shell
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
./build/bin/sgemm_benchmark -a
ctest --test-dir build
```

2. Follow the learning route

3. Inspect the project rules


Validation boundary

  • Local GPU machine: runtime verification and benchmarking
  • Hosted CI: formatting, compilation, OpenSpec/repository checks, and Pages

That split keeps the repository honest: GitHub-hosted runners cannot replace a real CUDA runtime environment, so runtime checks stay on the GPU machine.

