CUDA SGEMM ENGINEERING NOTEBOOK
A bilingual CUDA SGEMM case study built for two outcomes: solid learning depth and strong interview storytelling. Every optimization step is tied to correctness constraints, benchmark evidence, and explicit validation boundaries.
Get from clone to benchmark execution with clear local-vs-CI expectations.
Understand what each stage changes in memory behavior and performance profile.
Use a concise storyline from architecture choices to measurable outcomes.
Trace implementation choices to official docs, papers, and high-quality repos.
What differentiates this repository from many SGEMM demos, with proof-oriented framing.
A practical script for explaining architecture, benchmark trust, and trade-offs under pressure.
Curated papers, official docs, and repositories mapped to concrete design decisions.
A diagnosis loop for bottleneck classification, hypothesis design, and measurable experiments.
Architecture-specific tuning priorities for Volta, Turing, Ampere, Ada, and Hopper.
Coalescing, shared-memory banks, occupancy hints, and a profiler-oriented reading checklist.
# Build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
# Validate
ctest --test-dir build
openspec validate --all
# Benchmark
./build/bin/sgemm_benchmark -a
./build/bin/sgemm_benchmark --dims 256 384 640
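To make "correctness constraints" and "benchmark evidence" concrete, here is a minimal C++ sketch of a CPU reference, an error metric, and the standard 2*M*N*K operation count for SGEMM. The helper names (`sgemm_reference`, `max_abs_diff`, `sgemm_flops`, `gflops`) are hypothetical and not taken from this repository. Reading `--dims 256 384 640` as M, N, K with row-major storage is also an assumption, since the commands above do not define the order.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference, not the repository's code.
// Row-major C = alpha * A * B + beta * C, with A (M x K), B (K x N), C (M x N).
void sgemm_reference(std::size_t M, std::size_t N, std::size_t K,
                     float alpha, const std::vector<float>& A,
                     const std::vector<float>& B, float beta,
                     std::vector<float>& C) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            double acc = 0.0;  // accumulate in double so the reference stays stable
            for (std::size_t k = 0; k < K; ++k) {
                acc += static_cast<double>(A[i * K + k]) * B[k * N + j];
            }
            C[i * N + j] = static_cast<float>(alpha * acc + beta * C[i * N + j]);
        }
    }
}

// Largest element-wise absolute difference, e.g. GPU output vs. reference.
float max_abs_diff(const std::vector<float>& x, const std::vector<float>& y) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < x.size() && i < y.size(); ++i) {
        worst = std::max(worst, std::fabs(x[i] - y[i]));
    }
    return worst;
}

// One multiply and one add per (i, j, k) triple.
double sgemm_flops(std::size_t M, std::size_t N, std::size_t K) {
    return 2.0 * static_cast<double>(M) * static_cast<double>(N) * static_cast<double>(K);
}

// Throughput in GFLOP/s for a measured kernel time in seconds.
double gflops(double flops, double seconds) {
    return flops / seconds / 1e9;
}
```

Under the M, N, K reading, `sgemm_flops(256, 384, 640)` is 125,829,120 operations, so a kernel that finishes in 1 ms would come out at roughly 125.8 GFLOP/s. The tolerance to apply to `max_abs_diff` depends on K and the input range, so treat any threshold as a project-specific choice.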