CUDA SGEMM ENGINEERING NOTEBOOK

SGEMM Optimization Lab

A bilingual CUDA SGEMM case study built for two outcomes: solid learning depth and strong interview storytelling. Every optimization step is tied to correctness constraints, benchmark evidence, and explicit validation boundaries.


cuBLAS-verified | OpenSpec-governed | EN / ZH mirrored

Kernel Ladder

naive -> tiled -> bank-conflict-free -> double-buffered -> WMMA

Correctness Oracle

cuBLAS

separate tolerances for FP32 and Tensor Core paths

Validation Boundary

CI + GPU

hosted CI for build health, local GPU for runtime and performance

Public Surfaces

EN / 中文

mirrored pages for tutorial, interview, and references

Benchmark Scope

End-to-end and compute-only WMMA are reported separately.
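To make the two scopes concrete, the sketch below computes throughput from the same FLOP count over two different timing windows. It is a minimal illustration, not the repository's benchmark code: the helper names `gemm_flops` and `gemm_gflops` are hypothetical, and it assumes the usual 2·M·N·K FLOP count for GEMM and that the end-to-end window includes work beyond the kernel itself (for example, input conversion).

```cpp
#include <cassert>
#include <cstdint>

// GEMM performs one multiply and one add per (m, n, k) triple: 2*M*N*K FLOPs.
inline double gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Throughput in GFLOP/s for a timing window given in milliseconds.
// FLOPs / (ms * 1e-3 s) / 1e9 simplifies to FLOPs / (ms * 1e6).
inline double gemm_gflops(std::int64_t m, std::int64_t n, std::int64_t k,
                          double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return gemm_flops(m, n, k) / (elapsed_ms * 1.0e6);
}
```

Calling `gemm_gflops` once with the kernel-only time and once with the full-pipeline time yields two numbers for the same problem size. Reporting both, each with its scope label, prevents a compute-only figure from being read as an end-to-end one.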

Numerical Policy

FP32 and Tensor Core paths use different tolerance budgets by design.
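A tolerance policy of this kind can be sketched as a combined absolute and relative check against the reference output. The code below is an assumption-laden illustration: the `Tolerance` struct, the `all_close` helper, and the numeric budgets are placeholders chosen for the example, not the values this project uses. The Tensor Core budget is looser because such paths typically take reduced-precision inputs such as FP16.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Tolerance {
    float atol;  // absolute term
    float rtol;  // relative term, scaled by |reference|
};

// Placeholder budgets for illustration only; not the repository's values.
constexpr Tolerance kFp32Tolerance{1e-5f, 1e-4f};
constexpr Tolerance kTensorCoreTolerance{1e-2f, 1e-2f};

// Returns true when every element satisfies
// |got - ref| <= atol + rtol * |ref|.
inline bool all_close(const std::vector<float>& got,
                      const std::vector<float>& ref, Tolerance tol) {
    assert(got.size() == ref.size());
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const float diff = std::fabs(got[i] - ref[i]);
        // Written as a negated <= so that a NaN difference also fails.
        if (!(diff <= tol.atol + tol.rtol * std::fabs(ref[i]))) {
            return false;
        }
    }
    return true;
}
```

With these placeholder budgets, an error of 0.004 on a reference value of 1.0 fails the FP32 check but passes the Tensor Core check, which is the kind of separation the policy describes.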

Engineering Contract

Unified launcher signature keeps kernels swappable and testable.
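The benefit of a shared signature is easiest to see in a small host-side sketch. Everything named here is hypothetical: `SgemmLauncher`, `KernelEntry`, `run_kernel`, and the row-major layout are assumptions made for illustration, and the two CPU functions merely stand in for GPU kernel launchers. The repository's actual signature may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical launcher signature: C = alpha * A * B + beta * C, row-major,
// with A of size MxK, B of size KxN, and C of size MxN.
using SgemmLauncher = void (*)(const float* a, const float* b, float* c,
                               int m, int n, int k, float alpha, float beta);

// CPU stand-in for a straightforward kernel.
inline void reference_sgemm(const float* a, const float* b, float* c,
                            int m, int n, int k, float alpha, float beta) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = alpha * acc + beta * c[i * n + j];
        }
}

// CPU stand-in for a kernel that walks the K dimension in blocks.
inline void k_blocked_sgemm(const float* a, const float* b, float* c,
                            int m, int n, int k, float alpha, float beta) {
    constexpr int kBlock = 2;  // tiny on purpose so small tests cross a block edge
    for (int i = 0; i < m * n; ++i) c[i] *= beta;
    for (int p0 = 0; p0 < k; p0 += kBlock) {
        const int p_end = std::min(p0 + kBlock, k);
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (int p = p0; p < p_end; ++p) acc += a[i * k + p] * b[p * n + j];
                c[i * n + j] += alpha * acc;
            }
    }
}

struct KernelEntry {
    const char* name;
    SgemmLauncher launch;
};

// Because every entry shares one signature, the harness needs no
// kernel-specific code to run or compare them.
const KernelEntry kKernels[] = {
    {"reference", reference_sgemm},
    {"k_blocked", k_blocked_sgemm},
};

inline std::vector<float> run_kernel(const KernelEntry& entry,
                                     const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     int m, int n, int k) {
    assert(a.size() == static_cast<std::size_t>(m) * k);
    assert(b.size() == static_cast<std::size_t>(k) * n);
    std::vector<float> c(static_cast<std::size_t>(m) * n, 0.0f);
    entry.launch(a.data(), b.data(), c.data(), m, n, k, 1.0f, 0.0f);
    return c;
}
```

A test or benchmark can then loop over `kKernels` and feed every entry the same inputs. In this sketch, adding a new kernel means adding one table row, not touching the harness.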

Governance

OpenSpec keeps docs, process, and implementation intent aligned.

Why this repository is worth attention

Learning Depth

Progressive

Each kernel stage teaches one specific performance concept.

Evidence Model

Traceable

Speedup claims are attached to correctness checks and scope labels.

Interview Utility

Practical

The project can be explained as a clear engineering decision chain.

Community Value

Reusable

Includes playbooks, references, and architecture-aware tuning guidance.

Choose your route

Build and run quickly

Get from clone to benchmark execution with clear local-vs-CI expectations.

Getting Started | Benchmark Results

Learn the optimization ladder

Understand what each stage changes in memory behavior and performance profile.

Learning Path | Kernel Series

Prepare interview narrative

Use a concise storyline from architecture choices to measurable outcomes.

Project Highlights | Interview Playbook

Validate technical lineage

Trace implementation choices to official docs, papers, and high-quality repos.

References | Optimization Playbook

Knowledge hub

Project Highlights

What sets this repository apart from typical SGEMM demos, framed around verifiable evidence.

Interview Playbook

A practical script for explaining architecture, benchmark trust, and trade-offs under pressure.

References

Curated papers, official docs, and repositories mapped to concrete design decisions.

Optimization Playbook

A diagnosis loop for bottleneck classification, hypothesis design, and measurable experiments.

Performance Casebook

Architecture-specific tuning priorities for Volta, Turing, Ampere, Ada, and Hopper.

CUDA Memory Cheat Sheet

Coalescing, shared-memory banks, occupancy hints, and a profiler-oriented reading checklist.
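As a taste of the shared-memory bank topic, the host-side sketch below models why padding a tile removes bank conflicts. It assumes 32 banks of 4-byte words and a 32-thread warp, which holds for the architectures listed above. The helper names are hypothetical and the model only counts banks; it does not run on a GPU.

```cpp
#include <cassert>
#include <set>

constexpr int kNumBanks = 32;  // shared-memory banks, one 4-byte word wide each
constexpr int kWarpSize = 32;  // threads per warp

// Bank that holds the 4-byte word at a given word index.
inline int bank_of(int word_index) { return word_index % kNumBanks; }

// Number of distinct banks touched when each lane of a warp reads one float
// from the same column of a row-major tile whose rows are `row_stride`
// floats apart. Fewer distinct banks means more serialized accesses.
inline int distinct_banks_for_column_read(int row_stride, int column) {
    std::set<int> banks;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        banks.insert(bank_of(lane * row_stride + column));
    }
    return static_cast<int>(banks.size());
}
```

With a row stride of 32 floats, every lane lands in the same bank, so the column read is a 32-way conflict. Padding each row to 33 floats shifts successive rows by one bank, and the same read touches all 32 banks.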

Command cockpit

bash

# Build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Validate
ctest --test-dir build
openspec validate --all

# Benchmark
./build/bin/sgemm_benchmark -a
./build/bin/sgemm_benchmark --dims 256 384 640

Language and entry points

Chinese mirrored home: 中文首页 (Chinese home page)
Repository entry: README
OpenSpec source of truth: openspec/specs