Project Guide
This is the orientation surface for the SGEMM whitepaper site. Use it to understand the project's positioning, intended audience, and recommended reading order before entering the deeper sections.
What this project is
This repository is a CUDA SGEMM study organized around a five-stage kernel optimization ladder:
- Naïve FP32 — baseline cost model, no shared-memory reuse
- Tiled FP32 — shared-memory staging, arithmetic intensity rises with tile size
- Bank-Free FP32 — padding eliminates avoidable bank conflicts
- Double Buffer — overlapped staging and compute hides memory latency
- Tensor Core WMMA — hardware fragment accumulation, guarded by device capability and shape constraints
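The step from the first rung to the second can be made concrete with a back-of-the-envelope model. The sketch below is a host-side C++ illustration, not code from the repository: the function names `naive_intensity` and `tiled_intensity` are hypothetical, and the model is an assumption that counts only global-memory loads of A and B, ignoring caches, the write-back of C, and partial edge tiles.

```cpp
#include <cstddef>

// Idealized arithmetic-intensity model (FLOP per byte of global-memory traffic).
// Assumption: only loads of A and B are counted; caches, the C write-back,
// and edge tiles are ignored. These are illustrative helpers, not project code.

// Naive kernel: each fused multiply-add (2 FLOPs) reads one float of A
// and one float of B directly from global memory.
double naive_intensity() {
    const double flops = 2.0;
    const double bytes = 2.0 * sizeof(float);
    return flops / bytes;
}

// Tiled kernel with square tile x tile blocks: per tile step, a block loads
// 2 * tile * tile floats into shared memory and performs tile^3 FMAs on them.
double tiled_intensity(std::size_t tile) {
    const double t = static_cast<double>(tile);
    const double flops = 2.0 * t * t * t;
    const double bytes = 2.0 * t * t * sizeof(float);
    return flops / bytes;
}
```

Under these assumptions `naive_intensity()` is 0.25 FLOP/byte, while `tiled_intensity(16)` and `tiled_intensity(32)` are 4 and 8 FLOP/byte, which is the sense in which intensity rises with tile size. Real kernels will land below these figures.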
The goal is not to produce the fastest possible SGEMM implementation. The goal is to show how an optimization argument is built, bounded, and defended — in a form that is readable under interview pressure and auditable by an experienced CUDA engineer.
Who this site is for
| Reader | Best first page | Time |
|---|---|---|
| Interviewer auditing system clarity | Architecture Overview | 8 min |
| Candidate preparing a walkthrough | Academy Overview | 5 min, then follow Learning Path |
| CUDA learner starting fresh | Here, then Architecture, then Academy | Self-paced |
| Performance skeptic | Validation Overview | 12 min |
| Research-minded reader | Research Desk | Self-paced |
See Reader Map for a full depth-tiered navigation index.
How the site is structured
Each section has one primary job. This is deliberate: a page with two jobs is a page that does neither well.
| Section | Primary job | What it is not |
|---|---|---|
| Overview | Orientation and reading strategy | Not a replacement for the architecture section |
| Architecture | System map, bottlenecks, invariants | Not a code walkthrough |
| Academy | Ordered study of the optimization ladder | Not a reference manual |
| Validation | Correctness and benchmark trust boundary | Not a performance claim |
| Research | References, related work, evolution notes | Not an extended bibliography |
Fast reading plans
Reviewer path (20 min)
Candidate path (30 min)
Builder path (self-paced)
What makes this site a whitepaper, not a portfolio
Most project documentation describes what was built. This site argues why each architectural decision was made, what evidence constrains the claims, and where the reasoning stops.
That distinction matters when an interviewer asks: "Why does the bank-free kernel exist? What does it actually improve?" A portfolio answer is: "It's faster." A whitepaper answer is: "Shared-memory bank conflicts serialize access when multiple threads in a warp address different words in the same bank. Padding each tile row by one element changes the row stride, so consecutive elements of a column land in different banks and the multi-way conflict disappears. The improvement is real on conflict-prone shapes and measurable on the test hardware; it is not universal."
That is the register this site aims for across all five kernel stages, on both the architecture and validation surfaces.
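The padding argument in the whitepaper answer can be checked on the host with a small model. This is a minimal sketch, not the repository's kernel: `conflict_degree` is a hypothetical helper, and it assumes 32 shared-memory banks of 4-byte words and a 32-thread warp in which thread `t` reads `tile[t][col]` from a row-major float tile.

```cpp
#include <algorithm>
#include <array>

// Assumptions: 32 banks, each 4 bytes wide, and 32 threads per warp.
constexpr int kBanks = 32;
constexpr int kWarpSize = 32;

// Returns the largest number of threads in one warp that land in the same
// bank when thread t reads tile[t][col] from a row-major float tile whose
// rows are row_stride floats apart. A result of 1 means conflict-free.
int conflict_degree(int row_stride, int col) {
    std::array<int, kBanks> hits{};  // zero-initialized hit count per bank
    for (int t = 0; t < kWarpSize; ++t) {
        const int word = t * row_stride + col;  // 4-byte word index in the tile
        ++hits[word % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

With an unpadded stride, `conflict_degree(32, 0)` is 32, because every thread in the warp maps to the same bank. With one float of padding per row, `conflict_degree(33, 0)` is 1. The model only describes this column-wise access pattern; row-wise reads are conflict-free in both layouts, which is why the benefit depends on the access shape.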