Project Guide

This is the orientation surface for the SGEMM whitepaper site. Use it to understand the project's positioning, intended audience, and recommended reading order before entering the deeper sections.

What this project is

This repository is a CUDA SGEMM study organized around a five-stage kernel optimization ladder:

1. Naïve FP32: baseline cost model, no shared-memory reuse
2. Tiled FP32: shared-memory staging; arithmetic intensity rises with tile size
3. Bank-Free FP32: padding eliminates avoidable shared-memory bank conflicts
4. Double Buffer: overlapping staging with compute hides memory latency
5. Tensor Core WMMA: hardware fragment accumulation, guarded by device capability and shape constraints
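
The claim that arithmetic intensity rises with tile size can be sketched with a back-of-envelope model. The functions below are hypothetical helpers, not code from this repository. They assume 4-byte floats, square tiles, no cache hits, and they count only global-memory loads of A and B (stores of C are ignored).

```cpp
// Simplified cost model: FLOPs per byte of global-memory traffic.
// Assumptions (not taken from the repository): sizeof(float) == 4,
// square tiles, caches ignored, only loads of A and B are counted.

// Naive kernel: per inner-product step, one thread loads one float of A
// and one float of B (8 bytes) and performs one multiply and one add.
constexpr double naive_intensity() {
    return 2.0 / (2 * sizeof(float));
}

// Tiled kernel: per tile step, a block loads a tile x tile slab of A and
// of B into shared memory and performs tile^3 multiply-adds on them.
constexpr double tiled_intensity(int tile) {
    double flops = 2.0 * tile * tile * tile;           // multiply + add
    double bytes = 2.0 * tile * tile * sizeof(float);  // A slab + B slab
    return flops / bytes;                              // simplifies to tile / 4
}
```

Under these assumptions the naive kernel sits at 0.25 FLOP/byte, while `tiled_intensity(16)` gives 4 and `tiled_intensity(32)` gives 8. The model only bounds the argument; measured behavior depends on caches, occupancy, and the hardware under test.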

The goal is not to produce the fastest possible SGEMM implementation. The goal is to show how an optimization argument is built, bounded, and defended, in a form that is readable under interview pressure and auditable by an experienced CUDA engineer.

Who this site is for

Reader	Best first page	Time
Interviewer auditing system clarity	Architecture Overview	8 min
Candidate preparing a walkthrough	Academy Overview	5 min, then follow Learning Path
CUDA learner starting fresh	Here, then Architecture, then Academy	Self-paced
Performance skeptic	Validation Overview	12 min
Research-minded reader	Research Desk	Self-paced

See Reader Map for a full depth-tiered navigation index.

How the site is structured

Each section has one primary job. This is deliberate: a page with two jobs is a page that does neither well.

Section	Primary job	What it is not
Overview	Orientation and reading strategy	Not a replacement for the architecture section
Architecture	System map, bottlenecks, invariants	Not a code walkthrough
Academy	Ordered study of the optimization ladder	Not a reference manual
Validation	Correctness and benchmark trust boundary	Not a performance claim
Research	References, related work, evolution notes	Not an extended bibliography

Fast reading plans

Reviewer path (20 min)

Candidate path (30 min)

Builder path (self-paced)

What makes this site a whitepaper, not a portfolio

Most project documentation describes what was built. This site argues why each architectural decision was made, what evidence constrains the claims, and where the reasoning stops.

That distinction matters when an interviewer asks: "Why does the bank-free kernel exist? What does it actually improve?" A portfolio answer is: "It's faster." A whitepaper answer is: "Shared-memory bank conflicts serialize access when multiple threads in a warp address different words in the same bank. Padding each tile row by one element offsets successive rows by one bank, so the elements of a column fall in different banks and the multi-way conflict disappears. The improvement is real on conflict-prone shapes and measurable on the test hardware; it is not universal."

That is the register this site aims for across all five kernel stages and across both the architecture and validation surfaces.
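
The padding argument in the whitepaper answer can be checked with a small host-side model of the bank mapping. This is a sketch under stated assumptions, not the repository's kernel: it assumes 32 banks that are each one 4-byte word wide, a 32-thread warp, and a column-wise access in which thread `t` reads `tile[t][col]` from a row-major tile. The name `max_conflict_degree` is hypothetical.

```cpp
// Host-side model of shared-memory bank mapping (assumed: 32 banks,
// 4-byte bank width, 32 threads per warp). Not the repository's code.
constexpr int kBanks = 32;
constexpr int kWarp = 32;

// Worst-case number of threads in one warp that land in the same bank
// when thread t reads tile[t][col] from a row-major tile whose rows are
// row_stride floats apart.
constexpr int max_conflict_degree(int row_stride, int col = 0) {
    int hits[kBanks] = {};
    int worst = 0;
    for (int t = 0; t < kWarp; ++t) {
        int bank = (t * row_stride + col) % kBanks;  // word index -> bank
        ++hits[bank];
        if (hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}
```

In this model a row stride of 32 floats puts every thread in the same bank (a 32-way conflict), while a padded stride of 33 floats spreads the 32 threads across 32 distinct banks. Row-wise accesses, where consecutive threads read consecutive floats, are conflict-free either way, which is one reason the improvement is not universal.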
