Project Guide

This is the orientation surface for the SGEMM whitepaper site. Use it to understand the project's positioning, intended audience, and recommended reading order before entering the deeper sections.

What this project is

This repository is a CUDA SGEMM study organized around a five-stage kernel optimization ladder:

  1. Naïve FP32: baseline cost model, no shared-memory reuse
  2. Tiled FP32: shared-memory staging; arithmetic intensity rises with tile size
  3. Bank-Free FP32: padding eliminates avoidable bank conflicts
  4. Double Buffer: overlapping staging with compute hides memory latency
  5. Tensor Core WMMA: hardware fragment accumulation, guarded by device capability and shape constraints
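
The claim in stage 2 that arithmetic intensity rises with tile size can be checked with a back-of-the-envelope model. The sketch below is a host-side C++ illustration, not code from this repository. It assumes square T×T tiles, counts only global-memory loads of A and B, and ignores caches and the write of C. The function names are hypothetical.

```cpp
// Idealized arithmetic-intensity model (assumption: FP32 operands, 4 bytes each).
constexpr double kBytesPerFloat = 4.0;

// Naive kernel: every multiply-add (2 FLOPs) loads one element of A and one of B.
double naive_intensity() {
    const double flops = 2.0;
    const double bytes = 2.0 * kBytesPerFloat;
    return flops / bytes;
}

// Tiled kernel: per tile step a block loads one T x T tile of A and one of B,
// then performs T multiply-adds for each of its T x T outputs.
double tiled_intensity(int tile) {
    const double t = static_cast<double>(tile);
    const double flops = 2.0 * t * t * t;
    const double bytes = 2.0 * t * t * kBytesPerFloat;
    return flops / bytes;  // simplifies to T / 4 FLOPs per byte
}
```

Under this model `naive_intensity()` is 0.25 FLOPs per byte, while `tiled_intensity(16)` is 4 and `tiled_intensity(32)` is 8. Intensity grows linearly with the tile edge until shared-memory capacity or occupancy becomes the limit, which the model does not capture.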

The goal is not to produce the fastest possible SGEMM implementation. The goal is to show how an optimization argument is built, bounded, and defended — in a form that is readable under interview pressure and auditable by an experienced CUDA engineer.

Who this site is for

| Reader | Best first page | Time |
|---|---|---|
| Interviewer auditing system clarity | Architecture Overview | 8 min |
| Candidate preparing a walkthrough | Academy Overview | 5 min, then follow Learning Path |
| CUDA learner starting fresh | Here, then Architecture, then Academy | Self-paced |
| Performance skeptic | Validation Overview | 12 min |
| Research-minded reader | Research Desk | Self-paced |

See Reader Map for a full depth-tiered navigation index.

How the site is structured

Each section has one primary job. This is deliberate: a page with two jobs is a page that does neither well.

| Section | Primary job | What it is not |
|---|---|---|
| Overview | Orientation and reading strategy | Not a replacement for the architecture section |
| Architecture | System map, bottlenecks, invariants | Not a code walkthrough |
| Academy | Ordered study of the optimization ladder | Not a reference manual |
| Validation | Correctness and benchmark trust boundary | Not a performance claim |
| Research | References, related work, evolution notes | Not an extended bibliography |

Fast reading plans

Reviewer path (20 min)

  1. Architecture Overview
  2. Kernel Ladder
  3. Validation Overview
  4. Related Projects

Candidate path (30 min)

  1. Academy Overview
  2. Learning Path
  3. Diagnosis Loop
  4. Evolution Notes

Builder path (self-paced)

  1. Getting Started
  2. System Blueprint
  3. Correctness Policy
  4. Benchmark Scope
  5. Curated References

What makes this site a whitepaper, not a portfolio

Most project documentation describes what was built. This site argues why each architectural decision was made, what evidence constrains the claims, and where the reasoning stops.

That distinction matters when an interviewer asks: "Why does the bank-free kernel exist? What does it actually improve?" A portfolio answer is: "It's faster." A whitepaper answer is: "Shared-memory bank conflicts serialize access when multiple threads in a warp address different words in the same bank. Padding each tile row by one element staggers the elements of a column across different banks, eliminating the multi-way conflict. The improvement is real on conflict-prone shapes and measurable on the test hardware; it is not universal."
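
The padding argument can be made concrete with a small host-side model of the bank mapping. This is a sketch under stated assumptions, not the repository's kernel. It assumes 32 banks of 4-byte words, a 32-thread warp, and a row-major float tile in which thread t reads the element in row t of a fixed column. The function name `column_read_conflict_degree` is hypothetical.

```cpp
#include <algorithm>
#include <array>

// Assumptions: 32 shared-memory banks, one 4-byte word per bank slot, 32-thread warp.
constexpr int kBanks = 32;
constexpr int kWarpSize = 32;

// Worst-case number of threads in one warp that land in the same bank when
// thread t reads tile[t][col] from a row-major float tile whose rows are
// `row_stride` floats apart. Each thread reads a distinct address.
int column_read_conflict_degree(int row_stride, int col) {
    std::array<int, kBanks> hits{};  // zero-initialized hit count per bank
    for (int t = 0; t < kWarpSize; ++t) {
        const int word_index = t * row_stride + col;
        ++hits[word_index % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

With an unpadded stride of 32 floats, every thread maps to bank `col`, so the function returns 32 (a 32-way conflict). With a padded stride of 33, thread t maps to bank `(t + col) % 32`, so the function returns 1. Row-wise reads are conflict-free at either stride, which is why the benefit depends on the access pattern.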

That is the register this site aims for across all five kernel stages and across both the architecture and validation surfaces.

MIT Licensed