Architecture Overview

This section is the canonical map of the SGEMM system: why the design exists, how data moves, when each kernel strategy appears, and where Tensor Core acceleration is allowed to take over.

Why this design exists

This repository is not organized as “one fast kernel plus a benchmark screenshot.” It is organized as an engineering reasoning chain:

  • start from a readable FP32 baseline
  • expose the next bottleneck instead of hiding it
  • add one architectural idea at a time
  • keep correctness and benchmark scope explicit
  • preserve a safe path when Tensor Core constraints are not met

That structure makes the project useful for learning, review, and interviews: readers can explain why a kernel exists before they talk about how fast it is.

What this repository is trying to prove

  • SGEMM optimization should read as a reasoning chain, not a bag of isolated tricks.
  • Performance claims only count when they stay attached to correctness policy and benchmark scope.
  • Tensor Core acceleration is only persuasive when constraints and fallback behavior are explicit.

System map

| Layer | Responsibility | Where to go next |
| --- | --- | --- |
| Kernel ladder | Explains the optimization chain from naïve FP32 to WMMA | Kernel Ladder |
| Memory flow | Explains global-memory access, shared-memory reuse, bank conflicts, and double buffering as one data-movement story | Memory Flow |
| Tensor Core path | Explains WMMA selection, FP32→FP16 staging, shape guards, and fallback behavior | Tensor Core Path |
| Deep kernel pages | Explain each kernel implementation in isolation | Naïve, Tiled, Bank-Free, Double Buffer, Tensor Core WMMA |

Architectural decisions that shape the repository

1. Optimization is presented as a ladder, not a bag of tricks

Each kernel solves a specific bottleneck class:

  1. Naïve establishes the cost model and exposes poor reuse.
  2. Tiled trades extra coordination for shared-memory reuse.
  3. Bank-Free stabilizes shared-memory access by padding away avoidable conflicts.
  4. Double Buffer overlaps staging with compute to hide part of the memory latency.
  5. Tensor Core raises the throughput ceiling, but only under explicit device and shape constraints.

The goal is not “every later kernel must beat every earlier kernel on every GPU.” The goal is that each step has a clear reason to exist and a measurable architectural effect.
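As a concrete anchor for the first rung, here is a minimal sketch of a naïve SGEMM kernel, assuming row-major matrices and an illustrative signature (the repository's actual entry points may differ):

```cuda
// Naïve SGEMM sketch: one thread per output element, row-major layout.
__global__ void sgemm_naive(int M, int N, int K,
                            const float *A, const float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        // Every thread re-reads a full row of A and a full column of B
        // from global memory: correct, but with essentially no data reuse.
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```

This is the cost model the later rungs attack: the arithmetic is already minimal, so everything that follows is about moving the same data fewer times.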

2. Data movement is the main system story

SGEMM performance here is framed around where data lives and when it moves:

  • from global memory into the SM
  • from global memory into shared tiles
  • from shared memory into registers or WMMA fragments
  • from staged tiles back into the output matrix C

That is why the architecture section treats memory flow as a first-class topic instead of leaving it scattered across per-kernel notes.
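To make the staging story concrete, here is a minimal sketch of the tiled rung, assuming a 32×32 tile and illustrative names rather than the repository's actual kernels. The `+ 1` padding column is the Bank-Free idea; the Double Buffer rung extends this pattern by keeping two copies of each tile and prefetching tile `t + TILE` while computing on tile `t`.

```cuda
#define TILE 32

// Tiled SGEMM sketch: stage tiles of A and B into shared memory,
// then reuse each staged element TILE times before the next load.
__global__ void sgemm_tiled(int M, int N, int K,
                            const float *A, const float *B, float *C) {
    // The +1 column skews shared-memory addresses so that column-wise
    // reads no longer hit the same bank (the Bank-Free padding trick).
    __shared__ float As[TILE][TILE + 1];
    __shared__ float Bs[TILE][TILE + 1];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Global -> shared: one coalesced load per thread, zero-padded at edges.
        As[threadIdx.y][threadIdx.x] =
            (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < K && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        // Shared -> registers: the reuse that the naïve kernel lacks.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N) C[row * N + col] = acc;
}
```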

3. Tensor Core is an optional fast path, not the only path

The repository exposes both:

  • a safe FP32 entry path that may convert inputs and fall back when WMMA is unsupported
  • a pure compute-only WMMA path used to measure raw Tensor Core behavior under friendly shapes

This keeps benchmark claims honest. Unsupported dimensions are not silently reported as Tensor Core wins.
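As a sketch of how such a guard can look, the fragment below assumes the common 16×16×16 WMMA fragment shape and hypothetical launcher names (`launch_sgemm_wmma` and `launch_sgemm_fp32` are placeholders, not the repository's real API):

```cuda
#include <cuda_runtime.h>

// Hypothetical launchers; placeholders for the repository's real entry points.
void launch_sgemm_wmma(int M, int N, int K, const float *A, const float *B, float *C);
void launch_sgemm_fp32(int M, int N, int K, const float *A, const float *B, float *C);

// WMMA requires Volta-class (sm_70+) hardware, and with a 16x16x16
// fragment shape every problem dimension must be a multiple of 16.
bool wmma_eligible(int M, int N, int K) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;
    bool device_ok = prop.major >= 7;
    bool shape_ok  = (M % 16 == 0) && (N % 16 == 0) && (K % 16 == 0);
    return device_ok && shape_ok;
}

void sgemm(int M, int N, int K, const float *A, const float *B, float *C) {
    if (wmma_eligible(M, N, K)) {
        // FP32 inputs are staged to FP16 inside the WMMA path;
        // accumulation stays in FP32 fragments.
        launch_sgemm_wmma(M, N, K, A, B, C);
    } else {
        // Safe path: plain FP32 kernel, no silent Tensor Core claim.
        launch_sgemm_fp32(M, N, K, A, B, C);
    }
}
```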

4. Validation boundaries are part of the architecture

The project deliberately separates what can be trusted in different environments:

| Area | Local CUDA GPU | Hosted CI |
| --- | --- | --- |
| CUDA compilation | Yes | No |
| Runtime correctness | Yes | No |
| Benchmark performance | Yes | No |
| Docs, OpenSpec, and repository integrity | Yes | Yes |
| Pages buildability | Optional | Yes |

This is not just process documentation. It affects how the architecture is narrated: performance conclusions only count when they are tied back to the correct runtime environment.

Recommended reading order

  1. Start here for the system map.
  2. Read Kernel Ladder to understand the optimization chain.
  3. Read Memory Flow to understand the data-movement logic behind the ladder.
  4. Read Tensor Core Path before interpreting WMMA benchmark numbers.
  5. Use the existing kernel deep dives when you want implementation detail instead of system rationale.

Fast reviewer path

  1. Read this page for the system claim.
  2. Read Kernel Ladder for the optimization order.
  3. Read Validation Overview before trusting any benchmark claim.
  4. Read Methodology when you need the concise explanation path used in reviews or interviews.
  5. Use Resources Hub to trace external sources and comparison points.
