Architecture Overview

This section is the normative map of the SGEMM system. It records what each component does, how data moves through the memory hierarchy, when kernel selection logic fires, and where the system's invariants and boundaries lie. Read it before trusting any benchmark claim or interview statement about this project.

Kernel ladder diagram showing naive FP32, tiled FP32, bank-free FP32, double buffer, and Tensor Core WMMA with attached architecture, validation, and research rails.

The ladder is a map of bottleneck shifts, not a trophy rack. Each stage exists because the previous one exposed a new limit.

Technical thesis

SGEMM optimization on modern NVIDIA hardware is a sequence of bottleneck-class migrations. Moving from naïve FP32 to Tensor Core WMMA requires solving four distinct problems in order: DRAM saturation, shared-memory bank conflicts, staging–compute overlap, and WMMA hardware constraints. This repository structures its kernel implementations to surface one problem per stage, hold the prior stages fixed, and make the performance effect of each architectural decision independently observable.

Component inventory

Component	Layer	Primary responsibility
`src/main.cu`	Driver	Entry point that delegates CLI parsing and benchmark orchestration
`src/cli_parser.cuh`	Driver support	Parses mode flags, shapes, and runtime options into `BenchmarkConfig`
`src/benchmark_runner.cuh`	Driver support	Runs configured benchmark and verification flows through one binary
`src/kernels/naive_sgemm.cuh`	Kernel	FP32 baseline with full global-memory load cost
`src/kernels/tiled_sgemm.cuh`	Kernel	Cooperative tile load into shared memory; SMEM reuse
`src/kernels/bank_conflict_free_sgemm.cuh`	Kernel	Padding eliminates shared-memory bank conflicts
`src/kernels/double_buffer_sgemm.cuh`	Kernel	Overlaps next-tile staging with active compute
`src/kernels/tensor_core_sgemm.cuh`	Kernel	WMMA fragment accumulation on hardware-aligned tiles
`src/utils/cuda_utils.cuh`	Utility	CUDA error macros, RAII device-memory wrappers, device metadata
`src/utils/verify.cuh`	Utility	cuBLAS-backed correctness verification and tolerance policy
`tests/test_sgemm.cu`	Test	GPU-side correctness verification under cuBLAS reference tolerance

Memory-hierarchy data flow

Each kernel optimization step corresponds to a change in which memory level dominates access cost:

Naïve:       [Global memory]  → registers         (DRAM-bound)
Tiled:       [Global memory]  → SMEM → registers  (SMEM reuse, conflict risk)
Bank-Free:   [Global memory]  → SMEM+pad → regs   (conflict-free, latency exposed)
Dbl-Buffer:  [Global memory]  → double-SMEM → regs (staging hidden behind compute)
Tensor Core: [Global memory]  → SMEM → WMMA frags  (hardware-accelerated accumulation)

The memory-flow page makes this concrete: addresses, strides, tile dimensions, and the exact load patterns used at each stage.
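As a rough illustration of the "SMEM+pad" step, the host-side sketch below counts how many threads of a warp land on the same shared-memory bank when thread t reads element [t][col] of a row-major float tile. It assumes 32 banks of 4-byte words and a 32-thread warp, and compares an unpadded row stride of 32 floats with a padded stride of 33. The tile size, the padding amount, and the function name are assumptions made for the example, not values taken from the repository.

```cpp
#include <array>

constexpr int kBanks = 32;     // assumption: 32 banks, each 4 bytes wide
constexpr int kWarpSize = 32;  // threads per warp

// Worst-case number of threads in one warp that hit the same bank when
// thread t reads tile[t][col]; row_stride is the row length in floats.
int max_bank_conflict_degree(int row_stride, int col) {
    std::array<int, kBanks> hits{};  // zero-initialized per-bank counters
    int worst = 0;
    for (int t = 0; t < kWarpSize; ++t) {
        const int word = t * row_stride + col;  // 4-byte word index in the tile
        const int bank = word % kBanks;
        hits[bank] += 1;
        if (hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}
```

With a stride of 32 every thread maps to bank `col`, so the column read serializes 32 ways. With a stride of 33 the bank becomes `(t + col) % 32`, which is distinct for each thread, so the read is conflict-free.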

Design invariants

These properties are held constant across all kernel stages and are part of the architecture's correctness contract:

Row-major layout throughout. All matrices A, B, and C use row-major storage. No kernel silently assumes column-major order.
Float4 granularity for vectorized loads. Kernels that benefit from wider loads use float4 so that each load instruction moves 16 bytes instead of 4.
Fallback on unsatisfied constraints. The Tensor Core entry path falls back to the FP32 path when shape guards are not met. Benchmark numbers are never reported from a fallback-activated run.
Epsilon-bounded correctness. Test harnesses verify outputs against a cuBLAS reference with a per-element tolerance of 1e-3. Kernel correctness is not assumed; it is measured.
Timing outside CUDA graph bounds. Benchmark timing wraps the full device call including synchronization. Cold-start and warm-up behavior is documented per benchmark result.
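The row-major and epsilon-bounded invariants can be sketched on the host as follows. This is a minimal stand-in, not the project's verifier: the repository checks against cuBLAS, whereas this example uses a plain CPU reference, and it assumes the 1e-3 tolerance is an absolute per-element bound. The helper names `rm`, `reference_gemm`, and `within_tolerance` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major offset of element (row, col) in a matrix with `cols` columns.
inline std::size_t rm(std::size_t row, std::size_t col, std::size_t cols) {
    return row * cols + col;
}

// CPU reference C = A * B. A is M x K, B is K x N, C is M x N, all row-major.
std::vector<float> reference_gemm(const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  std::size_t M, std::size_t N, std::size_t K) {
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t j = 0; j < N; ++j)
                C[rm(i, j, N)] += A[rm(i, k, K)] * B[rm(k, j, N)];
    return C;
}

// Assumption: the tolerance is an absolute bound applied to every element.
// The negated comparison makes NaN outputs fail the check.
bool within_tolerance(const std::vector<float>& out,
                      const std::vector<float>& ref, float eps = 1e-3f) {
    if (out.size() != ref.size()) return false;
    for (std::size_t i = 0; i < out.size(); ++i)
        if (!(std::fabs(out[i] - ref[i]) <= eps)) return false;
    return true;
}
```

A kernel output would be copied back to the host and passed as `out`; a single element outside the bound fails the whole comparison.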

Kernel selection and fallback logic

The entry path in `src/main.cu` selects a kernel tier based on device capability queries and matrix dimension checks:

Condition	Path selected
Any GPU, any shape	FP32 ladder (naïve → double buffer)
SM ≥ 7.0, shape divisible by WMMA tile	Tensor Core WMMA path
SM ≥ 7.0, shape not WMMA-aligned	FP32 path (fallback)
SM < 7.0	FP32 path (fallback)

The pure benchmark invokes the Tensor Core kernel directly on a pre-validated shape. The safe entry path applies the runtime guards in the table above and falls back to the FP32 path when they are not met.
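The selection table can be expressed as a small pure function. The sketch below is illustrative only: the names `KernelPath`, `Shape`, and `select_path` are hypothetical, and the 16×16×16 WMMA tile is an assumption, since this page does not state the tile shape. In a CUDA build the compute capability would typically come from the `major` and `minor` fields of `cudaDeviceProp`, filled by `cudaGetDeviceProperties`.

```cpp
// Hypothetical names; not the repository's actual types.
enum class KernelPath { Fp32Ladder, TensorCoreWmma };

struct Shape {
    int m, n, k;  // C is m x n, reduction dimension is k
};

constexpr int kWmmaTile = 16;  // assumption: 16 x 16 x 16 WMMA fragments

// True when every dimension is positive and a multiple of the WMMA tile.
bool wmma_aligned(const Shape& s) {
    return s.m > 0 && s.n > 0 && s.k > 0 &&
           s.m % kWmmaTile == 0 && s.n % kWmmaTile == 0 && s.k % kWmmaTile == 0;
}

// Mirrors the table: Tensor Core only when SM >= 7.0 and the shape is
// WMMA-aligned; every other combination falls back to the FP32 ladder.
KernelPath select_path(int sm_major, int sm_minor, const Shape& s) {
    const bool has_tensor_cores = (sm_major * 10 + sm_minor) >= 70;
    if (has_tensor_cores && wmma_aligned(s)) return KernelPath::TensorCoreWmma;
    return KernelPath::Fp32Ladder;
}
```

Keeping the guard as a pure function of capability and shape makes the fallback rows of the table testable without a GPU.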

Architectural decisions

1. The ladder, not the bag of tricks

Each kernel solves one bottleneck class:

Naïve establishes the arithmetic-intensity bound and exposes DRAM saturation.
Tiled moves data into shared memory cooperatively, exposing bank conflicts as the next limit.
Bank-Free pads shared-memory arrays to remove those conflicts, exposing staging latency.
Double Buffer overlaps next-tile staging with active compute to reduce stall cycles.
Tensor Core uses hardware matrix multiply-accumulate on WMMA fragments under strict alignment and device constraints.

The point is that each stage has a single reason to exist and a single architectural effect that can be measured.

2. Validation as a first-class architectural concern

The project separates what two different environments can prove:

Claim	Local CUDA GPU	Hosted CI
Compilation succeeds	✓	✓
Output correctness vs. cuBLAS	✓	✗
Benchmark performance claims	✓	✗
Repository structure and docs	✓	✓
VitePress Pages buildability	✓	✓

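This is not just process hygiene. It determines which claims a reader can trust from a green CI status alone: compilation, repository structure, and docs buildability, but not output correctness or benchmark performance.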

3. Tensor Core as an explicit fast path

The FP32 ladder and the Tensor Core path are independent tiers. The repository exposes both so that:

Benchmark claims for WMMA are only made on aligned shapes with SM ≥ 7.0 devices.
The FP32 ladder remains a complete, self-contained teaching path that does not require Tensor Core hardware.
Fallback behavior is tested, not assumed.

System map and reading paths

Need	Go to
Full component and data-flow diagram	System Blueprint
Kernel-by-kernel explanation of bottleneck shifts	Kernel Ladder
Memory hierarchy and load-pattern analysis	Memory Flow
WMMA selection, shape guards, fallback	Tensor Core Path
Ordered teaching path, interview framing	Academy
Correctness policy and benchmark scope	Validation
External references and comparisons	Research

Fast reviewer path

This page: architectural thesis and invariants.
System Blueprint: full component inventory with data flow.
Validation Overview: what the evidence can and cannot prove.
Benchmark Results: numbers with scope attached.
Academy: the ordered explanation for interview defense.
