Kernel Specification

Version: 2.1.0 Last Updated: 2026-04-16 Status: Complete

Purpose

Define the functional requirements for SGEMM kernel implementations covering five progressive optimization techniques from naive triple-loop to Tensor Core WMMA acceleration.

Requirements

Requirement: Kernel Implementations

The project SHALL implement five CUDA SGEMM kernel variants with progressive optimization techniques.

Scenario: Five kernel variants available

  • WHEN a user builds the project
  • THEN five kernel implementations SHALL be available: Naive, Tiled, Bank-Free, Double-Buffer, and Tensor Core

Requirement: Correctness Verification

All kernels SHALL match cuBLAS reference output within specified tolerances.

Scenario: Standard FP32 kernels correctness

  • WHEN any standard FP32 kernel (Naive, Tiled, Bank-Free, Double-Buffer) is executed
  • THEN the output SHALL match cuBLAS with rtol=1e-3, atol=1e-4

Scenario: Tensor Core kernel correctness

  • WHEN the Tensor Core kernel is executed with aligned dimensions
  • THEN the output SHALL match cuBLAS with rtol=5e-2, atol=1e-2

Detailed Requirements

REQ-KERNEL-001: Kernel Implementations

Status: Active Priority: High Source: FR-1 (Product Requirements)

The project SHALL implement five CUDA SGEMM kernel variants:

ID Requirement Status
REQ-KERNEL-001.1 Naive Kernel: Basic triple-loop, one output per thread Complete
REQ-KERNEL-001.2 Tiled Kernel: Shared memory blocking for data reuse Complete
REQ-KERNEL-001.3 Bank-Free Kernel: Padding to eliminate bank conflicts Complete
REQ-KERNEL-001.4 Double-Buffer Kernel: Dual buffers for compute/memory overlap Complete
REQ-KERNEL-001.5 Tensor Core Kernel: WMMA API for FP16→FP32 Complete

Acceptance Criteria:

  • All 5 kernel implementations complete and functional
  • Each kernel in separate file under src/kernels/

REQ-KERNEL-002: Correctness Verification

Status: Active Priority: High Source: FR-2 (Product Requirements)

All kernels must match cuBLAS reference output.

Tolerances:

Kernel Type Relative Tolerance (rtol) Absolute Tolerance (atol)
Standard FP32 (Naive, Tiled, Bank-Free, Double-Buffer) 1e-3 1e-4
Tensor Core (FP16→FP32 mixed precision) 5e-2 1e-2

Acceptance Criteria:

  • All kernels pass cuBLAS comparison
  • Property-based tests cover 100+ dimension combinations

REQ-KERNEL-003: Performance Benchmark

Status: Active Priority: Medium Source: FR-3 (Product Requirements)

The project shall provide benchmarking infrastructure to measure and compare kernel performance.

Acceptance Criteria:

  • CUDA Events-based timing for accurate GPU measurement
  • GFLOPS calculation and reporting
  • Performance comparison against cuBLAS baseline
  • Roofline model data export for performance analysis

REQ-KERNEL-004: Build System Support

Status: Active Priority: High Source: FR-4 (Product Requirements)

Acceptance Criteria:

  • CMake build system (primary, recommended)
  • Makefile for quick local builds
  • Multi-GPU architecture support: sm_70, sm_75, sm_80, sm_86, sm_89, sm_90

Constraints

CON-KERNEL-001: CUDA Compatibility

  • Target: CUDA 11.0+
  • Compute Capability: 7.0+ (Volta through Hopper)
  • Memory: Must operate within GPU memory limits

CON-KERNEL-002: Build Systems

1
2
3
4
5
6
# CMake (primary)
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Makefile (quick)
make GPU_ARCH=sm_86

CON-KERNEL-003: Educational Focus

  • Code MUST be readable and well-commented
  • Each optimization level MUST be independently compilable
  • Progressive complexity for learning purposes

Benchmark Configuration

Default benchmark matrix includes:

  • Aligned square cases: 512×512×512, 1024×1024×1024
  • Aligned non-square case: 256×384×640
  • Unaligned edge case: 511×513×1025 (exercises Tensor Core fallback path)

Performance Expectations

Stage Kernel Technique Expected Speedup
1 Naive Baseline
2 Tiled Shared memory blocking ~1.2-1.5×
3 Bank-Free Bank conflict elimination ~1.1× over Tiled
4 Double-Buffer Compute/memory overlap ~1.1× over Bank-Free
5 Tensor Core WMMA API ~3-4× over Double-Buffer

References