Context

The project serves as an educational and demonstrative implementation of core GPU optimization techniques in the HPC domain. The architecture must support five distinct kernel implementations while maintaining a clean, testable, and benchmarkable codebase.

Goals / Non-Goals

Goals

  • Support progressive optimization from naive to Tensor Core
  • Maintain consistent kernel interface across all implementations
  • Enable accurate performance benchmarking
  • Ensure numerical correctness verification

Non-Goals

  • Production-grade library (educational focus)
  • Multi-GPU support
  • Other GEMM variants (DGEMM, CGEMM)

Decisions

Three-Layer Architecture

1
2
3
4
5
6
7
8
9
10
11
12
13
14
┌─────────────────────────────────────────────────────────────────┐
│                         main.cu                                  │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │  Benchmark  │  │   Verify    │  │  CLI Parser │              │
│  └──────┬──────┘  └──────┬──────┘  └─────────────┘              │
└─────────┼────────────────┼──────────────────────────────────────┘
          │                │
          ▼                ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Kernel Implementations                        │
│  ┌────────┐ ┌────────┐ ┌────────────┐ ┌─────────────┐ ┌───────┐ │
│  │ Naive  │ │ Tiled  │ │ Bank-Free  │ │ Dbl-Buffer  │ │ TC    │ │
│  └────────┘ └────────┘ └────────────┘ └─────────────┘ └───────┘ │
└─────────────────────────────────────────────────────────────────┘

Unified Kernel Interface

All kernels conform to a unified template interface:

1
2
3
4
5
6
7
8
template<int TILE_SIZE = 32>
void launch_xxx_sgemm(
    const float* A,      // M×K input matrix
    const float* B,      // K×N input matrix
    float* C,            // M×N output matrix
    int M, int K, int N,
    cudaStream_t stream = 0
);

Exception-Based Error Handling

1
2
3
4
5
6
7
8
9
10
11
struct CudaError : std::runtime_error {
    explicit CudaError(const std::string& msg) : std::runtime_error(msg) {}
};

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            throw CudaError(cudaGetErrorString(err)); \
        } \
    } while(0)

Verification Tolerances

Kernel Type rtol atol
Standard FP32 1e-3 1e-4
Tensor Core 5e-2 1e-2

Risks / Trade-offs

Risk Mitigation
Kernel interface changes would require updating all implementations Interface is stable; use delta specs for any changes
Error handling exceptions may impact performance in hot paths Exceptions only on setup/teardown, not kernel execution