RFC 0007: Half-Precision (FP16) GEMM Support

Status

Status: Accepted
Created: 2024
Last Updated: 2024

Overview

Add FP16 (half-precision) GEMM kernel support to leverage Tensor Cores on Volta+ GPUs and halve memory traffic compared to FP32.

Motivation

  1. Tensor Core acceleration: FP16 enables Tensor Core usage on CC 7.0+ GPUs
  2. Memory bandwidth: 50% reduction vs FP32 (2 bytes vs 4 bytes per element)
  3. Power efficiency: FP16 arithmetic consumes less energy
  4. Model compatibility: Many modern models trained in FP16 or mixed precision

GPU Support Matrix

| Architecture | Compute Capability | Tensor Cores | FP16 Performance |
|--------------|--------------------|---------------|------------------|
| Volta | 7.0 | Yes (1st gen) | 4x FP32 |
| Turing | 7.5 | Yes (2nd gen) | 4x FP32 |
| Ampere | 8.0, 8.6 | Yes (3rd gen) | 4x FP32 |
| Ada Lovelace | 8.9 | Yes (4th gen) | 4x FP32 |
| Hopper | 9.0 | Yes (4th gen) | 4x FP32 |

Design

FP16 GEMM Kernel

#include <cuda_fp16.h>   // provides the half type

// Basic FP16 GEMM (no Tensor Cores, scalar FP16 arithmetic)
__global__ void half_gemm(
    const half* A, const half* B, half* C,
    int M, int N, int K
);

// FP16 GEMM with FP32 accumulation (improved accuracy)
__global__ void half_gemm_fp32_accum(
    const half* A, const half* B, float* C,
    int M, int N, int K
);
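
A short, hypothetical host-side launcher for the FP32-accumulation variant is sketched below. The 16×16 thread block and the one-thread-per-output-element mapping are assumptions about the kernel's internals, not part of the declared interface.

// Hypothetical launcher (illustration only).
void launch_half_gemm_fp32_accum(const half* d_A, const half* d_B, float* d_C,
                                 int M, int N, int K, cudaStream_t stream = 0) {
    dim3 block(16, 16);                          // assumed thread tile
    dim3 grid((N + block.x - 1) / block.x,       // tiles across columns of C
              (M + block.y - 1) / block.y);      // tiles across rows of C
    half_gemm_fp32_accum<<<grid, block, 0, stream>>>(d_A, d_B, d_C, M, N, K);
}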

Memory Layout

FP16 tensors use the same layout as FP32, just 2 bytes per element:

FP32: [float0][float1][float2]...  (4 bytes each)
FP16: [half0][half1][half2]...     (2 bytes each)
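
For example, an M×K matrix that occupies M*K*4 bytes in FP32 occupies M*K*2 bytes in FP16 (M and K are assumed to be in scope; the variable names are illustrative):

size_t elems = (size_t)M * K;
half*  d_fp16 = nullptr;
float* d_fp32 = nullptr;
cudaMalloc((void**)&d_fp16, elems * sizeof(half));   // 2 bytes per element
cudaMalloc((void**)&d_fp32, elems * sizeof(float));  // 4 bytes per element, twice the footprint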

Type Conversion API

// FP32 → FP16 conversion
void fp32_to_fp16(const float* src, half* dst, size_t elements);

// FP16 → FP32 conversion
void fp16_to_fp32(const half* src, float* dst, size_t elements);

// Mixed-precision GEMM wrapper
void mixed_precision_gemm(
    const float* A_fp32, const float* B_fp32, float* C_fp32,
    int M, int N, int K,
    cudaStream_t stream = 0
);
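
A plausible backing for the conversion routines is an element-wise kernel over the __float2half / __half2float intrinsics from <cuda_fp16.h>; the grid-stride loop, block size, and kernel name below are illustrative assumptions:

#include <cuda_fp16.h>

// Element-wise FP32 -> FP16 conversion using a grid-stride loop.
__global__ void fp32_to_fp16_kernel(const float* src, half* dst, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride) {
        dst[i] = __float2half(src[i]);   // rounds to nearest-even
    }
}

// Host wrapper matching the declared API.
void fp32_to_fp16(const float* src, half* dst, size_t elements) {
    const int block = 256;
    const int grid  = (int)((elements + block - 1) / block);
    fp32_to_fp16_kernel<<<grid, block>>>(src, dst, elements);
}

The FP16 → FP32 direction is symmetric with __half2float, and mixed_precision_gemm would presumably convert A and B to FP16 internally before invoking the FP16 kernel, leaving C in FP32.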

Accuracy Considerations

| Operation | FP16 Error | FP32 Error | Notes |
|-----------|------------|------------|-------|
| Addition | 2^-11 | 2^-24 | Half precision |
| Multiplication | 2^-11 | 2^-24 | Half precision |
| Accumulation (FP32) | 2^-24 | 2^-24 | FP32 accumulator |
| Accumulation (FP16) | 2^-8 | 2^-24 | Accumulated rounding |

Recommendation: Always use FP32 accumulation for GEMM to maintain accuracy.
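
The FP16-accumulation row can be reproduced with a small host-side experiment (compiled with nvcc; the constants are arbitrary, and FP16 accumulation is emulated by rounding the running sum through __float2half after every add):

#include <cuda_fp16.h>
#include <cstdio>

int main() {
    float  sum_fp32 = 0.0f;
    __half sum_fp16 = __float2half(0.0f);
    for (int i = 0; i < 4096; ++i) {
        sum_fp32 += 0.25f;
        // Emulated FP16 accumulation: round the running sum after each add.
        sum_fp16 = __float2half(__half2float(sum_fp16) + 0.25f);
    }
    printf("FP32 accumulator: %.1f\n", sum_fp32);                // 1024.0
    printf("FP16 accumulator: %.1f\n", __half2float(sum_fp16));  // stalls at 512.0 (FP16 spacing there is 0.5)
    return 0;
}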

Performance Targets

| Configuration | FP32 TFLOPS | FP16 TFLOPS | Speedup |
|---------------|-------------|-------------|---------|
| RTX 3090 (Ampere) | ~15 | ~60 | 4x |
| RTX 4090 (Ada) | ~20 | ~80 | 4x |
| A100 (Ampere) | ~20 | ~80 | 4x |

Kernel Design Strategy

Phase 1: Basic FP16 Kernel (Current)

  • Scalar FP16 arithmetic
  • FP32 accumulation for accuracy
  • Shared memory tiling with FP16 tiles (see the sketch after this list)
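
A hedged sketch of this Phase 1 kernel structure is shown below. It assumes row-major A (M×K), B (K×N), C (M×N), a 16×16 tile, and one thread per output element; it illustrates the shape of the kernel rather than the project's actual implementation.

#include <cuda_fp16.h>

#define TILE 16

__global__ void half_gemm_fp32_accum_tiled(const half* A, const half* B, float* C,
                                           int M, int N, int K) {
    __shared__ half As[TILE][TILE];   // FP16 tiles: half the shared-memory footprint of FP32
    __shared__ half Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                 // FP32 accumulator for accuracy

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : __float2half(0.0f);
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : __float2half(0.0f);
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += __half2float(As[threadIdx.y][k]) * __half2float(Bs[k][threadIdx.x]);
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}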

Phase 2: Tensor Core Acceleration (Future)

  • WMMA (Warp-level Matrix Multiply Accumulate) API (see the sketch after this list)
  • 16x16x16 tile size for Tensor Cores
  • Requires CC 7.0+
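
A minimal sketch of the WMMA pattern follows. It assumes row-major FP16 inputs, an FP32 accumulator tile, M, N, K all multiples of 16, and a launch with one 32-thread warp per 16×16 output tile (grid = (N/16, M/16), block = 32); boundary handling and shared-memory staging are omitted.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Requires compute capability 7.0+ (e.g. compile with -arch=sm_70).
__global__ void wmma_half_gemm(const half* A, const half* B, float* C,
                               int M, int N, int K) {
    int tileN = blockIdx.x;   // which 16-column tile of C this warp owns
    int tileM = blockIdx.y;   // which 16-row tile of C this warp owns

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // Walk along K in 16-wide steps; each mma_sync is one warp-wide Tensor Core MMA.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + tileM * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + tileN * 16, N);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }

    wmma::store_matrix_sync(C + tileM * 16 * N + tileN * 16, c_frag, N, wmma::mem_row_major);
}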

Implementation Files

  • include/half_gemm.cuh - FP16 kernel declarations and helpers
  • src/half_gemm.cu - FP16 GEMM kernel implementation
  • tests/test_half_gemm.cpp - FP16 GEMM tests (if added)

Error Handling

| Condition | Behavior |
|-----------|----------|
| FP16 overflow (>65504) | Clamp to max FP16 value |
| FP16 underflow (<6.1e-5) | Flush to zero |
| NaN in FP16 input | Propagate NaN to output |
| Unsupported GPU | Fall back to FP32 kernel |
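
The "unsupported GPU" fallback implies a runtime capability check before dispatch; a possible sketch is below (the dispatch policy is an assumption, not part of this RFC's API):

#include <cuda_runtime.h>

// True if the given device can use Tensor Cores (compute capability 7.0 or newer).
bool device_supports_fp16_tensor_cores(int device = 0) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    return prop.major >= 7;
}

A wrapper would call this once at startup and route to the FP16 kernels or to the existing FP32 GEMM path accordingly.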

Testing Strategy

  1. Conversion accuracy: Round-trip FP32→FP16→FP32 stays within the FP16 rounding budget (2^-11 relative error for values in the normal FP16 range)
  2. GEMM correctness: FP16 GEMM matches FP32 within 2^-8 tolerance (see the check sketched after this list)
  3. Performance: FP16 achieves ≥2x speedup over FP32 on supported GPUs
  4. Edge cases: Overflow, underflow, NaN propagation
  5. Compatibility: Correct fallback on pre-Volta GPUs
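
Item 2's tolerance check could be an element-wise comparison against an FP32 reference; the normalization by max(|ref|, 1) below is an assumed convention rather than something this RFC specifies:

#include <algorithm>
#include <cmath>
#include <cstddef>

// True if every element of `out` is within 2^-8 of `ref` (relative to max(|ref|, 1)).
bool matches_fp32_reference(const float* ref, const float* out, size_t n) {
    const float tol = 1.0f / 256.0f;                       // 2^-8
    for (size_t i = 0; i < n; ++i) {
        float denom = std::max(std::fabs(ref[i]), 1.0f);   // guard near-zero references
        if (std::fabs(out[i] - ref[i]) > tol * denom)
            return false;
    }
    return true;
}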
