RFC 0007: Half-Precision (FP16) GEMM Support

Status

Status: Accepted
Created: 2024
Last Updated: 2024

Overview

Add FP16 (half-precision) GEMM kernel support to leverage Tensor Cores on Volta+ GPUs and halve memory traffic compared to FP32.

Motivation

  1. Tensor Core acceleration: FP16 enables Tensor Core usage on CC 7.0+ GPUs
  2. Memory bandwidth: 50% reduction vs FP32 (2 bytes vs 4 bytes per element)
  3. Power efficiency: FP16 arithmetic consumes less energy
  4. Model compatibility: Many modern models trained in FP16 or mixed precision

GPU Support Matrix

| Architecture | Compute Capability | Tensor Cores | FP16 Performance |
|--------------|--------------------|---------------|------------------|
| Volta | 7.0 | Yes (1st gen) | 4x FP32 |
| Turing | 7.5 | Yes (2nd gen) | 4x FP32 |
| Ampere | 8.0, 8.6 | Yes (3rd gen) | 4x FP32 |
| Ada Lovelace | 8.9 | Yes (4th gen) | 4x FP32 |
| Hopper | 9.0 | Yes (4th gen) | 4x FP32 |

Design

FP16 GEMM Kernel

#include <cuda_fp16.h>   // provides the half type

// Basic FP16 GEMM (no Tensor Cores, scalar FP16 arithmetic)
__global__ void half_gemm(
    const half* A, const half* B, half* C,
    int M, int N, int K
);

// FP16 GEMM with FP32 accumulation (improved accuracy)
__global__ void half_gemm_fp32_accum(
    const half* A, const half* B, float* C,
    int M, int N, int K
);
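
A short, hypothetical host-side launcher for the FP32-accumulation variant is sketched below. The 16×16 thread block and the one-thread-per-output-element mapping are assumptions about the kernel's internals, not part of the declared interface.

// Hypothetical launcher (illustration only).
void launch_half_gemm_fp32_accum(const half* d_A, const half* d_B, float* d_C,
                                 int M, int N, int K, cudaStream_t stream = 0) {
    dim3 block(16, 16);                          // assumed thread tile
    dim3 grid((N + block.x - 1) / block.x,       // tiles across columns of C
              (M + block.y - 1) / block.y);      // tiles across rows of C
    half_gemm_fp32_accum<<<grid, block, 0, stream>>>(d_A, d_B, d_C, M, N, K);
}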

Memory Layout

FP16 tensors use the same layout as FP32, just 2 bytes per element:

FP32: [float0][float1][float2]...  (4 bytes each)
FP16: [half0][half1][half2]...     (2 bytes each)
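
For example, an M×K matrix that occupies M*K*4 bytes in FP32 occupies M*K*2 bytes in FP16 (M and K are assumed to be in scope; the variable names are illustrative):

size_t elems = (size_t)M * K;
half*  d_fp16 = nullptr;
float* d_fp32 = nullptr;
cudaMalloc((void**)&d_fp16, elems * sizeof(half));   // 2 bytes per element
cudaMalloc((void**)&d_fp32, elems * sizeof(float));  // 4 bytes per element, twice the footprint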

Type Conversion API

// FP32 → FP16 conversion
void fp32_to_fp16(const float* src, half* dst, size_t elements);

// FP16 → FP32 conversion
void fp16_to_fp32(const half* src, float* dst, size_t elements);

// Mixed-precision GEMM wrapper
void mixed_precision_gemm(
    const float* A_fp32, const float* B_fp32, float* C_fp32,
    int M, int N, int K,
    cudaStream_t stream = 0
);
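
A plausible backing for the conversion routines is an element-wise kernel over the __float2half / __half2float intrinsics from <cuda_fp16.h>; the grid-stride loop, block size, and kernel name below are illustrative assumptions:

#include <cuda_fp16.h>

// Element-wise FP32 -> FP16 conversion using a grid-stride loop.
__global__ void fp32_to_fp16_kernel(const float* src, half* dst, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride) {
        dst[i] = __float2half(src[i]);   // rounds to nearest-even
    }
}

// Host wrapper matching the declared API.
void fp32_to_fp16(const float* src, half* dst, size_t elements) {
    const int block = 256;
    const int grid  = (int)((elements + block - 1) / block);
    fp32_to_fp16_kernel<<<grid, block>>>(src, dst, elements);
}

The FP16 → FP32 direction is symmetric with __half2float, and mixed_precision_gemm would presumably convert A and B to FP16 internally before invoking the FP16 kernel, leaving C in FP32.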

Accuracy Considerations

| Operation | FP16 Error | FP32 Error | Notes |
|-----------|------------|------------|-------|
| Addition | 2^-11 | 2^-24 | Half precision |
| Multiplication | 2^-11 | 2^-24 | Half precision |
| Accumulation (FP32) | 2^-24 | 2^-24 | FP32 accumulator |
| Accumulation (FP16) | 2^-8 | 2^-24 | Accumulated rounding |

Recommendation: Always use FP32 accumulation for GEMM to maintain accuracy.
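
The FP16-accumulation row can be reproduced with a small host-side experiment (compiled with nvcc; the constants are arbitrary, and FP16 accumulation is emulated by rounding the running sum through __float2half after every add):

#include <cuda_fp16.h>
#include <cstdio>

int main() {
    float  sum_fp32 = 0.0f;
    __half sum_fp16 = __float2half(0.0f);
    for (int i = 0; i < 4096; ++i) {
        sum_fp32 += 0.25f;
        // Emulated FP16 accumulation: round the running sum after each add.
        sum_fp16 = __float2half(__half2float(sum_fp16) + 0.25f);
    }
    printf("FP32 accumulator: %.1f\n", sum_fp32);                // 1024.0
    printf("FP16 accumulator: %.1f\n", __half2float(sum_fp16));  // stalls at 512.0 (FP16 spacing there is 0.5)
    return 0;
}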

Performance Targets

| Configuration | FP32 TFLOPS | FP16 TFLOPS | Speedup |
|---------------|-------------|-------------|---------|
| RTX 3090 (Ampere) | ~15 | ~60 | 4x |
| RTX 4090 (Ada) | ~20 | ~80 | 4x |
| A100 (Ampere) | ~20 | ~80 | 4x |

Kernel Design Strategy

Phase 1: Basic FP16 Kernel (Current)

  • Scalar FP16 arithmetic
  • FP32 accumulation for accuracy
  • Shared memory tiling with FP16 tiles (see the sketch after this list)
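
A hedged sketch of this Phase 1 kernel structure is shown below. It assumes row-major A (M×K), B (K×N), C (M×N), a 16×16 tile, and one thread per output element; it illustrates the shape of the kernel rather than the project's actual implementation.

#include <cuda_fp16.h>

#define TILE 16

__global__ void half_gemm_fp32_accum_tiled(const half* A, const half* B, float* C,
                                           int M, int N, int K) {
    __shared__ half As[TILE][TILE];   // FP16 tiles: half the shared-memory footprint of FP32
    __shared__ half Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                 // FP32 accumulator for accuracy

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : __float2half(0.0f);
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : __float2half(0.0f);
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += __half2float(As[threadIdx.y][k]) * __half2float(Bs[k][threadIdx.x]);
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}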

Phase 2: Tensor Core Acceleration (Future)

  • WMMA (Warp-level Matrix Multiply Accumulate) API (see the sketch after this list)
  • 16x16x16 tile size for Tensor Cores
  • Requires CC 7.0+
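
A minimal sketch of the WMMA pattern follows. It assumes row-major FP16 inputs, an FP32 accumulator tile, M, N, K all multiples of 16, and a launch with one 32-thread warp per 16×16 output tile (grid = (N/16, M/16), block = 32); boundary handling and shared-memory staging are omitted.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Requires compute capability 7.0+ (e.g. compile with -arch=sm_70).
__global__ void wmma_half_gemm(const half* A, const half* B, float* C,
                               int M, int N, int K) {
    int tileN = blockIdx.x;   // which 16-column tile of C this warp owns
    int tileM = blockIdx.y;   // which 16-row tile of C this warp owns

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // Walk along K in 16-wide steps; each mma_sync is one warp-wide Tensor Core MMA.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + tileM * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + tileN * 16, N);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }

    wmma::store_matrix_sync(C + tileM * 16 * N + tileN * 16, c_frag, N, wmma::mem_row_major);
}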

Implementation Files

  • include/half_gemm.cuh - FP16 kernel declarations and helpers
  • src/half_gemm.cu - FP16 GEMM kernel implementation
  • tests/test_half_gemm.cpp - FP16 GEMM tests (if added)

Error Handling

| Condition | Behavior |
|-----------|----------|
| FP16 overflow (>65504) | Clamp to max FP16 value |
| FP16 underflow (<6.1e-5) | Flush to zero |
| NaN in FP16 input | Propagate NaN to output |
| Unsupported GPU | Fall back to FP32 kernel |
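
The "unsupported GPU" fallback implies a runtime capability check before dispatch; a possible sketch is below (the dispatch policy is an assumption, not part of this RFC's API):

#include <cuda_runtime.h>

// True if the given device can use Tensor Cores (compute capability 7.0 or newer).
bool device_supports_fp16_tensor_cores(int device = 0) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    return prop.major >= 7;
}

A wrapper would call this once at startup and route to the FP16 kernels or to the existing FP32 GEMM path accordingly.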

Testing Strategy

  1. Conversion accuracy: Round-trip FP32→FP16→FP32 stays within the FP16 rounding budget (2^-11 relative error for values in the normal FP16 range)
  2. GEMM correctness: FP16 GEMM matches FP32 within 2^-8 tolerance (see the check sketched after this list)
  3. Performance: FP16 achieves ≥2x speedup over FP32 on supported GPUs
  4. Edge cases: Overflow, underflow, NaN propagation
  5. Compatibility: Correct fallback on pre-Volta GPUs
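
Item 2's tolerance check could be an element-wise comparison against an FP32 reference; the normalization by max(|ref|, 1) below is an assumed convention rather than something this RFC specifies:

#include <algorithm>
#include <cmath>
#include <cstddef>

// True if every element of `out` is within 2^-8 of `ref` (relative to max(|ref|, 1)).
bool matches_fp32_reference(const float* ref, const float* out, size_t n) {
    const float tol = 1.0f / 256.0f;                       // 2^-8
    for (size_t i = 0; i < n; ++i) {
        float denom = std::max(std::fabs(ref[i]), 1.0f);   // guard near-zero references
        if (std::fabs(out[i] - ref[i]) > tol * denom)
            return false;
    }
    return true;
}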
