RFC 0007: Half-Precision (FP16) GEMM Support
Status
Status: Accepted
Created: 2024
Last Updated: 2024
Overview
Add FP16 (half-precision) GEMM kernel support to leverage Tensor Cores on Volta and newer GPUs and to halve memory traffic relative to FP32.
Motivation
- Tensor Core acceleration: FP16 enables Tensor Core usage on CC 7.0+ GPUs
- Memory bandwidth: 50% reduction vs FP32 (2 bytes vs 4 bytes per element)
- Power efficiency: FP16 arithmetic and data movement consume less energy per operation than FP32
- Model compatibility: Many modern models trained in FP16 or mixed precision
GPU Support Matrix
| Architecture | CC | Tensor Core | FP16 Performance |
|---|---|---|---|
| Volta | 7.0 | Yes (1st gen) | 4x FP32 |
| Turing | 7.5 | Yes (2nd gen) | 4x FP32 |
| Ampere | 8.0, 8.6 | Yes (3rd gen) | 4x FP32 |
| Ada Lovelace | 8.9 | Yes (4th gen) | 4x FP32 |
| Hopper | 9.0 | Yes (5th gen) | 4x FP32 |
Design
FP16 GEMM Kernel
```cuda
// Basic FP16 GEMM (no Tensor Cores, scalar FP16 arithmetic)
__global__ void half_gemm(
    const half* A, const half* B, half* C,
    int M, int N, int K
);

// FP16 GEMM with FP32 accumulation (improved accuracy)
__global__ void half_gemm_fp32_accum(
    const half* A, const half* B, float* C,
    int M, int N, int K
);
```
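As a usage sketch only, a host-side launch of half_gemm_fp32_accum might look like the following. The 16x16 thread-block shape is an assumption that matches the tiling described under Kernel Design Strategy, and launch_half_gemm_fp32_accum is a hypothetical helper, not part of the proposed API.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hypothetical launch wrapper; assumes the half_gemm_fp32_accum declaration
// above is in scope and that dA, dB, dC are device pointers.
void launch_half_gemm_fp32_accum(const half* dA, const half* dB, float* dC,
                                 int M, int N, int K,
                                 cudaStream_t stream = 0) {
    dim3 block(16, 16);  // assumed tile shape, not a fixed API contract
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    half_gemm_fp32_accum<<<grid, block, 0, stream>>>(dA, dB, dC, M, N, K);
}
```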
Memory Layout
FP16 tensors use the same layout as FP32, just 2 bytes per element:
```
FP32: [float0][float1][float2]... (4 bytes each)
FP16: [half0][half1][half2]...    (2 bytes each)
```
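For instance, a 4096x4096 matrix occupies 64 MiB in FP32 but only 32 MiB in FP16:

```cuda
#include <cuda_fp16.h>

// Same element count, half the bytes per element
size_t elements   = 4096ull * 4096ull;
size_t bytes_fp32 = elements * sizeof(float); // 67,108,864 bytes (64 MiB)
size_t bytes_fp16 = elements * sizeof(half);  // 33,554,432 bytes (32 MiB)
```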
Type Conversion API
```cuda
// FP32 → FP16 conversion
void fp32_to_fp16(const float* src, half* dst, size_t elements);

// FP16 → FP32 conversion
void fp16_to_fp32(const half* src, float* dst, size_t elements);

// Mixed-precision GEMM wrapper
void mixed_precision_gemm(
    const float* A_fp32, const float* B_fp32, float* C_fp32,
    int M, int N, int K,
    cudaStream_t stream = 0
);
```
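A usage sketch of the wrapper above, assuming A_fp32/B_fp32/C_fp32 are device pointers and that the wrapper handles the FP32→FP16 conversion internally; the allocation and copy boilerplate here is illustrative only.

```cuda
#include <cuda_runtime.h>

// Hypothetical end-to-end example: copy FP32 inputs to the device, run the
// mixed-precision wrapper declared above, and copy the FP32 result back.
void run_mixed_precision_example(const float* hA, const float* hB, float* hC,
                                 int M, int N, int K) {
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * (size_t)M * K);
    cudaMalloc(&dB, sizeof(float) * (size_t)K * N);
    cudaMalloc(&dC, sizeof(float) * (size_t)M * N);

    cudaMemcpy(dA, hA, sizeof(float) * (size_t)M * K, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(float) * (size_t)K * N, cudaMemcpyHostToDevice);

    // Per the design above: inputs are converted to FP16, the FP16 kernel
    // runs with FP32 accumulation, and an FP32 result is written to dC.
    mixed_precision_gemm(dA, dB, dC, M, N, K);

    cudaMemcpy(hC, dC, sizeof(float) * (size_t)M * N, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```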
Accuracy Considerations
| Operation | FP16 Error | FP32 Error | Notes |
|---|---|---|---|
| Addition | 2^-11 | 2^-24 | Half precision |
| Multiplication | 2^-11 | 2^-24 | Half precision |
| Accumulation (FP32) | 2^-24 | 2^-24 | FP32 accumulator |
| Accumulation (FP16) | 2^-8 | 2^-24 | Accumulated rounding |
Recommendation: Always use FP32 accumulation for GEMM to maintain accuracy.
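For example, FP16 has a 10-bit mantissa, so representable values near 2048 are spaced 2 apart: adding 1 to an FP16 accumulator holding 2048 rounds back to 2048 and the contribution is lost, whereas an FP32 accumulator represents 2049 exactly.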
Expected Performance
| Configuration | FP32 TFLOPS | FP16 TFLOPS | Speedup |
|---|---|---|---|
| RTX 3090 (Ampere) | ~15 | ~60 | 4x |
| RTX 4090 (Ada) | ~20 | ~80 | 4x |
| A100 (Ampere) | ~20 | ~80 | 4x |
Kernel Design Strategy
Phase 1: Basic FP16 Kernel (Current)
- Scalar FP16 arithmetic
- FP32 accumulation for accuracy
- Shared memory tiling with FP16 tiles (see the sketch below)
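A minimal sketch of the Phase 1 approach, assuming square shared-memory tiles; the TILE width of 16 is an assumption for illustration, not a fixed design parameter.

```cuda
#include <cuda_fp16.h>

#define TILE 16  // illustrative tile width

// Phase 1 sketch: scalar FP16 loads into shared-memory tiles, products
// accumulated in an FP32 register for accuracy.
__global__ void half_gemm_fp32_accum(const half* A, const half* B, float* C,
                                     int M, int N, int K) {
    __shared__ half As[TILE][TILE];
    __shared__ half Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;  // FP32 accumulator

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Zero-pad out-of-range elements so partial tiles stay correct
        As[threadIdx.y][threadIdx.x] =
            (row < M && aCol < K) ? A[row * K + aCol] : __float2half(0.0f);
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < K && col < N) ? B[bRow * N + col] : __float2half(0.0f);
        __syncthreads();

        for (int k = 0; k < TILE; ++k) {
            // Convert each FP16 operand to FP32 before the multiply-add
            acc += __half2float(As[threadIdx.y][k]) * __half2float(Bs[k][threadIdx.x]);
        }
        __syncthreads();
    }

    if (row < M && col < N) C[row * N + col] = acc;
}
```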
Phase 2: Tensor Core Acceleration (Future)
- WMMA (Warp-level Matrix Multiply Accumulate) API
- 16x16x16 tile size for Tensor Cores
- Requires CC 7.0+ (see the WMMA sketch below)
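A possible Phase 2 shape, adapted from the standard WMMA pattern and assuming row-major matrices with M, N, and K multiples of 16; the warp-to-tile mapping shown here is illustrative, not the final design.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Phase 2 sketch: each warp computes one 16x16 tile of C on Tensor Cores
// via the WMMA API (CC 7.0+). Launch with blockDim.x a multiple of warpSize;
// warps along x and threads along y each own one output tile.
__global__ void half_gemm_wmma(const half* A, const half* B, float* C,
                               int M, int N, int K) {
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);  // FP32 accumulation

    // March along K in 16-wide steps
    for (int k = 0; k < K; k += 16) {
        int aRow = warpM * 16, aCol = k;
        int bRow = k,          bCol = warpN * 16;
        if (aRow < M && bCol < N) {
            wmma::load_matrix_sync(a_frag, A + aRow * K + aCol, K);
            wmma::load_matrix_sync(b_frag, B + bRow * N + bCol, N);
            wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
        }
    }

    int cRow = warpM * 16, cCol = warpN * 16;
    if (cRow < M && cCol < N) {
        wmma::store_matrix_sync(C + cRow * N + cCol, acc_frag, N, wmma::mem_row_major);
    }
}
```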
Implementation Files
- include/half_gemm.cuh - FP16 kernel declarations and helpers
- src/half_gemm.cu - FP16 GEMM kernel implementation
- tests/test_half_gemm.cpp - FP16 GEMM tests (if added)
Error Handling
| Condition | Behavior |
|---|---|
| FP16 overflow (>65504) | Clamp to max FP16 value |
| FP16 underflow (<6.1e-5) | Flush to zero |
| NaN in FP16 input | Propagate NaN to output |
| Unsupported GPU | Fall back to FP32 kernel |
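The unsupported-GPU fallback could hinge on a compute-capability query like the sketch below; device_supports_fp16_gemm and the dispatch policy are hypothetical, not part of the current API.

```cuda
#include <cuda_runtime.h>

// Hypothetical dispatch helper: returns true when the device can run the
// FP16 path; callers fall back to the existing FP32 GEMM otherwise.
bool device_supports_fp16_gemm(int device = 0) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;  // conservatively fall back if the query fails
    }
    // Tensor Cores and fast FP16 arithmetic require CC 7.0 (Volta) or newer
    return prop.major >= 7;
}
```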
Testing Strategy
- Conversion accuracy: Round-trip FP32→FP16→FP32 within budget
- GEMM correctness: FP16 GEMM matches FP32 within 2^-8 tolerance (see the tolerance check sketched below)
- Performance: FP16 achieves ≥2x speedup over FP32 on supported GPUs
- Edge cases: Overflow, underflow, NaN propagation
- Compatibility: Correct fallback on pre-Volta GPUs
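A sketch of the tolerance check the GEMM-correctness test might use; the 2^-8 relative bound comes from the strategy above, while the small absolute floor for near-zero reference values is an assumption.

```cuda
#include <cmath>
#include <cstddef>

// Every element of the FP16 GEMM result must match the FP32 reference to
// within a 2^-8 relative error, with an assumed absolute floor of 1e-5
// for reference values near zero.
bool matches_within_fp16_tolerance(const float* got, const float* ref, size_t n) {
    const float rel_tol = std::ldexp(1.0f, -8);  // 2^-8 per the testing strategy
    const float abs_tol = 1e-5f;                 // assumed floor near zero
    for (size_t i = 0; i < n; ++i) {
        float err = std::fabs(got[i] - ref[i]);
        if (err > abs_tol && err > rel_tol * std::fabs(ref[i])) return false;
    }
    return true;
}
```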
References