FP8 Best Practices
This repository’s FP8 path is useful, but it is not a universal replacement for higher-precision execution.
Where FP8 fits well
- matrix-multiplication-heavy inference paths,
- projection layers where moderate quantization error is acceptable,
- memory-sensitive inference workloads.
Where to stay cautious
- normalization steps,
- numerically fragile output heads,
- any path where you have not compared against an FP16 or BF16 baseline.
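As a concrete illustration of that split, here is a minimal sketch that keeps the matmul-heavy projection in FP8 via this repository's fp8_gemm while the normalization and output head stay in FP16. The shapes and layer choices are illustrative, not a recommendation from the repository.

```python
import torch
import torch.nn.functional as F
from triton_ops import fp8_gemm

hidden = torch.randn(32, 4096, device="cuda", dtype=torch.float16)
w_proj = torch.randn(4096, 4096, device="cuda", dtype=torch.float16) * 0.02
w_head = torch.randn(4096, 32000, device="cuda", dtype=torch.float16) * 0.02

# Matmul-heavy projection: a reasonable place for FP8.
proj = fp8_gemm(hidden, w_proj)

# Normalization and the output head stay in FP16 (numerically fragile boundaries).
# The cast covers the case where fp8_gemm returns a different dtype.
normed = F.layer_norm(proj.to(torch.float16), (4096,))
logits = torch.matmul(normed, w_head)
```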
Prefer explicit validation
When you introduce FP8 into a new workload, compare against a higher-precision baseline:
```python
import torch
from triton_ops import fp8_gemm

# Small values keep the FP16 reference itself well-conditioned.
a = torch.randn(256, 512, device="cuda", dtype=torch.float16) * 0.02
b = torch.randn(512, 256, device="cuda", dtype=torch.float16) * 0.02

fp16_out = torch.matmul(a, b)  # higher-precision baseline
fp8_out = fp8_gemm(a, b)       # FP8 path under test

# Element-wise relative error; the 1e-6 guards against division by zero.
rel_error = (fp8_out.float() - fp16_out.float()).abs() / (fp16_out.float().abs() + 1e-6)
print(rel_error.mean().item())
```
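The mean alone can hide localized blow-ups, so it is also worth inspecting the worst element. The tolerance below is only a placeholder, not a repository recommendation; derive the real budget from end-to-end accuracy on your workload.

```python
# Worst-case relative error; a handful of outliers can matter even when the mean looks fine.
print(rel_error.max().item())

# Placeholder tolerance, not a repository recommendation.
assert rel_error.mean().item() < 5e-2, "FP8 GEMM error exceeds the chosen budget"
```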
Automatic vs explicit quantization
Automatic quantization
Use:
```python
out = fp8_gemm(a, b)
```
This is the shortest path and is usually a good starting point.
Explicit quantization
Use:
```python
from triton_ops import quantize_fp8, fp8_gemm

a_fp8, a_scale = quantize_fp8(a)  # quantize inputs up front
b_fp8, b_scale = quantize_fp8(b)
out = fp8_gemm(a_fp8, b_fp8, a_scale, b_scale)
```
This is useful when:
- you want to reuse quantized tensors (see the sketch after this list),
- you want visibility into scale values,
- you want to control when quantization happens.
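For example, the reuse case might look like the following sketch, which quantizes the weight matrix once and reuses it across several calls; the variable names and shapes are illustrative.

```python
import torch
from triton_ops import quantize_fp8, fp8_gemm

w = torch.randn(512, 256, device="cuda", dtype=torch.float16) * 0.02
w_fp8, w_scale = quantize_fp8(w)  # quantize the static operand once

for _ in range(4):
    x = torch.randn(256, 512, device="cuda", dtype=torch.float16) * 0.02
    x_fp8, x_scale = quantize_fp8(x)                 # activations change every call
    out = fp8_gemm(x_fp8, w_fp8, x_scale, w_scale)   # reuse the quantized weights
```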
Overflow handling
The overflow helper is not part of the root-package export list, so import it directly from the kernel module:

```python
from triton_ops.kernels.fp8_quantize import quantize_fp8_with_overflow_handling
```

Use it when you expect extreme value ranges and want retry-based scale reduction before failing with NumericalOverflowError.
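A hedged sketch of how that might be exercised; the (values, scale) return shape is assumed to mirror quantize_fp8, and the extreme scaling of the input is only there to illustrate the overflow scenario.

```python
import torch
from triton_ops.kernels.fp8_quantize import quantize_fp8_with_overflow_handling

# Deliberately extreme dynamic range to exercise the retry path.
x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16) * 1e4

# Assumed to return (quantized_values, scale) like quantize_fp8; raises
# NumericalOverflowError if scale reduction still cannot fit the range.
x_fp8, x_scale = quantize_fp8_with_overflow_handling(x)
```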
Integration advice for FP8Linear
FP8Linear caches quantized weights after the first forward pass. That is a strong fit for inference-oriented code, but the cached copy can go stale when the underlying weights change, so be careful about using it in training loops where weights continue to update.
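A hedged usage sketch for the inference case; the constructor signature (in_features, out_features), the root-package export, and standard nn.Module behavior of FP8Linear are assumptions here, so check the actual class before copying this.

```python
import torch
from triton_ops import FP8Linear  # export location assumed

# Constructor signature assumed; see the real FP8Linear definition.
layer = FP8Linear(512, 256).cuda().half().eval()

with torch.inference_mode():
    x = torch.randn(8, 512, device="cuda", dtype=torch.float16)
    y = layer(x)  # first call quantizes and caches the weights; later calls reuse the cache

# If the weights do get updated (e.g. fine-tuning), the cached quantized copy goes stale,
# so either re-create the layer afterwards or keep FP8Linear out of the training loop.
```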
Rule of thumb
- Keep the numerically sensitive boundaries in higher precision.
- Use FP8 where the memory and throughput trade-off is actually paying off.
- Always measure the model- or workload-level impact, not only the isolated kernel result.