FP8 Best Practices

This repository’s FP8 path is useful, but it is not a universal replacement for higher-precision execution.

Where FP8 fits well

  • matrix-multiplication-heavy inference paths,
  • projection layers where moderate quantization error is acceptable,
  • memory-sensitive inference workloads.

Where to stay cautious

  • normalization steps,
  • numerically fragile output heads,
  • any path where you have not compared against an FP16 or BF16 baseline.

Prefer explicit validation

When you introduce FP8 into a new workload, compare against a higher-precision baseline:

import torch
from triton_ops import fp8_gemm

# Small random operands with magnitudes in a realistic activation/weight range.
a = torch.randn(256, 512, device="cuda", dtype=torch.float16) * 0.02
b = torch.randn(512, 256, device="cuda", dtype=torch.float16) * 0.02

# Higher-precision reference versus the FP8 path.
fp16_out = torch.matmul(a, b)
fp8_out = fp8_gemm(a, b)

# Element-wise relative error against the FP16 reference; the epsilon avoids division by zero.
rel_error = (fp8_out.float() - fp16_out.float()).abs() / (fp16_out.float().abs() + 1e-6)
print("mean relative error:", rel_error.mean().item())
print("max relative error: ", rel_error.max().item())
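
Printing the error is a start, but for anything you will rerun it helps to turn the comparison into a hard check. The tolerances below are illustrative placeholders, not values defined by this repository; set them based on what your workload can absorb.

# Illustrative tolerances only; tune them to your workload rather than treating them as defaults.
MEAN_TOL = 1e-2
MAX_TOL = 1e-1

assert rel_error.mean().item() < MEAN_TOL, "mean FP8 relative error exceeds tolerance"
assert rel_error.max().item() < MAX_TOL, "worst-case FP8 relative error exceeds tolerance"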

Automatic vs explicit quantization

Automatic quantization

Use:

out = fp8_gemm(a, b)

This is the shortest path and is usually a good starting point.

Explicit quantization

Use:

from triton_ops import quantize_fp8, fp8_gemm

# Quantize each operand explicitly, keeping the per-tensor scales visible.
a_fp8, a_scale = quantize_fp8(a)
b_fp8, b_scale = quantize_fp8(b)

# The GEMM consumes the pre-quantized tensors together with their scales.
out = fp8_gemm(a_fp8, b_fp8, a_scale, b_scale)

This is useful when:

  • you want to reuse quantized tensors,
  • you want visibility into scale values,
  • you want to control when quantization happens.
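
For the first of these points, a common pattern is to quantize a static operand (typically the weight) once and reuse it across many calls. The loop below is a sketch of that idea using the signatures shown above; activation_batches is an illustrative placeholder for whatever produces your FP16 activations.

from triton_ops import quantize_fp8, fp8_gemm

# Quantize the static weight-like operand once, outside the loop.
w_fp8, w_scale = quantize_fp8(b)

# activation_batches is a placeholder iterable of FP16 tensors.
for x in activation_batches:
    x_fp8, x_scale = quantize_fp8(x)
    y = fp8_gemm(x_fp8, w_fp8, x_scale, w_scale)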

Overflow handling

The overflow helper is not exported from the package root, so import it directly from its module:

from triton_ops.kernels.fp8_quantize import quantize_fp8_with_overflow_handling

Use it when you expect extreme value ranges and want the quantizer to retry with reduced scales before raising NumericalOverflowError.
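
A minimal sketch of wiring the helper into a call site, assuming it mirrors quantize_fp8's (tensor, scale) return and that NumericalOverflowError lives in the same module; check both assumptions against the actual code. a, b, b_fp8, and b_scale follow the earlier examples.

import torch
from triton_ops import fp8_gemm
from triton_ops.kernels.fp8_quantize import quantize_fp8_with_overflow_handling
# Assumed import location for the exception; adjust to wherever the repository defines it.
from triton_ops.kernels.fp8_quantize import NumericalOverflowError

try:
    # Assumed to return (tensor, scale) like quantize_fp8; verify against the real signature.
    a_fp8, a_scale = quantize_fp8_with_overflow_handling(a)
    out = fp8_gemm(a_fp8, b_fp8, a_scale, b_scale)
except NumericalOverflowError:
    # Retries exhausted: keep this path in higher precision instead of forcing FP8.
    out = torch.matmul(a, b)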

Integration advice for FP8Linear

FP8Linear caches its quantized weights after the first forward pass. That is a strong fit for inference-oriented code, but in training loops, where the weights continue to update, the cached quantized weights can go stale, so be careful before using it there.
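
The sketch below shows the inference-oriented usage and flags the training caveat. It assumes FP8Linear is importable from the package root and follows an nn.Linear-style constructor; verify both against the actual module.

import torch
from triton_ops import FP8Linear   # import location is an assumption; adjust to the actual module

# Inference: the first forward quantizes and caches the weight; later calls reuse the cache.
layer = FP8Linear(512, 512).cuda().eval()   # assumes an nn.Linear-style constructor
x = torch.randn(8, 512, device="cuda", dtype=torch.float16)
with torch.no_grad():
    y = layer(x)

# Training caveat: after an optimizer step the cached quantized weight no longer matches
# layer.weight. Either keep FP8Linear out of training loops or refresh the cache after
# each update, using whatever mechanism the class actually provides.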

Rule of thumb

  • Keep the numerically sensitive boundaries in higher precision.
  • Use FP8 where the memory and throughput savings actually pay off.
  • Always measure the model- or workload-level impact, not only the isolated kernel result.
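
For the last point, a small timing and memory harness built on standard PyTorch utilities is usually enough to start with. The version below reuses a, b, and fp8_gemm from the validation example, which only measures the isolated kernel; swap the lambdas for your real model forward pass to get the workload-level number the rule above asks for.

import time
import torch

def benchmark(fn, iters=50):
    # Warm up so one-time compilation or autotuning does not count toward the measurement.
    for _ in range(5):
        fn()
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) / iters * 1e3
    peak_mib = torch.cuda.max_memory_allocated() / 2**20
    return elapsed_ms, peak_mib

# Replace these lambdas with your actual FP16 and FP8 model forward passes.
print("fp16 (ms, peak MiB):", benchmark(lambda: torch.matmul(a, b)))
print("fp8  (ms, peak MiB):", benchmark(lambda: fp8_gemm(a, b)))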