Benchmarking

The benchmark layer is organized around classes and helper functions, not standalone root-level benchmark functions.

`BenchmarkSuite`

BenchmarkSuite(
    warmup_runs: int = 10,
    benchmark_runs: int = 100,
    rtol: float = 1e-3,
    atol: float = 1e-5,
)

Main methods:

benchmark_kernel(...)
compare_with_pytorch(...)
benchmark_rmsnorm_rope(...)
benchmark_gated_mlp(...)
benchmark_fp8_gemm(...)
generate_report(format="text" | "json")
save_report(filepath, format="text" | "json")

Example:

import torch
from triton_ops import BenchmarkSuite, fused_rmsnorm_rope
from triton_ops.kernels.rmsnorm_rope import fused_rmsnorm_rope_reference

suite = BenchmarkSuite(warmup_runs=5, benchmark_runs=20)

x = torch.randn(2, 128, 4096, device="cuda", dtype=torch.float16)
weight = torch.ones(4096, device="cuda", dtype=torch.float16)
cos = torch.randn(128, 64, device="cuda", dtype=torch.float16)
sin = torch.randn(128, 64, device="cuda", dtype=torch.float16)

result = suite.benchmark_kernel(
    fused_rmsnorm_rope,
    fused_rmsnorm_rope_reference,
    "fused_rmsnorm_rope",
    (2, 128, 4096),
    x,
    weight,
    cos,
    sin,
)

`CorrectnessVerifier`

CorrectnessVerifier(rtol: float = 1e-3, atol: float = 1e-5)

Useful methods:

verify(actual, expected) -> tuple[bool, dict]
verify_allclose(actual, expected) -> bool
compute_relative_error(actual, expected) -> float

The detailed verify method returns aggregate statistics such as maximum absolute difference, mean relative difference, and element-count violations.

Standalone correctness helpers

Available from triton_ops.benchmark.correctness:

verify_fp8_accuracy(fp8_result, fp16_baseline, max_relative_error=0.01)
verify_nan_inf_propagation(output, input_has_nan, input_has_inf)

These are useful when you want direct numerical checks outside BenchmarkSuite.

Report objects

triton_ops.benchmark.report defines:

BenchmarkResult
ComparisonResult
PerformanceReport

PerformanceReport can emit:

human-readable text via generate_text_report()
JSON via generate_json_report()

Important accuracy note

This repository’s benchmark utilities are GPU-oriented in the specialized benchmark methods because they allocate tensors on CUDA. For CPU-safe validation of repository health, use the test and build commands rather than the GPU benchmark suite.

For throughput and bandwidth interpretation, see the metric helpers documented in the autotuner page:

compute_gemm_metrics
compute_elementwise_metrics

Benchmarking

BenchmarkSuite

CorrectnessVerifier

Standalone correctness helpers

Report objects

Important accuracy note

Related metric helpers

`BenchmarkSuite`

`CorrectnessVerifier`