# Validation
This section explains why the repository's performance claims are trustworthy, and where that trust stops.
Methodology covers how we optimize. Validation covers the evidence boundary around those optimizations: correctness thresholds, benchmark scope labels, hosted-CI limits, and reproducibility expectations.
## The validation model
| Evidence surface | What it proves | What it does not prove |
|---|---|---|
| Hosted CI + local structural checks | Docs/spec structure, Pages fitness, formatting/governance workflows, and repository health checks | GPU runtime correctness, CUDA benchmark numbers, or hardware-specific speedups |
| Local `ctest --test-dir build` on a GPU machine | Runtime correctness against the project's cuBLAS oracle | Universal performance claims |
| Local benchmark execution | Performance behavior on a named GPU, under a named command and scope label | Results on other GPUs, other CUDA stacks, or unlabeled workloads |
## Canonical validation pages
| Need | Page |
|---|---|
| Understand correctness thresholds and oracle policy | Correctness Policy |
| Interpret benchmark labels and reported numbers | Benchmark Scope |
| Reproduce a result responsibly | Reproducibility |
| See a representative snapshot of results | Benchmark Results |
## What hosted CI proves
Hosted CI is trusted to prove repository health: documentation structure, Pages buildability, formatting checks, and OpenSpec/governance alignment. It keeps the public surface coherent.
Hosted CI is not trusted to prove CUDA runtime behavior or benchmark performance. Those claims require a real GPU machine.
## What only local GPU runs can prove
Local GPU runs are required for:
- cuBLAS-backed runtime correctness checks (sketched after this list)
- Tensor Core fast-path versus fallback behavior
- benchmark numbers, including end-to-end versus compute-only differences
- architecture-specific conclusions about occupancy, staging, and memory behavior
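A minimal sketch of the first item, assuming a hypothetical kernel `my_sgemm` and an illustrative FP32 tolerance; the real harness runs under `ctest --test-dir build`, and the authoritative thresholds live in the Correctness Policy:

```cuda
// Sketch: compare a kernel under test against a cuBLAS oracle.
// `my_sgemm` is a hypothetical stand-in, and 1e-5 is an assumed
// FP32 threshold, not the project's policy value.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

__global__ void my_sgemm(int n, const float* A, const float* B, float* C) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < n && col < n) {
    float acc = 0.0f;
    for (int k = 0; k < n; ++k) acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
  }
}

int main() {
  const int n = 256;
  const size_t bytes = size_t(n) * n * sizeof(float);
  std::vector<float> hA(n * n), hB(n * n), hC(n * n), hRef(n * n);
  for (int i = 0; i < n * n; ++i) {
    hA[i] = 0.1f * float(i % 7);
    hB[i] = 0.2f * float(i % 5);
  }

  float *A, *B, *C, *Ref;
  cudaMalloc(&A, bytes); cudaMalloc(&B, bytes);
  cudaMalloc(&C, bytes); cudaMalloc(&Ref, bytes);
  cudaMemcpy(A, hA.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(B, hB.data(), bytes, cudaMemcpyHostToDevice);

  // Kernel under test (row-major C = A * B).
  dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
  my_sgemm<<<grid, block>>>(n, A, B, C);

  // Oracle: cuBLAS is column-major, so computing B*A in column-major
  // yields A*B when the buffer is read back as row-major.
  cublasHandle_t handle;
  cublasCreate(&handle);
  const float alpha = 1.0f, beta = 0.0f;
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
              &alpha, B, n, A, n, &beta, Ref, n);

  cudaMemcpy(hC.data(), C, bytes, cudaMemcpyDeviceToHost);
  cudaMemcpy(hRef.data(), Ref, bytes, cudaMemcpyDeviceToHost);

  // Max relative error against the oracle.
  float worst = 0.0f;
  for (int i = 0; i < n * n; ++i) {
    float denom = std::fmax(std::fabs(hRef[i]), 1.0f);
    worst = std::fmax(worst, std::fabs(hC[i] - hRef[i]) / denom);
  }
  std::printf("%s (max rel err %.3g)\n",
              worst <= 1e-5f ? "PASS" : "FAIL", worst);

  cublasDestroy(handle);
  cudaFree(A); cudaFree(B); cudaFree(C); cudaFree(Ref);
}
```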
## How to read published numbers
Treat every number in this repository as scoped evidence, not a universal promise.
- Read the GPU model and CUDA context first.
- Read the benchmark label second.
- Read the shape set third.
- Only then compare the number to another result.
If any of those fields are missing, the number is a hint, not a claim.
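As a sketch of that reading order turned into a guard, the check below refuses to compare any result that is missing a scope field. All type and field names are hypothetical; the repository's actual labels are defined in Benchmark Scope:

```cpp
// Sketch: a number missing any scope field is a hint, never a claim,
// so it is never compared.
#include <string>

struct ScopedResult {
  std::string gpu_model;        // read first, together with...
  std::string cuda_context;     // ...the CUDA stack it ran on
  std::string benchmark_label;  // read second
  std::string shape_set;        // read third
  double gflops = 0.0;          // only meaningful given the above
};

// True only when both results are fully scoped and scoped identically.
bool comparable(const ScopedResult& a, const ScopedResult& b) {
  auto fully_scoped = [](const ScopedResult& r) {
    return !r.gpu_model.empty() && !r.cuda_context.empty() &&
           !r.benchmark_label.empty() && !r.shape_set.empty();
  };
  return fully_scoped(a) && fully_scoped(b) &&
         a.gpu_model == b.gpu_model &&
         a.cuda_context == b.cuda_context &&
         a.benchmark_label == b.benchmark_label &&
         a.shape_set == b.shape_set;
}
```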
## Common presentation mistakes
- Claiming “Tensor Core is always faster” without shape, conversion, and fallback caveats.
- Quoting one GFLOPS number without its benchmark label or workload scope.
- Ignoring the numerical-tolerance difference between FP32 and mixed precision (see the sketch after this list).
- Treating hosted CI success as proof of CUDA runtime correctness or performance.
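On the tolerance point, a minimal sketch of precision-aware thresholds; the numeric values are assumptions for illustration, not the Correctness Policy's:

```cpp
// Sketch: mixed precision loses mantissa bits in FP16 inputs even with
// FP32 accumulation, so its oracle tolerance must be looser than FP32's.
#include <cmath>

enum class Precision { FP32, MixedFP16 };

// Hypothetical tolerance table; authoritative values live in the
// Correctness Policy.
float rel_tolerance(Precision p) {
  return p == Precision::FP32 ? 1e-5f   // assumed FP32 threshold
                              : 5e-2f;  // assumed mixed-precision threshold
}

bool matches_oracle(float got, float oracle, Precision p) {
  float denom = std::fmax(std::fabs(oracle), 1.0f);
  return std::fabs(got - oracle) / denom <= rel_tolerance(p);
}
```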