Reproducibility

Reproducibility in this repository means another reader can tell what was run, where it was run, and what kind of claim the result supports.

Minimum local workflow

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
ctest --test-dir build
./build/bin/sgemm_benchmark -a --warmup 10 --benchmark 50
```

This sequence is the minimum bar for a locally reproduced performance statement.

Record the environment

Every reported run should capture:

  • GPU model
  • CUDA toolkit and driver versions
  • benchmark command
  • dimensions or benchmark set used
  • warmup count and benchmark count
  • whether the number is end-to-end or compute-only

Without that metadata, readers cannot tell whether the number is directly comparable to the published snapshot.
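A minimal sketch of capturing that metadata next to a run. The file name and key names are illustrative, not a repository convention; the benchmark flags are the ones shown on this page.

```shell
# Sketch: record the environment metadata listed above so a benchmark number
# can later be matched to the machine and command that produced it.
out="run-$(date +%Y%m%d-%H%M%S).env"   # illustrative file name
{
  echo "gpu: $(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null || echo unknown)"
  echo "driver: $(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null || echo unknown)"
  echo "cuda_toolkit: $( (nvcc --version 2>/dev/null || echo unknown) | tail -n 1 )"
  echo "command: ./build/bin/sgemm_benchmark -a --warmup 10 --benchmark 50"
  echo "warmup: 10"
  echo "benchmark: 50"
  echo "timing: compute-only"   # or end-to-end; always state which
} > "$out"
cat "$out"
```

On a machine without a GPU the `gpu` and `driver` fields fall back to `unknown`, which is itself a signal that the file did not come from a valid local rerun.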

Hosted CI versus local reruns

Hosted CI is still valuable because it proves the documentation, Pages, and governance surfaces stay coherent. But hosted runners do not have the target GPU, so they cannot serve as the evidence source for runtime behavior.

Only local GPU reruns can confirm:

  • correctness against cuBLAS on the actual machine
  • whether Tensor Core fast-path conditions were met
  • whether a measured gain survives the chosen workload mix
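As an illustration of the fast-path point: eligibility typically hinges on dimension alignment. The sketch below assumes a hypothetical multiple-of-16 condition; the real conditions are defined by the kernels in this repository, not by this script.

```shell
# Sketch: a local pre-check of whether benchmark dimensions satisfy a
# hypothetical Tensor Core alignment condition (all dims multiples of 16).
# This only flags candidates; only a local rerun confirms which path ran.
aligned() {
  for d in "$@"; do
    [ $((d % 16)) -eq 0 ] || return 1
  done
}
if aligned 4096 4096 4096; then echo "4096^3: fast-path candidate"; fi
if aligned 4097 4097 4097; then :; else echo "4097^3: fallback path"; fi
```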

Reporting checklist

Before publishing or repeating a result, make sure you can answer all of these:

  • Which GPU produced the number?
  • Which command produced the number?
  • Which benchmark label applies?
  • Which correctness run guarded the benchmark?
  • Which irregular shape prevents the claim from being aligned-only cherry-picking?

If you cannot answer those questions, rerun the experiment before you cite it.
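The checklist can be enforced mechanically before a result is cited. A sketch, assuming an illustrative key-value report format that is not part of the repository:

```shell
# Sketch: refuse to cite a result file that cannot answer every checklist
# question. Field names mirror the checklist above; the file format and all
# example values are illustrative.
cat > run.env <<'EOF'
gpu: NVIDIA RTX A6000
command: ./build/bin/sgemm_benchmark -a --warmup 10 --benchmark 50
label: compute-only
correctness: ctest --test-dir build (all tests passed)
irregular_shape: 4097x4097x4097
EOF

required="gpu command label correctness irregular_shape"
missing=0
for key in $required; do
  grep -q "^$key:" run.env || { echo "missing: $key"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "report OK" || echo "rerun before citing"
```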

MIT Licensed