# Benchmark Results

*Representative performance notes, not a universal promise.*
## Read this first
Performance depends on GPU model, CUDA version, clocks, thermals, and matrix shape. Treat the numbers below as a reference snapshot that helps explain the optimization ladder, not as a guarantee for every machine.
## Reference snapshot
Sample numbers from an RTX 3060 Laptop GPU at M = N = K = 1024:
| Kernel | GFLOPS | vs cuBLAS |
|---|---|---|
| cuBLAS | 5727 | 100.0% |
| Tensor Core | 2300 | 40.2% |
| Tiled | 753 | 13.1% |
| Double Buffer | 701 | 12.2% |
| Bank-Free | 673 | 11.8% |
| Naive | 604 | 10.5% |
## What matters more than the exact number
| Transition | Main lesson |
|---|---|
| Naive -> Tiled | Shared-memory reuse matters immediately |
| Tiled -> Bank-Free | Memory layout details can remove hidden bottlenecks |
| Bank-Free -> Double Buffer | Overlap and staging help when memory stalls dominate |
| Double Buffer -> Tensor Core | Specialized hardware changes the ceiling dramatically |
## Tensor Core note
The benchmark reports:
- WMMA end-to-end: includes conversion and fallback handling
- WMMA compute-only: shown only when `M`, `K`, and `N` are all multiples of 16
When the dimensions are not Tensor Core friendly, the implementation falls back to a safer FP32 path instead of forcing WMMA.
## Reproduce on your machine
```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
./build/bin/sgemm_benchmark -a
./build/bin/sgemm_benchmark --dims 256 384 640
```
If you want longer measurements:
```bash
./build/bin/sgemm_benchmark -a --warmup 10 --benchmark 50
```