Benchmark Results

Representative performance notes, not a universal promise

Reference snapshot

Sample numbers from an RTX 3060 Laptop GPU at M = N = K = 1024:

| Kernel                           | GFLOPS | vs cuBLAS |
|----------------------------------|--------|-----------|
| cuBLAS                           | 5727   | 100.0%    |
| Tensor Core (WMMA compute-only)  | 2300   | 40.2%     |
| Tiled                            | 753    | 13.1%     |
| Double Buffer                    | 701    | 12.2%     |
| Bank-Free                        | 673    | 11.8%     |
| Naive                            | 604    | 10.6%     |

Tensor Core note

The benchmark reports:

  • WMMA end-to-end: the safe FP32 wrapper, including conversion and fallback handling
  • WMMA compute-only: the pure pre-converted FP16 path, shown only when M, K, and N are multiples of 16

When the dimensions are not Tensor Core friendly, the implementation falls back to a safer FP32 path instead of forcing WMMA.

MIT Licensed