Getting Started
Build, run, and validate the project without guessing the toolchain
Requirements
| Item | Requirement |
|---|---|
| GPU | NVIDIA Volta (sm_70) or newer |
| CUDA Toolkit | 11.0+ |
| CMake | 3.18+ |
| Host compiler | GCC 9+ or Clang 10+ |
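The minimums in the table can be checked before configuring. The following is a minimal sketch, not part of the repository: `version_ge` and `check_tool` are hypothetical helper names, the comparison relies on `sort -V` (GNU coreutils and compatible implementations), and the parsing assumes the usual output formats of `cmake --version`, `nvcc --version`, and `gcc -dumpversion`.

```shell
# version_ge A B: succeeds when dotted version A is at least B.
# Assumption: sort supports -V (version sort).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_tool LABEL DETECTED MINIMUM: report one line per tool.
check_tool() {
  if [ -z "$2" ]; then
    echo "$1: not found"
  elif version_ge "$2" "$3"; then
    echo "$1: $2 (ok, need $3+)"
  else
    echo "$1: $2 (too old, need $3+)"
  fi
}

# Detect versions; each variable stays empty when the tool is not installed.
cmake_v=$(cmake --version 2>/dev/null | sed -n '1s/^cmake version //p' || true)
nvcc_v=$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9.]*\).*/\1/p' || true)
gcc_v=$(gcc -dumpversion 2>/dev/null || true)

check_tool "CMake" "$cmake_v" 3.18
check_tool "CUDA Toolkit (nvcc)" "$nvcc_v" 11.0
check_tool "GCC" "$gcc_v" 9
```

Clang could be checked the same way against its minimum of 10; only one of the two host compilers needs to pass.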
Tensor Core benchmarks require sm_70+ and matrix dimensions that are multiples of 16. When either condition is not met, the code still runs, using the guarded FP32 path.
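The two conditions can be expressed as a small predicate. This is only an illustration of the rule stated here, not the project's actual dispatch code: `tensor_core_eligible` is a hypothetical name, and it assumes the alignment requirement applies to all three dimensions passed to the benchmark.

```shell
# tensor_core_eligible SM D1 D2 D3
# SM is the compute capability without the dot (70, 86, ...).
# Prints which path the stated rule would select.
# The function body runs in a subshell so its variables stay local.
tensor_core_eligible() (
  sm=$1; d1=$2; d2=$3; d3=$4
  if [ "$sm" -ge 70 ] &&
     [ $((d1 % 16)) -eq 0 ] &&
     [ $((d2 % 16)) -eq 0 ] &&
     [ $((d3 % 16)) -eq 0 ]; then
    echo "tensor-core"
  else
    echo "fp32-fallback"
  fi
)

tensor_core_eligible 86 256 384 640   # all dimensions are multiples of 16
tensor_core_eligible 86 250 384 640   # 250 is not a multiple of 16
tensor_core_eligible 61 256 384 640   # architecture older than sm_70
```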
Recommended build flow
```shell
git clone https://github.com/LessUp/sgemm-optimization.git
cd sgemm-optimization
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
```
Run the default benchmark:
```shell
./build/bin/sgemm_benchmark
```
Run the broader benchmark set:
```shell
./build/bin/sgemm_benchmark -a
```
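Run tests (`--test-dir` needs CMake 3.20 or newer; on 3.18 or 3.19, run `ctest` from inside the `build` directory instead):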
```shell
ctest --test-dir build
```
Choosing CUDA architectures
By default:
- CMake 3.24+ uses `native`
- older CMake falls back to the repository’s explicit architecture list
To override the default, set CMake’s built-in `CMAKE_CUDA_ARCHITECTURES` variable:
```shell
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=86
```
Quick local Makefile flow:
```shell
make GPU_ARCH=sm_86
make benchmark
make test
```
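If the right architecture value is not known, it can usually be derived from the installed GPU. The sketch below is an assumption-laden convenience, not something the repository ships: `cap_to_arch` and `detect_compute_cap` are hypothetical helpers, and the detection relies on `nvidia-smi` supporting the `compute_cap` query field, which older drivers may lack. It only prints the commands it would suggest.

```shell
# Convert a compute capability such as "8.6" into "86", the form used by
# CMAKE_CUDA_ARCHITECTURES (and, with an "sm_" prefix, by GPU_ARCH).
cap_to_arch() {
  printf '%s\n' "$1" | tr -d '.'
}

# Print the compute capability of the first GPU, or nothing if unavailable.
# Assumption: nvidia-smi is installed and knows the compute_cap field.
detect_compute_cap() {
  nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1
}

cap=$(detect_compute_cap || true)
if [ -n "$cap" ]; then
  arch=$(cap_to_arch "$cap")
  echo "cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=$arch"
  echo "make GPU_ARCH=sm_$arch"
else
  echo "No GPU detected; keep the default architecture selection."
fi
```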
Validation boundary
| Environment | What to run |
|---|---|
| Local GPU machine | benchmark, runtime verification, ctest |
| Hosted CI | formatting, compile validation, OpenSpec/repository checks, Pages |
This split is intentional: GitHub-hosted runners validate repository health, while performance and CUDA runtime correctness still require a real GPU machine.
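The same split can be mirrored in a local script that picks its checks based on whether a GPU is present. This is a rough sketch under stated assumptions: `has_gpu` and `plan_validation` are hypothetical helpers, GPU presence is inferred from `nvidia-smi -L` succeeding, and the script prints the planned commands rather than running them. Formatting and Pages checks are left out because their commands are not given here.

```shell
# Succeeds when nvidia-smi exists and can list at least one device.
has_gpu() {
  command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1
}

# plan_validation gpu|no-gpu: print the commands suited to the environment.
plan_validation() {
  case "$1" in
    gpu)
      # Runtime checks that need real hardware.
      printf '%s\n' "./build/bin/sgemm_benchmark" "ctest --test-dir build"
      ;;
    *)
      # Checks that only validate repository health.
      printf '%s\n' "cmake --build build" "openspec validate --all"
      ;;
  esac
}

if has_gpu; then
  plan_validation gpu
else
  plan_validation no-gpu
fi
```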
Useful commands
```shell
# one explicit benchmark case
./build/bin/sgemm_benchmark --dims 256 384 640
# longer benchmark run
./build/bin/sgemm_benchmark -a --warmup 10 --benchmark 50
# OpenSpec validation
openspec validate --all
```
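To cover several explicit shapes in one go, the `--dims` form can be wrapped in a loop. A minimal sketch: `sweep_commands` is a hypothetical helper, the extra shapes are arbitrary examples, and it only prints the commands so they can be reviewed before anything runs.

```shell
BENCH=./build/bin/sgemm_benchmark

# sweep_commands "D1 D2 D3" ...: emit one --dims invocation per shape.
sweep_commands() {
  for shape in "$@"; do
    echo "$BENCH --dims $shape"
  done
}

sweep_commands "256 384 640" "512 512 512" "1024 1024 1024"
```

Piping the output to `sh` executes the commands once the build exists.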