Getting Started

Build, run, and validate the project without guessing the toolchain

Requirements

Item	Requirement
GPU	NVIDIA Volta (`sm_70`) or newer
CUDA Toolkit	11.0+
CMake	3.18+
Host compiler	GCC 9+ or Clang 10+

Tensor Core benchmarks require sm_70+ and dimensions aligned to 16. The code still runs on the guarded FP32 path when those conditions are not met.

Recommended build flow

git clone https://github.com/LessUp/sgemm-optimization.git
cd sgemm-optimization

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

Run the default benchmark:

./build/bin/sgemm_benchmark

Run the broader benchmark set:

./build/bin/sgemm_benchmark -a

Run tests:

ctest --test-dir build

Choosing CUDA architectures

By default:

CMake 3.24+ uses native
older CMake falls back to the repository’s explicit architecture list

If you want to override it, use CMake’s native variable:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=86

Quick local Makefile flow:

make GPU_ARCH=sm_86
make benchmark
make test

Validation boundary

Environment	What to run
Local GPU machine	benchmark, runtime verification, `ctest`
Hosted CI	formatting, compile validation, OpenSpec/repository checks, Pages

This split is intentional: GitHub-hosted runners validate repository health, while performance and CUDA runtime correctness still require a real GPU machine.

Useful commands

# one explicit benchmark case
./build/bin/sgemm_benchmark --dims 256 384 640

# longer benchmark run
./build/bin/sgemm_benchmark -a --warmup 10 --benchmark 50

# OpenSpec validation
openspec validate --all