Architecture
What the repository contains, and where validation responsibility lives
System shape
```text
main.cu
├── benchmark orchestration
├── verification flow
└── CLI argument handling
src/kernels/
├── naive
├── tiled
├── bank-conflict-free
├── double-buffer
└── tensor-core
src/utils/
├── CUDA RAII and error handling
├── benchmark helpers
└── verification helpers
tests/
└── Google Test coverage against cuBLAS
```
Repository surfaces
| Surface | Role |
|---|---|
| README.md | Repository entry point and quick-start |
| index.md + docs/ | Public landing page and learning-oriented documentation |
| openspec/specs/ | Stable authoritative requirements and governance |
| openspec/changes/ | Active implementation plans and delta specs |
| .github/workflows/ | CI-safe validation and Pages deployment |
The repository intentionally separates public explanation from normative process. OpenSpec governs; docs teach; README introduces.
Kernel contract
All kernel launchers follow the same shape:
```cpp
template<int TILE_SIZE = 32>
void launch_xxx_sgemm(
    const float* A, const float* B, float* C,
    int M, int K, int N,
    cudaStream_t stream = 0
);
```
That shared launcher contract makes it easy to benchmark, swap, and verify kernels without changing the surrounding harness.
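As a rough illustration of why a uniform signature helps, the sketch below stores launchers behind one function-pointer type and runs them on identical inputs. It is not the repository's harness. `SgemmLauncher`, `KernelEntry`, `default_kernels`, `run_all`, and `launch_reference_sgemm` are hypothetical names, `cudaStream_t` is replaced by a stand-in alias so the code compiles without the CUDA toolkit, and the row-major layout (A is M x K, B is K x N, C is M x N) is an assumption the source does not state.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Stand-in alias (assumption): in a real .cu file this type comes from
// <cuda_runtime.h>.
using cudaStream_t = void*;

// One function-pointer type that every launcher of the shared shape fits.
using SgemmLauncher = void (*)(const float* A, const float* B, float* C,
                               int M, int K, int N, cudaStream_t stream);

// Hypothetical CPU stand-in for a kernel launcher. Assumes row-major
// storage with A as M x K, B as K x N, and C as M x N.
template <int TILE_SIZE = 32>
void launch_reference_sgemm(const float* A, const float* B, float* C,
                            int M, int K, int N,
                            cudaStream_t stream = nullptr) {
    (void)stream;  // no stream on the CPU; kept only to match the contract
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}

// A named launcher, so a harness can report results per kernel.
struct KernelEntry {
    std::string name;
    SgemmLauncher launch;
};

// Different template instantiations still share one pointer type.
std::vector<KernelEntry> default_kernels() {
    return {
        {"reference<32>", &launch_reference_sgemm<32>},
        {"reference<16>", &launch_reference_sgemm<16>},
    };
}

// Runs every registered launcher on the same inputs and collects outputs.
std::vector<std::vector<float>> run_all(const std::vector<KernelEntry>& kernels,
                                        const std::vector<float>& A,
                                        const std::vector<float>& B,
                                        int M, int K, int N) {
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B.size() == static_cast<std::size_t>(K) * N);
    std::vector<std::vector<float>> outputs;
    for (const KernelEntry& entry : kernels) {
        std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
        entry.launch(A.data(), B.data(), C.data(), M, K, N, nullptr);
        outputs.push_back(std::move(C));
    }
    return outputs;
}
```

Under these assumptions, adding a kernel to the comparison only means adding one entry to the table; the loop that launches and collects results does not change.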
Validation boundaries
| Area | Local GPU machine | Hosted CI |
|---|---|---|
| CUDA compilation | Yes | Yes |
| Runtime correctness | Yes | No |
| Benchmarking | Yes | No |
| OpenSpec/repository checks | Yes | Yes |
| GitHub Pages buildability | Optional | Yes |
This split is deliberate. The repository does not pretend CI can replace a real CUDA runtime environment.
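To make the "runtime correctness" row more concrete, here is a minimal host-side sketch of the comparison step such a check might perform once a kernel's output and a reference result (for example one produced by cuBLAS) have both been copied back to the host. `VerifyResult`, `verify_against_reference`, and the tolerance values are hypothetical and are not taken from the repository's verification helpers.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Outcome of one comparison (hypothetical type for this sketch).
struct VerifyResult {
    bool passed;
    float max_abs_error;         // largest |result - reference| seen
    std::size_t first_mismatch;  // index of first failure, or size if none
};

// Element-wise check with a mixed tolerance:
//   |result - reference| <= abs_tol + rel_tol * |reference|
// The default tolerances are illustrative assumptions for FP32 GEMM.
VerifyResult verify_against_reference(const std::vector<float>& result,
                                      const std::vector<float>& reference,
                                      float rel_tol = 1e-3f,
                                      float abs_tol = 1e-5f) {
    // A size mismatch is reported as a failure at index 0.
    if (result.size() != reference.size()) {
        return {false, 0.0f, 0};
    }
    VerifyResult out{true, 0.0f, result.size()};
    for (std::size_t i = 0; i < result.size(); ++i) {
        const float diff = std::fabs(result[i] - reference[i]);
        const float bound = abs_tol + rel_tol * std::fabs(reference[i]);
        out.max_abs_error = std::max(out.max_abs_error, diff);
        // Written as !(diff <= bound) so a NaN difference counts as a failure.
        if (!(diff <= bound) && out.passed) {
            out.passed = false;
            out.first_mismatch = i;
        }
    }
    return out;
}
```

A check like this only means something when the result vector came from an actual kernel run on a GPU, which is consistent with the table marking runtime correctness as a local-only responsibility.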
Repository-level design choices
- Progressive kernels keep optimization steps readable.
- RAII wrappers and exception-style error propagation keep CUDA resource handling predictable.
- OpenSpec governs repo-wide changes so docs, workflows, and validation stay aligned.
- Docs stay role-based: README introduces, Pages teach, OpenSpec defines rules.
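The RAII and exception-style bullet above can be illustrated with a small host-only sketch. It is not the repository's wrapper. `DeviceBuffer`, `check`, `Status`, `fake_malloc`, and `fake_free` are hypothetical, and the two `fake_*` functions stand in for `cudaMalloc` and `cudaFree` so the example builds without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <utility>

// Stand-ins for the CUDA runtime (assumptions) so the sketch builds anywhere.
using Status = int;                                 // role of cudaError_t
constexpr Status kSuccess = 0;                      // role of cudaSuccess
constexpr Status kAllocFailed = 2;                  // arbitrary failure code
constexpr std::size_t kFakeDeviceBytes = 1u << 20;  // pretend device capacity
static int live_allocations = 0;                    // lets callers observe frees

Status fake_malloc(void** ptr, std::size_t bytes) {  // stands in for cudaMalloc
    if (bytes == 0 || bytes > kFakeDeviceBytes) return kAllocFailed;
    *ptr = std::malloc(bytes);
    if (*ptr == nullptr) return kAllocFailed;
    ++live_allocations;
    return kSuccess;
}

Status fake_free(void* ptr) {                        // stands in for cudaFree
    if (ptr != nullptr) {
        std::free(ptr);
        --live_allocations;
    }
    return kSuccess;
}

// Converts a status code into an exception so failures cannot be ignored.
void check(Status status, const char* what) {
    if (status != kSuccess) {
        throw std::runtime_error(std::string(what) + " failed with status " +
                                 std::to_string(status));
    }
}

// Owns one allocation of `count` floats and releases it on destruction.
class DeviceBuffer {
public:
    explicit DeviceBuffer(std::size_t count) : count_(count) {
        void* raw = nullptr;
        check(fake_malloc(&raw, count * sizeof(float)), "device allocation");
        data_ = static_cast<float*>(raw);
    }
    ~DeviceBuffer() { fake_free(data_); }  // status ignored: must not throw

    // Copying is disabled so two owners never free the same pointer.
    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;

    // Moving transfers ownership and leaves the source empty.
    DeviceBuffer(DeviceBuffer&& other) noexcept
        : data_(std::exchange(other.data_, nullptr)),
          count_(std::exchange(other.count_, 0)) {}
    DeviceBuffer& operator=(DeviceBuffer&& other) noexcept {
        std::swap(data_, other.data_);
        std::swap(count_, other.count_);
        return *this;
    }

    float* data() const { return data_; }
    std::size_t size() const { return count_; }

private:
    float* data_ = nullptr;
    std::size_t count_ = 0;
};
```

With this shape, a failed allocation surfaces as a `std::runtime_error` at the call site, and any buffers already constructed in the same scope are released during stack unwinding without explicit cleanup code.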