CUDA Kernel Academy
Systematic CUDA kernel engineering from SGEMM fundamentals to reusable inference components
A repository for people who want to understand how CUDA kernels evolve from a first correct GEMM into reusable kernels, advanced optimization experiments, and lightweight inference plumbing.
4 core modules
2 build systems
1 OpenSpec workflow
Why this repo exists
Most CUDA learning material is either too small to feel like engineering or too large to understand end to end. CUDA Kernel Academy sits in the middle:
- module 01 teaches the optimization ladder directly on SGEMM
- module 02 turns those ideas into a reusable kernel library shape
- module 03 explores more advanced CUDA and HPC patterns
- module 04 shows how kernels, memory, streams, and configuration fit into a small inference-oriented system
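The first rung of that ladder, going from a correct GEMM to a tiled one, can be sketched on the host before any GPU code is involved. The block below is a minimal CPU-side illustration, not code from the repository: the function names, the `kTile` value, and the test data are all hypothetical. It shows a naive row-major SGEMM used as the reference and a loop-tiled variant checked against it. On a GPU, each tile would typically be staged in shared memory by a thread block instead of relying on the CPU cache.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical tile edge for illustration; not taken from the repository.
constexpr int kTile = 16;

// Reference: row-major C = A * B with A (M x K), B (K x N), C (M x N).
void sgemm_naive(int M, int N, int K, const float* A, const float* B, float* C) {
    assert(M > 0 && N > 0 && K > 0);
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    }
}

// Tiled variant: same arithmetic, but the loops walk kTile-sized blocks so
// each block of A and B is reused while it is still close to the processor.
void sgemm_tiled(int M, int N, int K, const float* A, const float* B, float* C) {
    assert(M > 0 && N > 0 && K > 0);
    std::fill(C, C + static_cast<std::size_t>(M) * N, 0.0f);
    for (int i0 = 0; i0 < M; i0 += kTile) {
        const int i1 = std::min(i0 + kTile, M);  // clamp partial edge tiles
        for (int j0 = 0; j0 < N; j0 += kTile) {
            const int j1 = std::min(j0 + kTile, N);
            for (int k0 = 0; k0 < K; k0 += kTile) {
                const int k1 = std::min(k0 + kTile, K);
                for (int i = i0; i < i1; ++i) {
                    for (int k = k0; k < k1; ++k) {
                        const float a = A[i * K + k];
                        for (int j = j0; j < j1; ++j) C[i * N + j] += a * B[k * N + j];
                    }
                }
            }
        }
    }
}

// Runs both versions on small integer-valued inputs and returns the largest
// absolute difference between their outputs.
float max_abs_diff_on_test_problem(int M, int N, int K) {
    std::vector<float> A(static_cast<std::size_t>(M) * K), B(static_cast<std::size_t>(K) * N);
    std::vector<float> C0(static_cast<std::size_t>(M) * N), C1(C0.size());
    for (std::size_t i = 0; i < A.size(); ++i) A[i] = static_cast<float>(i % 7) - 3.0f;
    for (std::size_t i = 0; i < B.size(); ++i) B[i] = static_cast<float>(i % 5) - 2.0f;
    sgemm_naive(M, N, K, A.data(), B.data(), C0.data());
    sgemm_tiled(M, N, K, A.data(), B.data(), C1.data());
    float worst = 0.0f;
    for (std::size_t i = 0; i < C0.size(); ++i) worst = std::max(worst, std::fabs(C0[i] - C1[i]));
    return worst;
}
```

Calling `max_abs_diff_on_test_problem(33, 45, 29)` exercises shapes that are not multiples of the tile size, which is where tiled kernels most often go wrong. Keeping a slow, obviously correct reference around is a common way to validate each later optimization step.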
Project map
| Module | What you learn | Build path |
|---|---|---|
| 01-sgemm-tutorial | tiled SGEMM, bank conflicts, double buffering, WMMA | standalone Makefile |
| 02-tensorcraft-core | reusable kernel APIs, header-only layout, operator surface | root/module CMake |
| 03-hpc-advanced | advanced optimization topics, experiments, CUDA 12+ features | root/module CMake |
| 04-inference-engine | tensor plumbing, memory pools, streams, lightweight inference flow | root/module CMake |
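To make the "memory pools" entry above concrete, here is a minimal sketch of one common pooling pattern: a size-bucketed caching allocator that keeps freed blocks for reuse instead of returning them to the backend. It is an assumption about the general technique, not the design used in 04-inference-engine. The class name `CachingPool`, the 256-byte granularity, and the use of `std::malloc` as a stand-in for a device allocator such as `cudaMalloc` are all hypothetical choices made so the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Hypothetical caching pool. std::malloc / std::free stand in for a device
// allocator so the sketch can run on any machine.
class CachingPool {
public:
    CachingPool() = default;
    CachingPool(const CachingPool&) = delete;
    CachingPool& operator=(const CachingPool&) = delete;

    // Releases cached blocks only; blocks that were never returned to the
    // pool are the caller's responsibility.
    ~CachingPool() {
        for (auto& entry : free_lists_)
            for (void* p : entry.second) std::free(p);
    }

    // Returns a block of at least `bytes` bytes, reusing a cached block of
    // the same rounded size when one is available.
    void* allocate(std::size_t bytes) {
        const std::size_t size = round_up(bytes);
        std::vector<void*>& bucket = free_lists_[size];
        void* p = nullptr;
        if (!bucket.empty()) {
            p = bucket.back();
            bucket.pop_back();
        } else {
            p = std::malloc(size);
            if (p == nullptr) return nullptr;
            ++backend_allocations_;
        }
        live_[p] = size;
        return p;
    }

    // Returns a block to the pool; it is cached, not freed.
    void deallocate(void* p) {
        auto it = live_.find(p);
        assert(it != live_.end() && "pointer was not allocated by this pool");
        free_lists_[it->second].push_back(p);
        live_.erase(it);
    }

    // Number of times the backend allocator was actually called.
    std::size_t backend_allocations() const { return backend_allocations_; }

private:
    // Rounds a request up to a multiple of 256 bytes (hypothetical granularity)
    // so that nearby sizes share a bucket.
    static std::size_t round_up(std::size_t bytes) {
        const std::size_t granularity = 256;
        if (bytes == 0) bytes = 1;
        return (bytes + granularity - 1) / granularity * granularity;
    }

    std::unordered_map<std::size_t, std::vector<void*>> free_lists_;
    std::unordered_map<void*, std::size_t> live_;
    std::size_t backend_allocations_ = 0;
};
```

In this sketch, allocating 1000 bytes, returning the block, and then requesting 1024 bytes hands back the same pointer with only one backend allocation. The motivation for pooling on a GPU is that device allocation and deallocation are comparatively expensive and can synchronize, so inference code usually tries to avoid them on the hot path.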
Start here
| If you want to... | Go to... |
|---|---|
| understand CUDA optimization from first principles | 01-sgemm-tutorial |
| inspect a reusable kernel library layout | 02-tensorcraft-core |
| study advanced CUDA/HPC experiments | 03-hpc-advanced |
| see kernels embedded in a tiny system | 04-inference-engine |
| understand how to build, verify, and contribute | docs/README.md |
Quick start
git clone https://github.com/LessUp/cuda-kernel-academy.git
cd cuda-kernel-academy
cmake --list-presets
cmake --preset default
cmake --build --preset default
ctest --preset default
For the standalone tutorial (sm_86 is an example; set GPU_ARCH to match your GPU):
cd 01-sgemm-tutorial
make GPU_ARCH=sm_86
make test
Build reality
- the root CMake graph covers 02-tensorcraft-core, 03-hpc-advanced, 04-inference-engine, common, and examples
- 01-sgemm-tutorial intentionally stays outside that graph
- GitHub Actions only runs CPU-safe checks
- real CUDA build and runtime validation should happen on a local GPU machine
Documentation
- Documentation index
- Development workflow
- AI tooling guide
- Installation guide
- Troubleshooting
- Contributing
Requirements
| Component | Minimum | Recommended |
|---|---|---|
| CUDA Toolkit | 12.0 | 12.x |
| CMake | 3.20 | 3.24+ |
| Compiler | GCC 9 / Clang 10 | GCC 11+ |
| GPU | Volta (sm_70) | Ampere / Ada / Hopper |
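The GPU row can be turned into a simple pre-build check. The sketch below is illustrative only and is not part of the repository's build scripts: `parse_sm` and `meets_minimum` are hypothetical helpers that take an architecture string in the `sm_XY` form used by the tutorial's `GPU_ARCH` variable and compare it with the Volta (sm_70) minimum from the table. Suffixed names such as `sm_90a` are deliberately rejected to keep the parser small.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Parses "sm_86" into 86. Returns -1 for anything that is not "sm_" followed
// by one to three decimal digits. Hypothetical helper for illustration.
int parse_sm(const std::string& arch) {
    const std::string prefix = "sm_";
    if (arch.compare(0, prefix.size(), prefix) != 0) return -1;
    const std::size_t digits = arch.size() - prefix.size();
    if (digits == 0 || digits > 3) return -1;
    int value = 0;
    for (std::size_t i = prefix.size(); i < arch.size(); ++i) {
        const char c = arch[i];
        if (c < '0' || c > '9') return -1;
        value = value * 10 + (c - '0');
    }
    return value;
}

// True when the architecture string parses and is at least `minimum`.
// The default of 70 mirrors the Volta (sm_70) minimum in the table above.
bool meets_minimum(const std::string& arch, int minimum = 70) {
    const int sm = parse_sm(arch);
    return sm >= 0 && sm >= minimum;
}
```

For example, `meets_minimum("sm_86")` is true and `meets_minimum("sm_61")` is false. On a machine with the CUDA runtime installed, the same comparison could instead use the major and minor compute capability reported for the device.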
Citation
@misc{cuda-kernel-academy,
author = {CUDA Kernel Academy Contributors},
title = {CUDA Kernel Academy},
year = {2026},
publisher = {GitHub},
url = {https://github.com/LessUp/cuda-kernel-academy}
}
License
MIT