CUDA Kernel Academy

Systematic CUDA kernel engineering from SGEMM fundamentals to reusable inference components

A repository for people who want to understand how CUDA kernels evolve from a first correct GEMM into reusable kernels, advanced optimization experiments, and lightweight inference plumbing.

4 core modules

2 build systems

1 OpenSpec workflow

Why this repo exists

Most CUDA learning material is either too small to feel like engineering or too large to understand end to end. CUDA Kernel Academy sits in the middle:

module 01 teaches the optimization ladder directly on SGEMM
module 02 turns those ideas into a reusable kernel library shape
module 03 explores more advanced CUDA and HPC patterns
module 04 shows how kernels, memory, streams, and configuration fit into a small inference-oriented system
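
The first rungs of the ladder that module 01 describes can be previewed on the CPU. What follows is a minimal sketch in plain C++, not code from this repository: `sgemm_naive`, `sgemm_tiled`, and the `tile` parameter are hypothetical names chosen for illustration. The blocked loops only mimic the data reuse that a CUDA kernel would get by staging tiles in shared memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Naive reference: C = A * B with row-major storage.
// A is M x K, B is K x N, C is M x N.
std::vector<float> sgemm_naive(const std::vector<float>& A,
                               const std::vector<float>& B,
                               std::size_t M, std::size_t N, std::size_t K) {
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
    return C;
}

// Tiled variant (illustrative assumption): the same product, computed one
// tile x tile block at a time so each block of A and B is reused while hot.
std::vector<float> sgemm_tiled(const std::vector<float>& A,
                               const std::vector<float>& B,
                               std::size_t M, std::size_t N, std::size_t K,
                               std::size_t tile = 16) {
    assert(tile > 0);
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i0 = 0; i0 < M; i0 += tile) {
        for (std::size_t j0 = 0; j0 < N; j0 += tile) {
            for (std::size_t k0 = 0; k0 < K; k0 += tile) {
                // Clamp the block so sizes that are not multiples of tile work.
                const std::size_t i1 = std::min(i0 + tile, M);
                const std::size_t j1 = std::min(j0 + tile, N);
                const std::size_t k1 = std::min(k0 + tile, K);
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t j = j0; j < j1; ++j) {
                        float acc = C[i * N + j];  // continue the partial sum
                        for (std::size_t k = k0; k < k1; ++k) {
                            acc += A[i * K + k] * B[k * N + j];
                        }
                        C[i * N + j] = acc;
                    }
                }
            }
        }
    }
    return C;
}
```

Calling `sgemm_tiled(A, B, M, N, K, 4)` and `sgemm_naive(A, B, M, N, K)` on the same inputs should give matching results, which is the kind of correctness baseline a tiled GPU kernel is usually compared against.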

Project map

Module	What you learn	Build path
01-sgemm-tutorial	tiled SGEMM, bank conflicts, double buffering, WMMA	standalone `Makefile`
02-tensorcraft-core	reusable kernel APIs, header-only layout, operator surface	root/module CMake
03-hpc-advanced	advanced optimization topics, experiments, CUDA 12+ features	root/module CMake
04-inference-engine	tensor plumbing, memory pools, streams, lightweight inference flow	root/module CMake
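
To make the "memory pools" entry in the table above more concrete, here is a rough sketch of the idea in plain C++. It is an assumption-laden illustration, not the design used in 04-inference-engine: `HostBufferPool` and its methods are hypothetical, and `std::malloc` stands in for a device allocation so the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Hypothetical size-keyed pool: released buffers are cached by exact size
// and handed back on the next request of that size instead of reallocating.
class HostBufferPool {
public:
    HostBufferPool() = default;
    HostBufferPool(const HostBufferPool&) = delete;
    HostBufferPool& operator=(const HostBufferPool&) = delete;

    // Frees only cached buffers; buffers still held by callers are not tracked.
    ~HostBufferPool() {
        for (auto& entry : free_) {
            for (void* p : entry.second) {
                std::free(p);
            }
        }
    }

    // Returns a cached buffer of exactly `bytes` if one exists,
    // otherwise falls back to a fresh allocation.
    void* acquire(std::size_t bytes) {
        assert(bytes > 0);
        auto it = free_.find(bytes);
        if (it != free_.end() && !it->second.empty()) {
            void* p = it->second.back();
            it->second.pop_back();
            return p;
        }
        ++fresh_allocations_;
        return std::malloc(bytes);  // stand-in for a device allocation
    }

    // Hands a buffer back to the pool; `bytes` must match the acquire size.
    void release(void* p, std::size_t bytes) {
        free_[bytes].push_back(p);
    }

    std::size_t fresh_allocations() const { return fresh_allocations_; }

private:
    std::map<std::size_t, std::vector<void*>> free_;
    std::size_t fresh_allocations_ = 0;
};
```

The point of the pattern is that repeated inference steps with the same tensor shapes pay the allocation cost once. A real GPU pool would also have to account for stream ordering before reusing a buffer, which this sketch ignores.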

Start here

If you want to...	Go to...
understand CUDA optimization from first principles	01-sgemm-tutorial
inspect a reusable kernel library layout	02-tensorcraft-core
study advanced CUDA/HPC experiments	03-hpc-advanced
see kernels embedded in a tiny system	04-inference-engine
understand how to build, verify, and contribute	docs/README.md

Quick start

git clone https://github.com/LessUp/cuda-kernel-academy.git
cd cuda-kernel-academy

cmake --list-presets
cmake --preset default
cmake --build --preset default
ctest --preset default

For the standalone tutorial:

cd 01-sgemm-tutorial
make GPU_ARCH=sm_86
make test

Build reality

the root CMake graph covers 02-tensorcraft-core, 03-hpc-advanced, 04-inference-engine, common, and examples
01-sgemm-tutorial intentionally stays outside that graph
GitHub Actions only runs CPU-safe checks
real CUDA build and runtime validation should happen on a local GPU machine
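
Runtime validation on a GPU machine typically means comparing kernel output against a CPU reference within a floating-point tolerance. The sketch below shows one common way to do that comparison in plain C++. It is not taken from this repository: `all_close` and its default tolerances are hypothetical and would need tuning per kernel and data type.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: true when every element of `got` is within
// atol + rtol * |want[i]| of the reference value want[i].
bool all_close(const std::vector<float>& got,
               const std::vector<float>& want,
               float rtol = 1e-4f, float atol = 1e-5f) {
    if (got.size() != want.size()) {
        return false;  // shape mismatch is always a failure
    }
    for (std::size_t i = 0; i < got.size(); ++i) {
        const float diff = std::fabs(got[i] - want[i]);
        const float tol = atol + rtol * std::fabs(want[i]);
        // The negated comparison also rejects NaN, since NaN <= tol is false.
        if (!(diff <= tol)) {
            return false;
        }
    }
    return true;
}
```

A mixed absolute and relative tolerance is used because accumulation order differs between CPU and GPU implementations, so bitwise equality is usually too strict for SGEMM results.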

Requirements

Component	Minimum	Recommended
CUDA Toolkit	12.0	12.x
CMake	3.20	3.24+
Compiler	GCC 9 / Clang 10	GCC 11+
GPU	Volta (sm_70)	Ampere / Ada / Hopper

Citation

@misc{cuda-kernel-academy,
  author = {CUDA Kernel Academy Contributors},
  title = {CUDA Kernel Academy},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/LessUp/cuda-kernel-academy}
}

License

MIT
