Kernel Design

This page explains the main implementation ideas used by the repository’s Triton kernels.

fused_rmsnorm_rope

The central idea is to keep the normalized values in registers long enough to apply RoPE before writing the final output.

Design goals:

  • compute RMS statistics per row,
  • apply the weight scale,
  • immediately rotate the head pairs,
  • write only the final tensor.

Why it matters:

  • the unfused path would typically materialize an intermediate normalized tensor,
  • the fused path avoids that extra global-memory traffic.
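
For intuition, a plain PyTorch sketch of the fused computation is shown below. The function name, the shapes, and the interleaved even/odd pairing convention are assumptions for illustration, not the repository's actual API; it also assumes the RMS statistic is taken over each token's full hidden vector.

    import torch

    def rmsnorm_rope_reference(x, weight, cos, sin, eps=1e-6):
        # Hypothetical reference, not the repository's API.
        # x:        (seq_len, num_heads, head_dim)
        # weight:   (num_heads * head_dim,)
        # cos, sin: (seq_len, head_dim // 2) precomputed rotary tables
        seq_len, num_heads, head_dim = x.shape

        # 1. RMS statistic per row (here: per token, over the full hidden vector).
        flat = x.reshape(seq_len, -1).float()
        inv_rms = torch.rsqrt(flat.pow(2).mean(dim=-1, keepdim=True) + eps)

        # 2. Normalize and apply the weight scale.
        normed = (flat * inv_rms * weight.float()).reshape(seq_len, num_heads, head_dim)

        # 3. Rotate each (even, odd) pair of the head dimension.
        x1, x2 = normed[..., 0::2], normed[..., 1::2]
        c, s = cos[:, None, :], sin[:, None, :]  # broadcast over heads
        out = torch.empty_like(normed)
        out[..., 0::2] = x1 * c - x2 * s
        out[..., 1::2] = x1 * s + x2 * c

        # 4. Write only the final tensor; the fused kernel never
        #    materializes `normed` in global memory.
        return out.to(x.dtype)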

fused_gated_mlp

The kernel computes two projections for the same input tile:

  • gate projection,
  • up projection.

It then applies the selected activation to the gate projection and multiplies the result elementwise with the up projection:

    output = activation(gate_proj(x)) * up_proj(x)

This fuses both projections and the activation into a single kernel launch instead of splitting them across separate operations.
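
A minimal PyTorch sketch of the same computation, with hypothetical weight names and SiLU standing in for whatever activation is selected:

    import torch
    import torch.nn.functional as F

    def gated_mlp_reference(x, w_gate, w_up, activation=F.silu):
        # Hypothetical reference; the fused kernel computes both
        # projections and the activation in one Triton launch.
        gate = x @ w_gate.T  # gate projection
        up = x @ w_up.T      # up projection
        return activation(gate) * up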

fp8_gemm

The GEMM kernel works with the repository’s FP8 compatibility representation:

  • values stored in uint8,
  • explicit scales loaded from scalar tensors,
  • FP32 accumulation,
  • half-precision output path.
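
Expressed in plain PyTorch, the semantics of that representation might look like the sketch below. It assumes an e4m3 bit layout, per-tensor scalar scales, and a PyTorch build with float8 dtypes; the function name is hypothetical.

    import torch

    def fp8_gemm_reference(a_u8, a_scale, b_u8, b_scale):
        # Hypothetical reference for the compatibility representation:
        # uint8 tensors reinterpreted as FP8 bits plus one scalar scale each.
        a = a_u8.view(torch.float8_e4m3fn).to(torch.float32) * a_scale
        b = b_u8.view(torch.float8_e4m3fn).to(torch.float32) * b_scale
        acc = a @ b                   # FP32 accumulation
        return acc.to(torch.float16)  # half-precision output path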

The code also uses grouped output-tile ordering to improve cache locality.
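
That ordering follows the grouped pattern from the Triton matmul tutorial: consecutive program ids are remapped so that GROUP_SIZE_M rows of output tiles are processed together, improving L2 reuse. A host-side illustration of the index remapping (in the kernel itself this is written with tl.program_id and tl.cdiv):

    def grouped_tile_indices(pid, M, N, BLOCK_M, BLOCK_N, GROUP_SIZE_M):
        # Host-side illustration of the grouped output-tile ordering.
        num_pid_m = -(-M // BLOCK_M)  # ceil division
        num_pid_n = -(-N // BLOCK_N)
        num_pid_in_group = GROUP_SIZE_M * num_pid_n
        group_id = pid // num_pid_in_group
        first_pid_m = group_id * GROUP_SIZE_M
        group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
        pid_m = first_pid_m + ((pid % num_pid_in_group) % group_size_m)
        pid_n = (pid % num_pid_in_group) // group_size_m
        return pid_m, pid_n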

Tiling heuristics

The current launchers choose block sizes heuristically from the problem dimensions rather than autotuning online on each call.

Examples:

  • larger tiles for larger GEMMs,
  • smaller BLOCK_K when the reduction dimension is smaller,
  • simple fixed tile choices in the current fused Gated MLP path.

This keeps the runtime path predictable and small, while leaving more elaborate search to the generic autotuner tools.
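
A sketch of what such a heuristic can look like, with made-up thresholds and tile sizes rather than the launchers' actual values:

    def pick_gemm_blocks(M, N, K):
        # Hypothetical heuristic; thresholds and tile sizes are
        # illustrative, not the repository's real choices.
        if M >= 2048 and N >= 2048:
            block_m, block_n = 128, 128  # larger tiles for larger GEMMs
        else:
            block_m, block_n = 64, 64
        block_k = 32 if K >= 1024 else 16  # smaller BLOCK_K for smaller K
        return block_m, block_n, block_k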

Reference implementations matter

Each kernel module also carries a reference implementation in plain PyTorch. Those references are important because they provide:

  • correctness comparisons,
  • a readable mathematical baseline,
  • reference outputs for verifying benchmark runs.
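
For instance, a minimal harness built on torch.testing.assert_close can compare the two paths directly; the function names and tolerances here are illustrative:

    import torch

    def check_against_reference(fused_fn, reference_fn, *inputs,
                                rtol=1e-2, atol=1e-2):
        # Hypothetical harness; the repository's own test entry
        # points may be named and structured differently.
        fused_out = fused_fn(*inputs)
        ref_out = reference_fn(*inputs)
        torch.testing.assert_close(fused_out, ref_out, rtol=rtol, atol=atol)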

The design philosophy is not just speed, but speed with a local verification path.