Internals
This section explains how the library is structured and why the kernels are organized the way they are.
Architecture: Module layout and responsibilities. See how the public API, validation helpers, kernels, autotuner, and benchmark code fit together.

Kernel Design: Tiling and fusion strategy. Read the core ideas behind the Triton kernels and their memory access patterns.

Memory: HBM reduction and SRAM reuse. Understand why fusion matters and how the library reduces traffic to global memory.