Read Real Triton Kernels
Not toy examples — actual matmul and FlashAttention kernels you can run, benchmark, and study line by line. Compact code, detailed comments.
Build FlashAttention from scratch with Triton — master GPU kernel optimization
Compact but Real: Code small enough to read end-to-end, but not a toy. What you'll learn:
| Topic | Takeaway |
|---|---|
| GPU Memory Hierarchy | Data flow: HBM → L2 → SRAM → Registers |
| Triton Programming | Auto-tiling, autotune, kernel optimization techniques |
| FlashAttention Algorithm | Online softmax (sketched below), causal masking, variable-length sequences |
| Performance Tuning | Block size selection, occupancy optimization, memory profiling |
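The "online softmax" entry is the heart of FlashAttention: the softmax normalizer and the output are accumulated block by block, so the full attention matrix never has to live in memory. Below is a minimal sketch of that accumulation in plain PyTorch (not the repo's Triton kernel; the block size and function name are illustrative):

```python
import torch

def online_softmax_attention(q, k, v, block_size=128):
    """Single-head attention that streams over K/V blocks, keeping only
    running statistics instead of the full [seq, seq] score matrix."""
    seq_len, dim = q.shape
    scale = dim ** -0.5
    m = torch.full((seq_len, 1), float("-inf"))   # running row-wise max
    l = torch.zeros(seq_len, 1)                   # running softmax denominator
    acc = torch.zeros(seq_len, dim)               # running unnormalized output
    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        s = (q @ k_blk.T) * scale                 # scores for this block only
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)                  # block probabilities, unnormalized
        correction = torch.exp(m - m_new)         # rescale old stats to the new max
        l = l * correction + p.sum(dim=-1, keepdim=True)
        acc = acc * correction + p @ v_blk
        m = m_new
    return acc / l                                # normalize once at the end

# Agrees with ordinary two-pass softmax attention.
q, k, v = torch.randn(3, 256, 64).unbind(0)
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(online_softmax_attention(q, k, v), ref, atol=1e-5)
```

The Triton kernels do the same bookkeeping per tile, keeping the running statistics in registers/SRAM while K/V tiles stream in from HBM.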
```bash
# Install
pip install diy-flash-attention
# Or install from source
pip install -e ".[dev]"
# Verify
python -c "from kernels import flash_attention; print('✓ Installation successful')"import torch
from kernels import flash_attention
# FlashAttention: O(N) memory instead of O(N^2), since the full attention matrix is never materialized
q = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)
out = flash_attention(q, k, v, causal=True) # GPT-style causal mask
print(f"Output shape: {out.shape}") # [2, 8, 4096, 64]