DIYFlashAttention

Build FlashAttention from scratch with Triton — master GPU kernel optimization

⚡ 99% Memory Reduction 🚀 1.6x Speedup 📖 Production-Quality Code

Why This Project?

Compact but real: the code is small enough to read end-to-end, yet it is not a toy. You can:

  • ✅ Run real benchmarks on your GPU
  • ✅ Compare performance against PyTorch SDPA
  • ✅ Understand every design decision behind each line

What You'll Learn

Topic | Takeaway
GPU Memory Hierarchy | Data flow: HBM → L2 → SRAM → Registers
Triton Programming | Auto-tiling, autotune, kernel optimization techniques
FlashAttention Algorithm | Online softmax, causal masking, variable-length sequences
Performance Tuning | Block size selection, occupancy optimization, memory profiling
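
The online softmax listed above is the core trick: softmax can be computed in a single streaming pass by keeping only a running max and a running normalizer, so the full N×N score matrix never needs to be materialized. A minimal pure-Python sketch of the idea (an illustration, not the project's Triton code):

```python
import math

def online_softmax(scores):
    """One-pass (online) softmax over a stream of scores.

    Keeps only a running max `m` and running normalizer `d`,
    rescaling `d` whenever a new maximum arrives. This is the
    same rescaling FlashAttention applies per tile.
    """
    m = float("-inf")  # running max
    d = 0.0            # running sum of exp(x - m)
    for x in scores:
        m_new = max(m, x)
        # rescale the old normalizer to the new max, then add the new term
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in scores]

probs = online_softmax([1.0, 2.0, 3.0])
print(probs)  # agrees with the standard two-pass softmax
```

The same rescaling generalizes to blocks of scores, which is how FlashAttention processes K/V tiles one at a time.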

Project Stats

  • 2+ Core Triton Kernels
  • O(N) Attention Memory Complexity
  • 6 GPU Architectures Supported
  • 99% Memory Saved (Long Sequences)

Quick Start

```bash
# Install
pip install diy-flash-attention

# Or install from source
pip install -e ".[dev]"

# Verify
python -c "from kernels import flash_attention; print('✓ Installation successful')"
```

Run Example

```python
import torch
from kernels import flash_attention

# FlashAttention — 99% less memory for long sequences
q = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)

out = flash_attention(q, k, v, causal=True)  # GPT-style causal mask
print(f"Output shape: {out.shape}")  # [2, 8, 4096, 64]
```
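
If no GPU is at hand, you can still sanity-check the math the kernel must reproduce: the snippet below compares a naive causal attention against PyTorch's SDPA on CPU. This reference check is an illustration written for this page, not part of the project's API; it shows exactly what `causal=True` means.

```python
import torch
import torch.nn.functional as F

# Small CPU-friendly shapes; fp32 for a tight numerical comparison.
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# Naive attention: materialize scores, apply the causal mask, softmax, weight V.
scale = q.shape[-1] ** -0.5
scores = (q @ k.transpose(-2, -1)) * scale
mask = torch.triu(torch.ones(128, 128, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))  # no attending to the future
naive = scores.softmax(dim=-1) @ v

# PyTorch's fused reference with the same causal semantics.
sdpa = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(torch.allclose(naive, sdpa, atol=1e-5))
```

The Triton kernel's output should match the same reference up to fp16 tolerance.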

Learning Paths

  • 🧑‍💻 Kernel Developer: Start with the tutorial and understand FlashAttention line by line. Path: Tutorial → API → Performance
  • 🔬 Researcher: Quick API reference lookup; reproduce and modify the kernels. Path: API Reference → Source Code
  • 🚀 Performance Engineer: Deep dive into tuning; understand block sizes and architecture adaptation. Path: Performance Guide → Benchmarks
  • 📚 Learner: Systematic study of GPU programming and attention optimization. Path: Tutorial → Cheatsheet → FAQ
Start Your FlashAttention Journey
Tutorial for understanding, API for contracts, performance guide for evidence.

Forward-only educational Triton FlashAttention project · MIT License