Read Real Triton Kernels
Not toy examples — actual matmul and FlashAttention kernels you can run, benchmark, and study line by line. Compact code, detailed comments.
Build FlashAttention from scratch with Triton — master GPU kernel optimization
Compact but Real: Code small enough to read end-to-end, but not a toy. What you'll learn:
| Topic | Takeaway |
|---|---|
| GPU Memory Hierarchy | Data flow: HBM → L2 → SRAM → Registers |
| Triton Programming | Auto-tiling, autotune, kernel optimization techniques |
| FlashAttention Algorithm | Online softmax (sketched below), causal masking, variable-length sequences |
| Performance Tuning | Block size selection, occupancy optimization, memory profiling |
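The "online softmax" entry is the heart of FlashAttention: the softmax normalizer and the output are accumulated block by block, so the full attention matrix never has to live in memory. Below is a minimal sketch of that accumulation in plain PyTorch (not the repo's Triton kernel; the block size and function name are illustrative):

```python
import torch

def online_softmax_attention(q, k, v, block_size=128):
    """Single-head attention that streams over K/V blocks, keeping only
    running statistics instead of the full [seq, seq] score matrix."""
    seq_len, dim = q.shape
    scale = dim ** -0.5
    m = torch.full((seq_len, 1), float("-inf"))   # running row-wise max
    l = torch.zeros(seq_len, 1)                   # running softmax denominator
    acc = torch.zeros(seq_len, dim)               # running unnormalized output
    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        s = (q @ k_blk.T) * scale                 # scores for this block only
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)                  # block probabilities, unnormalized
        correction = torch.exp(m - m_new)         # rescale old stats to the new max
        l = l * correction + p.sum(dim=-1, keepdim=True)
        acc = acc * correction + p @ v_blk
        m = m_new
    return acc / l                                # normalize once at the end

# Agrees with ordinary two-pass softmax attention.
q, k, v = torch.randn(3, 256, 64).unbind(0)
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(online_softmax_attention(q, k, v), ref, atol=1e-5)
```

The Triton kernels do the same bookkeeping per tile, keeping the running statistics in registers/SRAM while K/V tiles stream in from HBM.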
```bash
# Install
pip install diy-flash-attention
# Or install from source
pip install -e ".[dev]"
# Verify
python -c "from kernels import flash_attention; print('✓ Installation successful')"import torch
from kernels import flash_attention
# FlashAttention: O(N) memory instead of O(N^2), since the full attention matrix is never materialized
q = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)
out = flash_attention(q, k, v, causal=True) # GPT-style causal mask
print(f"Output shape: {out.shape}") # [2, 8, 4096, 64]