## Linear Memory
Handle 16K+ token sequences in O(N) memory via FlashAttention tiling, using up to 99.99% less attention memory than standard attention at 16K tokens (see the table below).
O(N) memory • FP32/FP16 • Forward/Backward • Archive-ready v0.3.0
## Choose this library when
You want to understand FlashAttention internals, experiment with attention mechanisms, or integrate without heavy framework dependencies.
| Feature | CuFlash-Attn | PyTorch SDPA | FlashAttention-2 |
|---|---|---|---|
| Educational code | ✅ | ❌ | ⚠️ |
| No dependencies | ✅ | ❌ PyTorch | ❌ |
| Python binding | ✅ ctypes | ✅ native | ✅ |
| Training support | ✅ | ✅ | ✅ |
| Customizable | ✅ easy | ⚠️ hard | ⚠️ |
Get running in under 5 minutes:
```bash
git clone https://github.com/LessUp/cuflash-attn.git
cd cuflash-attn
cmake --preset release
cmake --build --preset release
ctest --preset release --output-on-failure
```

Minimal C++ usage:

```cpp
#include "cuflash/flash_attention.h"
// d_Q, d_K, d_V: device inputs; d_O: device output; d_L: per-row softmax
// statistics (used by the backward pass)
auto err = cuflash::flash_attention_forward(
    d_Q, d_K, d_V, d_O, d_L,
    batch_size, num_heads, seq_len, head_dim,
    scale, true, stream
);
```
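For a fuller picture, the call can be sketched end to end. This is a hedged sketch, not documented usage: the contiguous (B, H, N, D) layout, the size of `d_L`, and the error type's convertibility to `int` are all assumptions here; check the API Reference for the real contract.

```cpp
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>
#include "cuflash/flash_attention.h"

int main() {
    // Hypothetical shapes; a contiguous (B, H, N, D) layout is assumed.
    const int B = 1, H = 8, N = 4096, D = 64;
    const float scale = 1.0f / std::sqrt(static_cast<float>(D));
    const size_t qkv_bytes = static_cast<size_t>(B) * H * N * D * sizeof(float);
    const size_t l_bytes   = static_cast<size_t>(B) * H * N * sizeof(float);

    float *d_Q, *d_K, *d_V, *d_O, *d_L;
    cudaMalloc(&d_Q, qkv_bytes);
    cudaMalloc(&d_K, qkv_bytes);
    cudaMalloc(&d_V, qkv_bytes);
    cudaMalloc(&d_O, qkv_bytes);
    cudaMalloc(&d_L, l_bytes);  // per-row softmax statistics (size assumed)
    // ... copy real Q/K/V data in with cudaMemcpy ...

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    auto err = cuflash::flash_attention_forward(
        d_Q, d_K, d_V, d_O, d_L,
        B, H, N, D, scale, true, stream);
    cudaStreamSynchronize(stream);
    std::printf("forward returned %d\n", static_cast<int>(err));

    cudaFree(d_Q); cudaFree(d_K); cudaFree(d_V); cudaFree(d_O); cudaFree(d_L);
    cudaStreamDestroy(stream);
    return 0;
}
```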
From Python, the same entry point is reachable through the C ABI:

```python
import ctypes

lib = ctypes.CDLL("./build/release/libcuflash_attn.so")

# argtypes are assumed here (see the API Reference for the exact signature);
# without them ctypes cannot marshal the float `scale` safely.
lib.cuflash_attention_forward_f32.argtypes = (
    [ctypes.c_void_p] * 5                                # q, k, v, o, l pointers
    + [ctypes.c_int] * 4                                 # B, H, N, D
    + [ctypes.c_float, ctypes.c_bool, ctypes.c_void_p]   # scale, flag, stream
)

# Call via C ABI (q_ptr..l_ptr are raw device pointers obtained elsewhere)
lib.cuflash_attention_forward_f32(
    q_ptr, k_ptr, v_ptr, o_ptr, l_ptr,
    B, H, N, D, scale, True, None
)
```

Memory savings at a glance:

| Seq Length | Standard Attention | FlashAttention | Savings |
|---|---|---|---|
| 1,024 | 4 MB | 8 KB | 99.8% |
| 4,096 | 64 MB | 32 KB | 99.95% |
| 16,384 | 1 GB | 128 KB | 99.99% |
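These numbers are consistent with fp32 and a single attention head: standard attention materializes the full N × N score matrix, while FlashAttention keeps only two per-row softmax statistics. A back-of-envelope check (the two-floats-per-row accounting is our assumption, not a documented layout):

```cpp
#include <cstdio>

int main() {
    // Reproduces the table above assuming fp32, one head, and two
    // per-row softmax statistics (running max and sum) for FlashAttention.
    for (long long n : {1024LL, 4096LL, 16384LL}) {
        double standard = double(n) * n * 4;  // N x N score matrix, 4 bytes each
        double flash = double(n) * 2 * 4;     // 2 floats per row (assumption)
        std::printf("N=%6lld  standard=%8.1f MB  flash=%7.1f KB  savings=%.2f%%\n",
                    n, standard / (1 << 20), flash / (1 << 10),
                    100.0 * (1.0 - flash / standard));
    }
    return 0;
}
```

Running it reproduces all three rows, including the 99.80%, 99.95%, and 99.99% savings figures.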
Documentation:

| Resource | Description |
|---|---|
| Quick Start Guide | Preset-based build path |
| Building from Source | Platforms, presets, overrides |
| API Reference | Complete C++ and C ABI docs |
| Algorithm Deep Dive | Tiling, online softmax, recomputation |
| Troubleshooting | Common issues and solutions |
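As a taste of what the Algorithm Deep Dive covers, the online softmax that makes tiling possible can be captured in a few lines: each block contributes a local max and sum, and a running pair (m, s) is merged block by block, so the full score row never has to be resident. A minimal illustrative sketch, not the library's kernel:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Merge softmax statistics from a new block into the running pair:
// `m` is the running row max, `s` the exponential sum rescaled to it.
static void merge(float& m, float& s, float m_blk, float s_blk) {
    float m_new = std::max(m, m_blk);
    s = s * std::exp(m - m_new) + s_blk * std::exp(m_blk - m_new);
    m = m_new;
}

int main() {
    std::vector<float> scores = {0.5f, 2.0f, -1.0f, 3.0f, 1.5f, 0.0f};
    float m = -INFINITY, s = 0.0f;
    // Process the row in blocks of 2, never holding all scores at once.
    for (size_t i = 0; i < scores.size(); i += 2) {
        float m_blk = std::max(scores[i], scores[i + 1]);
        float s_blk = std::exp(scores[i] - m_blk) + std::exp(scores[i + 1] - m_blk);
        merge(m, s, m_blk, s_blk);
    }
    // softmax(x_j) = exp(x_j - m) / s, identical to the one-pass result.
    std::printf("logsumexp = %f\n", m + std::log(s));
    return 0;
}
```

The same merge rule is what lets FlashAttention process K/V tiles sequentially while producing exact softmax results.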
Stable v0.3.0 baseline: an archive-ready reference implementation. Current focus: documentation quality, workflow simplification, and bug fixes.
See Project Status for maintenance posture and governance rules.
This project follows the OpenSpec methodology. Canonical requirements: