
CuFlash-Attn: CUDA FlashAttention Reference

O(N) memory • FP32/FP16 • Forward/Backward • Archive-ready v0.3.0


Why CuFlash-Attn?

Choose this library when you want to understand FlashAttention internals, experiment with attention mechanisms, or integrate attention without heavy framework dependencies.

Quick Comparison

| Feature | CuFlash-Attn | PyTorch SDPA | FlashAttention-2 |
| --- | --- | --- | --- |
| Educational code | ✅ | ❌ | ⚠️ |
| No dependencies | ✅ | ❌ PyTorch | ❌ PyTorch |
| Python binding | ✅ ctypes | ✅ native | ✅ native |
| Training support | ✅ | ✅ | ✅ |
| Customizable | ✅ easy | ⚠️ hard | ⚠️ |

Quick Start

Get running in under 5 minutes:

```bash
git clone https://github.com/LessUp/cuflash-attn.git
cd cuflash-attn

# Configure and build with the CMake presets
cmake --preset release
cmake --build --preset release

# Run the test suite
ctest --preset release --output-on-failure
```
```cpp
#include "cuflash/flash_attention.h"

// d_* arguments are device pointers: Q, K, V inputs, O output, and L,
// the per-row softmax statistics kept for the backward pass.
auto err = cuflash::flash_attention_forward(
    d_Q, d_K, d_V, d_O, d_L,
    batch_size, num_heads, seq_len, head_dim,
    scale, true, stream
);
```
```python
import ctypes
lib = ctypes.CDLL("./build/release/libcuflash_attn.so")

# Call via C ABI
lib.cuflash_attention_forward_f32(
    q_ptr, k_ptr, v_ptr, o_ptr, l_ptr,
    B, H, N, D, scale, True, None
)
```
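
Note that ctypes cannot marshal a Python float such as `scale` until argument types are declared, so set up the signature before calling. A minimal sketch of that setup, assuming the parameter order shown above and an integer status return; the authoritative C ABI lives in the API Reference:

```python
import ctypes

lib = ctypes.CDLL("./build/release/libcuflash_attn.so")

# Assumed parameter order, mirroring the call above; verify against the API Reference.
lib.cuflash_attention_forward_f32.argtypes = [
    ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p,       # Q, K, V device pointers
    ctypes.c_void_p, ctypes.c_void_p,                        # O output, L softmax stats
    ctypes.c_int, ctypes.c_int, ctypes.c_int, ctypes.c_int,  # B, H, N, D
    ctypes.c_float,                                          # softmax scale
    ctypes.c_bool,                                           # boolean flag from the example
    ctypes.c_void_p,                                         # CUDA stream (None = default)
]
lib.cuflash_attention_forward_f32.restype = ctypes.c_int     # assumed status code
```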

Memory Efficiency

| Seq Length | Standard Attention | FlashAttention | Savings |
| --- | --- | --- | --- |
| 1,024 | 4 MB | 8 KB | 99.8% |
| 4,096 | 64 MB | 32 KB | 99.95% |
| 16,384 | 1 GB | 128 KB | 99.99% |
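
These figures follow from simple arithmetic. A sketch of the accounting, assuming standard attention materializes the full N×N FP32 score matrix per head while FlashAttention keeps only two FP32 statistics per row (running max and log-sum-exp):

```python
def standard_bytes(n, dtype_bytes=4):
    # Full N x N attention score matrix
    return n * n * dtype_bytes

def flash_bytes(n, dtype_bytes=4, stats_per_row=2):
    # Per-row softmax statistics only: running max + log-sum-exp
    return n * stats_per_row * dtype_bytes

for n in (1024, 4096, 16384):
    s, f = standard_bytes(n), flash_bytes(n)
    print(f"N={n}: {s / 2**20:g} MB vs {f / 2**10:g} KB "
          f"({100 * (1 - f / s):.2f}% saved)")
```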

Documentation

| Resource | Description |
| --- | --- |
| Quick Start Guide | Preset-based build path |
| Building from Source | Platforms, presets, overrides |
| API Reference | Complete C++ and C ABI docs |
| Algorithm Deep Dive | Tiling, online softmax, recomputation |
| Troubleshooting | Common issues and solutions |
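
The Algorithm Deep Dive covers the online softmax trick behind the O(N) footprint above: the softmax is computed in a single streaming pass, rescaling the running sums whenever a new maximum appears, so the full score matrix is never stored. A scalar sketch of the idea (illustrative only, not the library's kernel code):

```python
import math

def online_softmax_weighted_sum(scores, values):
    """One-pass softmax(scores) . values using O(1) extra state."""
    m = float("-inf")  # running max, for numerical stability
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running softmax-weighted sum of values
    for s, v in zip(scores, values):
        m_new = max(m, s)
        corr = math.exp(m - m_new)  # rescale old terms to the new max
        l = l * corr + math.exp(s - m_new)
        acc = acc * corr + math.exp(s - m_new) * v
        m = m_new
    return acc / l

# Agrees with the two-pass result: sum(softmax(s)_i * v_i)
print(online_softmax_weighted_sum([0.5, 2.0, -1.0], [1.0, 2.0, 3.0]))
```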

Project Status

Stable v0.3.0 baseline: an archive-ready reference implementation. Current focus: documentation quality, workflow simplification, and bug fixes.

See Project Status for maintenance posture and governance rules.

OpenSpec Specification

This project follows the OpenSpec methodology; the OpenSpec documents are the canonical requirements.

