
CuFlash-Attn: CUDA FlashAttention Reference

O(N) memory • FP32/FP16 • Forward/Backward • Archive-ready v0.3.0


Why CuFlash-Attn?

Choose this library when you want to understand FlashAttention internals, experiment with attention mechanisms, or integrate attention without heavy framework dependencies.

Quick Comparison

| Feature | CuFlash-Attn | PyTorch SDPA | FlashAttention-2 |
| --- | --- | --- | --- |
| Educational code | ✅ | ❌ | ⚠️ |
| No dependencies | ✅ | ❌ PyTorch | ❌ PyTorch |
| Python binding | ✅ ctypes | ✅ native | ✅ native |
| Training support | ✅ | ✅ | ✅ |
| Customizable | ✅ easy | ⚠️ hard | ⚠️ |

Quick Start

Get running in under 5 minutes:

```bash
git clone https://github.com/LessUp/cuflash-attn.git
cd cuflash-attn

# Configure and build with the CMake presets
cmake --preset release
cmake --build --preset release

# Run the test suite
ctest --preset release --output-on-failure
```
```cpp
#include "cuflash/flash_attention.h"

// d_* arguments are device pointers: Q, K, V inputs, O output, and L,
// the per-row softmax statistics kept for the backward pass.
auto err = cuflash::flash_attention_forward(
    d_Q, d_K, d_V, d_O, d_L,
    batch_size, num_heads, seq_len, head_dim,
    scale, true, stream
);
```
```python
import ctypes
lib = ctypes.CDLL("./build/release/libcuflash_attn.so")

# Call via C ABI
lib.cuflash_attention_forward_f32(
    q_ptr, k_ptr, v_ptr, o_ptr, l_ptr,
    B, H, N, D, scale, True, None
)
```
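
Note that ctypes cannot marshal a Python float such as `scale` until argument types are declared, so set up the signature before calling. A minimal sketch of that setup, assuming the parameter order shown above and an integer status return; the authoritative C ABI lives in the API Reference:

```python
import ctypes

lib = ctypes.CDLL("./build/release/libcuflash_attn.so")

# Assumed parameter order, mirroring the call above; verify against the API Reference.
lib.cuflash_attention_forward_f32.argtypes = [
    ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p,       # Q, K, V device pointers
    ctypes.c_void_p, ctypes.c_void_p,                        # O output, L softmax stats
    ctypes.c_int, ctypes.c_int, ctypes.c_int, ctypes.c_int,  # B, H, N, D
    ctypes.c_float,                                          # softmax scale
    ctypes.c_bool,                                           # boolean flag from the example
    ctypes.c_void_p,                                         # CUDA stream (None = default)
]
lib.cuflash_attention_forward_f32.restype = ctypes.c_int     # assumed status code
```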

Memory Efficiency

| Seq Length | Standard Attention | FlashAttention | Savings |
| --- | --- | --- | --- |
| 1,024 | 4 MB | 8 KB | 99.8% |
| 4,096 | 64 MB | 32 KB | 99.95% |
| 16,384 | 1 GB | 128 KB | 99.99% |
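
These figures follow from simple arithmetic. A sketch of the accounting, assuming standard attention materializes the full N×N FP32 score matrix per head while FlashAttention keeps only two FP32 statistics per row (running max and log-sum-exp):

```python
def standard_bytes(n, dtype_bytes=4):
    # Full N x N attention score matrix
    return n * n * dtype_bytes

def flash_bytes(n, dtype_bytes=4, stats_per_row=2):
    # Per-row softmax statistics only: running max + log-sum-exp
    return n * stats_per_row * dtype_bytes

for n in (1024, 4096, 16384):
    s, f = standard_bytes(n), flash_bytes(n)
    print(f"N={n}: {s / 2**20:g} MB vs {f / 2**10:g} KB "
          f"({100 * (1 - f / s):.2f}% saved)")
```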

Documentation

| Resource | Description |
| --- | --- |
| Quick Start Guide | Preset-based build path |
| Building from Source | Platforms, presets, overrides |
| API Reference | Complete C++ and C ABI docs |
| Algorithm Deep Dive | Tiling, online softmax, recomputation |
| Troubleshooting | Common issues and solutions |
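
The Algorithm Deep Dive covers the online softmax trick behind the O(N) footprint above: the softmax is computed in a single streaming pass, rescaling the running sums whenever a new maximum appears, so the full score matrix is never stored. A scalar sketch of the idea (illustrative only, not the library's kernel code):

```python
import math

def online_softmax_weighted_sum(scores, values):
    """One-pass softmax(scores) . values using O(1) extra state."""
    m = float("-inf")  # running max, for numerical stability
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running softmax-weighted sum of values
    for s, v in zip(scores, values):
        m_new = max(m, s)
        corr = math.exp(m - m_new)  # rescale old terms to the new max
        l = l * corr + math.exp(s - m_new)
        acc = acc * corr + math.exp(s - m_new) * v
        m = m_new
    return acc / l

# Agrees with the two-pass result: sum(softmax(s)_i * v_i)
print(online_softmax_weighted_sum([0.5, 2.0, -1.0], [1.0, 2.0, 3.0]))
```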

Project Status

Stable v0.3.0 baseline: an archive-ready reference implementation. Current focus: documentation quality, workflow simplification, and bug fixes.

See Project Status for maintenance posture and governance rules.

OpenSpec Specification

This project follows the OpenSpec methodology; the OpenSpec documents are the canonical requirements.

