Build Guide
Complete guide for building CuFlash-Attn from source.
Table of Contents
- Prerequisites
- Quick Start
- CMake Presets
- Manual Build
- Build Options
- Running Tests
- GPU Architecture Configuration
- Cross-Platform Notes
- Troubleshooting
Prerequisites
| Dependency | Minimum Version | Notes |
|---|---|---|
| CUDA Toolkit | 11.0 | Includes nvcc compiler and CUDA libraries |
| CMake | 3.18 | Build system generator; the --preset build and test commands in this guide need CMake 3.20 or newer |
| C++ Compiler | C++17 | GCC 7+, Clang 5+, MSVC 2017+ |
| Python (optional) | 3.8+ | For PyTorch comparison tests |
| PyTorch (optional) | 2.0+ | For verification tests |
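The minimum versions in the table can also be checked from a script. The following is a minimal sketch, assuming GNU `sort -V` is available; the helper name `version_ge` is hypothetical and not part of the project, and the CMake check only runs when `cmake` is on the PATH.

```shell
# version_ge ACTUAL MINIMUM
# Succeeds when ACTUAL is at least MINIMUM. Dotted versions are compared
# with `sort -V` (GNU coreutils; assumed to be installed).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example: warn when the installed CMake is older than the documented minimum.
if command -v cmake >/dev/null 2>&1; then
    # The first line of the output looks like "cmake version 3.22.1"
    cmake_version=$(cmake --version | head -n 1 | awk '{print $3}')
    version_ge "$cmake_version" 3.18 || echo "CMake $cmake_version is older than 3.18"
fi
```

The same helper can be applied to the CUDA release number, for example the value after "release" in the output of `nvcc --version`.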
Verifying Prerequisites
# Check CUDA version
nvcc --version
# Check CMake version
cmake --version
# Check C++ compiler version (Linux)
g++ --version
Quick Start
# Clone the repository
git clone https://github.com/LessUp/cuflash-attn.git
cd cuflash-attn
# Build using preset (recommended)
cmake --preset release
cmake --build --preset release
# Run tests
ctest --preset release --output-on-failure
CMake Presets
The project provides predefined CMake presets for common build configurations:
| Preset | Build Type | Tests | Purpose |
|---|---|---|---|
| default | Debug | ✅ | Development and debugging |
| release | Release | ✅ | Production use |
| release-fast-math | Release | ✅ | Maximum performance (reduced precision) |
| minimal | Release | ❌ | Minimal build size |
Preset Build Commands
# Debug build (with tests and examples)
cmake --preset default
cmake --build --preset default
# Optimized release build
cmake --preset release
cmake --build --preset release
# Release with fast math optimizations
cmake --preset release-fast-math
cmake --build --preset release-fast-math
# Minimal build (no tests, no examples)
cmake --preset minimal
cmake --build --preset minimal
Custom Preset Overrides
For custom configurations, keep using presets and override cache variables explicitly:
cmake --preset release -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build --preset release -j$(nproc)
Specifying CUDA Path
If CMake cannot find CUDA, specify the path explicitly:
Linux/macOS:
cmake --preset release \
-DCUDAToolkit_ROOT=/usr/local/cuda \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
Windows (PowerShell):
cmake --preset release `
-DCUDAToolkit_ROOT="$env:CUDA_PATH" `
-DCMAKE_CUDA_COMPILER="$env:CUDA_PATH\bin\nvcc.exe"
Cross-Platform Library Extensions
| Platform | Shared Library Extension | Static Library Extension |
|---|---|---|
| Linux | .so | .a |
| macOS | .dylib | .a |
| Windows | .dll | .lib |
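Scripts that need to locate the built library can derive the file name from the platform. This is a sketch only: the Linux name `libcuflash_attn.so` appears elsewhere in this guide, while the macOS and Windows names are assumptions based on the usual CMake naming defaults, and `shared_lib_name` is a hypothetical helper.

```shell
# shared_lib_name PLATFORM
# PLATFORM is the output of `uname -s`. Prints the expected shared library
# file name, or fails for a platform that is not in the table.
shared_lib_name() {
    case "$1" in
        Linux*)               echo "libcuflash_attn.so" ;;
        Darwin*)              echo "libcuflash_attn.dylib" ;;  # assumed name
        MINGW*|MSYS*|CYGWIN*) echo "cuflash_attn.dll" ;;       # assumed name
        *)                    return 1 ;;
    esac
}
```

A typical call is `shared_lib_name "$(uname -s)"`.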
Build Options
| CMake Option | Default | Description |
|---|---|---|
| BUILD_TESTS | ON | Build GoogleTest test suite |
| ENABLE_RAPIDCHECK | OFF | Enable RapidCheck property-based tests |
| BUILD_SHARED_LIBS | ON | Build as shared library (*.so/.dll/.dylib) |
| BUILD_EXAMPLES | ON | Build example programs |
| ENABLE_FAST_MATH | OFF | Enable --use_fast_math compiler flag |
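To confirm which values a configured build tree actually uses, the options can be read back from `CMakeCache.txt`, where CMake stores each entry as `NAME:TYPE=VALUE`. The helper below is a hypothetical sketch, and the path in the usage note assumes the presets place their build trees under `build/<preset>/`.

```shell
# cache_value CACHE_FILE VARIABLE
# Prints the value of VARIABLE from a CMakeCache.txt file.
# Cache entries have the form NAME:TYPE=VALUE, so everything up to the
# first "=" after the variable name is stripped.
cache_value() {
    sed -n "s/^$2:[^=]*=//p" "$1"
}
```

For example, `cache_value build/release/CMakeCache.txt ENABLE_FAST_MATH` would print `ON` or `OFF` for a configured release tree.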
ENABLE_FAST_MATH
Enables aggressive mathematical optimizations that trade precision for speed:
cmake --preset release-fast-math
cmake --build --preset release-fast-math
Effects:
- Faster expf() and logf() operations
- Slightly reduced numerical precision
- Generally acceptable for deep learning training
Example Configurations
# High-performance release build
cmake --preset release-fast-math \
-DBUILD_SHARED_LIBS=OFF
cmake --build --preset release-fast-math
# Debug build with all tests
cmake --preset default \
-DENABLE_RAPIDCHECK=ON
cmake --build --preset default
# Static library only
cmake --preset minimal \
-DBUILD_SHARED_LIBS=OFF
cmake --build --preset minimal
Running Tests
CTest (Recommended)
# Run all tests with preset
ctest --preset release --output-on-failure
# Run specific test
ctest --preset release -R ForwardTest
# Run with verbose output
ctest --preset release -V
GoogleTest Direct
Tests are located in build/<preset>/tests/:
# Run all tests
./build/release/tests/cuflash_attn_tests
# Run specific test suite
./build/release/tests/cuflash_attn_tests --gtest_filter="ForwardTest*"
# List all available tests
./build/release/tests/cuflash_attn_tests --gtest_list_tests
PyTorch Comparison Tests
Verify numerical correctness against PyTorch reference:
# Build shared library first (configure, then build)
cmake --preset release
cmake --build --preset release
# Run comparison tests
python tests/test_pytorch_comparison.py
Library Path Resolution:
- Environment variable CUFLASH_LIB
- build/default/ directory
- build/release/ directory
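The lookup listed above can be illustrated with a small shell function. This is a sketch under stated assumptions, not the logic of the Python test script itself: it assumes the entries are tried in the order listed, that the library file is named `libcuflash_attn.so`, and that it sits directly in the preset's build directory. `resolve_cuflash_lib` is a hypothetical name.

```shell
# resolve_cuflash_lib [PROJECT_ROOT]
# Prints the library path that would be used, trying in order:
#   1. the CUFLASH_LIB environment variable
#   2. PROJECT_ROOT/build/default/
#   3. PROJECT_ROOT/build/release/
# Fails when none of them yields a library.
resolve_cuflash_lib() {
    root="${1:-.}"
    if [ -n "${CUFLASH_LIB:-}" ]; then
        echo "$CUFLASH_LIB"
        return 0
    fi
    for dir in "$root/build/default" "$root/build/release"; do
        if [ -f "$dir/libcuflash_attn.so" ]; then
            echo "$dir/libcuflash_attn.so"
            return 0
        fi
    done
    return 1
}
```

Running `resolve_cuflash_lib .` from the repository root before the Python tests shows which build would be picked up.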
Custom Library Path:
CUFLASH_LIB=/path/to/libcuflash_attn.so python tests/test_pytorch_comparison.py
GPU Architecture Configuration
Default Supported Architectures
| Compute Capability | Architecture | Representative GPUs |
|---|---|---|
| sm_70 | Volta | V100 |
| sm_75 | Turing | RTX 2080 Ti |
| sm_80 | Ampere | A100 |
| sm_86 | Ampere | RTX 3090 |
| sm_89 | Ada Lovelace | RTX 4090 |
| sm_90 | Hopper | H100 |
Targeting Specific Architectures
# Single architecture (faster compilation)
cmake --preset release -DCMAKE_CUDA_ARCHITECTURES=86 # RTX 3090 (A100 is sm_80)
# Multiple architectures
cmake --preset release -DCMAKE_CUDA_ARCHITECTURES="80;86;89"
# Virtual architecture only (PTX, JIT-compiled by the driver at load time)
cmake --preset release -DCMAKE_CUDA_ARCHITECTURES="80-virtual"
Architecture Selection Guide
| Use Case | Recommended Configuration | Reason |
|---|---|---|
| Development | -DCMAKE_CUDA_ARCHITECTURES=86 | Faster compilation |
| A100 Cluster | -DCMAKE_CUDA_ARCHITECTURES=80 | Optimal for target |
| H100 Cluster | -DCMAKE_CUDA_ARCHITECTURES=90 | Optimal for target |
| Public Release | Default (all architectures) | Maximum compatibility |
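For the single-architecture rows in the table, the value can be derived from the local GPU instead of being typed by hand. The sketch below assumes a driver recent enough for `nvidia-smi` to support the `compute_cap` query field; `cap_to_arch` is a hypothetical helper, and the detection step only runs when `nvidia-smi` is on the PATH.

```shell
# cap_to_arch CAPABILITY
# Converts a compute capability such as "8.6" into the value expected by
# CMAKE_CUDA_ARCHITECTURES ("86") by dropping the dot.
cap_to_arch() {
    echo "$1" | tr -d '.'
}

# Example: print a configure command for the first GPU in the machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
    echo "cmake --preset release -DCMAKE_CUDA_ARCHITECTURES=$(cap_to_arch "$cap")"
fi
```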
Checking Your GPU Architecture
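# Using nvidia-smi (lists the installed GPUs)
nvidia-smi
# Recent drivers can also print the compute capability directly
nvidia-smi --query-gpu=name,compute_cap --format=csv
# Using deviceQuery (CUDA sample; older toolkits bundle it at this path)
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
# Using nvcc: build and run an empty kernel for a candidate architecture (sm_70 here);
# the program returns nonzero if the kernel launch fails
printf '__global__ void k(){}\nint main(){k<<<1,1>>>(); return cudaGetLastError() != cudaSuccess;}\n' > arch_check.cu
nvcc -arch=sm_70 --run arch_check.cu
Cross-Platform Notes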
Linux
Standard workflow works out of the box:
sudo apt-get install cmake g++ # Ubuntu/Debian
cmake --preset release
cmake --build --preset release -j$(nproc)
macOS
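CUDA support on macOS is limited to older versions: CUDA Toolkit 10.2 was the last release that supports developing and running CUDA applications on macOS, which is below the CUDA 11.0 minimum listed in Prerequisites. NVIDIA no longer provides macOS drivers for newer GPUs.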
Windows
Using Visual Studio 2019/2022:
cmake --preset release
cmake --build --preset release
Using Ninja (faster):
cmake --preset release
cmake --build --preset release
Common Windows Issues:
- Ensure CUDA bin directory is in PATH
- Use x64 Native Tools Command Prompt for Visual Studio
Docker
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y cmake g++ git
WORKDIR /workspace
COPY . .
RUN cmake --preset release && \
cmake --build --preset release
Run with:
docker build -t cuflash-attn .
docker run --gpus all cuflash-attn ./build/release/tests/cuflash_attn_tests
Troubleshooting
CMake Cannot Find CUDA
# Explicitly set CUDA paths
cmake --preset release \
-DCUDAToolkit_ROOT=/usr/local/cuda \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
Compilation Errors with Unknown Architecture
Error: Unknown CUDA architecture sm_XX
Solution: Update CUDA toolkit or specify supported architecture:
cmake --preset release -DCMAKE_CUDA_ARCHITECTURES="70;75;80"
Link Errors for Shared Library
Ensure LD_LIBRARY_PATH includes the build directory:
export LD_LIBRARY_PATH=$PWD/build/release:$LD_LIBRARY_PATH
Out of Memory During Build
Reduce parallel jobs:
cmake --build --preset release -j2 # Use only 2 parallel jobs
Test Failures on GPU-less Systems
The CI workflow is configured to run only format checks when no GPU is available. To run tests locally, ensure you have a CUDA-capable GPU.
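A local script can apply the same idea and skip the GPU tests when no device is usable. This is a minimal sketch and not the project's CI logic: it treats a successful run of `nvidia-smi` as the signal that a GPU is present, and both function names are hypothetical. The probe command is a parameter so the check can be exercised on machines without a GPU.

```shell
# gpu_available [PROBE]
# Succeeds when the probe command (default: nvidia-smi) exists and exits
# with status 0. nvidia-smi fails when it cannot reach the driver.
gpu_available() {
    probe="${1:-nvidia-smi}"
    command -v "$probe" >/dev/null 2>&1 && "$probe" >/dev/null 2>&1
}

# run_gpu_tests [PROBE]
# Runs the release test preset only when a GPU is usable.
run_gpu_tests() {
    if gpu_available "$@"; then
        ctest --preset release --output-on-failure
    else
        echo "No usable GPU detected; skipping GPU tests"
    fi
}
```

Calling `run_gpu_tests` with no arguments uses `nvidia-smi` as the probe.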
Next Steps
- Read the API Reference for usage examples
- Explore the Algorithm Documentation for implementation details
- Check Troubleshooting for common issues