# Troubleshooting Guide
Common issues and solutions when building and using CuFlash-Attn.
## Table of Contents

- [Build Issues](#build-issues)
- [Runtime Errors](#runtime-errors)
- [Performance Issues](#performance-issues)
- [Numerical Accuracy](#numerical-accuracy)
- [Error Code Reference](#error-code-reference)
- [Getting Help](#getting-help)
## Build Issues

### CMake Cannot Find CUDA

**Symptoms:**

```
CMake Error: Could not find CUDA
```

**Solutions:**

1. **Verify CUDA Installation:**

   ```bash
   nvcc --version
   nvidia-smi
   ```

2. **Explicitly Set CUDA Paths:**

   ```bash
   cmake --preset release \
     -DCUDAToolkit_ROOT=/usr/local/cuda \
     -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
   ```

   **Common Path Locations:**

   | OS | Default CUDA Path |
   |---|---|
   | Linux | `/usr/local/cuda` |
   | Windows | `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8` |

3. **Set Environment Variables:**

   ```bash
   export PATH=/usr/local/cuda/bin:$PATH
   export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
   ```
### Unsupported GPU Architecture

**Symptoms:**

```
nvcc fatal : Unsupported GPU architecture 'sm_89'
```

**Cause:** The CUDA toolkit version doesn't support your GPU architecture.

**Solutions:**

1. **Update the CUDA Toolkit** (recommended for newer GPUs)

2. **Target a Compatible Architecture:**

   ```bash
   # Check supported architectures
   nvcc --help | grep "gpu-architecture"

   # Build for a supported architecture
   cmake --preset release -DCMAKE_CUDA_ARCHITECTURES=80   # Adjust to your CUDA version
   ```

   **Compatibility Matrix:**

   | Architecture | Minimum CUDA Version |
   |---|---|
   | sm_70 (V100) | CUDA 9.0 |
   | sm_75 (Turing) | CUDA 10.0 |
   | sm_80 (A100) | CUDA 11.0 |
   | sm_86 (RTX 3090) | CUDA 11.1 |
   | sm_89 (RTX 4090) | CUDA 11.8 |
   | sm_90 (H100) | CUDA 12.0 |
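If you are unsure which `sm_XX` your GPU corresponds to, the CUDA runtime can report it directly. This is plain CUDA runtime usage, independent of CuFlash-Attn; compile the sketch below with `nvcc` and run it:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Print the compute capability (sm_XY) of every visible GPU.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("GPU %d: %s (sm_%d%d)\n", i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```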
### Out of Memory During Build

**Symptoms:**

```
nvcc fatal : Memory allocation failure
```

**Solutions:**

1. **Reduce Parallel Jobs:**

   ```bash
   cmake --build --preset release -j2   # Use only 2 parallel jobs
   ```

2. **Reduce Target Architectures:**

   ```bash
   # Build for a single architecture
   cmake --preset release -DCMAKE_CUDA_ARCHITECTURES=86
   ```

3. **Close Other Applications:** Free up system memory.
### Linker Errors

**Symptoms:**

```
undefined reference to `cuflash::flash_attention_forward'
```

**Solutions:**

1. **Verify the Library Build:**

   ```bash
   ls -la build/release/libcuflash_attn*
   ```

2. **Set the Library Path:**

   ```bash
   export LD_LIBRARY_PATH=$PWD/build/release:$LD_LIBRARY_PATH
   ```

3. **Link Correctly in Your Project:**

   ```cmake
   target_link_libraries(your_target cuflash_attn)
   ```
## Runtime Errors

### CUDA Out of Memory

**Error Code:** `FlashAttentionError::OUT_OF_MEMORY`

**Symptoms:**

```
CUDA error: out of memory
```

**Solutions:**

1. **Check Available Memory:**

   ```bash
   nvidia-smi
   ```

2. **Reduce Batch Size or Sequence Length:**

   | Configuration | Memory Impact |
   |---|---|
   | batch_size | Linear |
   | seq_len | Linear |
   | num_heads | Linear |
   | head_dim | Fixed (32/64/128) |

3. **Memory Estimation Formulas** (see the worked example after this list):

   ```
   Forward:  ~4 × batch × heads × seq_len × head_dim   (bytes for FP32)
   Backward: ~6 × batch × heads × seq_len × head_dim   (bytes for FP32)
   ```

4. **Free GPU Memory Before Running:**

   ```python
   import torch
   torch.cuda.empty_cache()
   ```
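As a quick sanity check before allocating tensors, the estimation formulas above can be evaluated on the host. The sketch below simply plugs example dimensions (batch=8, heads=12, seq_len=2048, head_dim=64, chosen here for illustration) into the ~4×/~6× formulas; actual usage may differ depending on workspace overhead.

```cpp
#include <cstdint>
#include <cstdio>

// Rough FP32 memory estimate using the formulas above; the multipliers
// come from this guide and are approximate.
int main() {
    const std::int64_t batch = 8, heads = 12, seq_len = 2048, head_dim = 64;
    const std::int64_t elems = batch * heads * seq_len * head_dim;

    const double forward_bytes  = 4.0 * static_cast<double>(elems);  // ~4 × batch × heads × seq_len × head_dim
    const double backward_bytes = 6.0 * static_cast<double>(elems);  // ~6 × batch × heads × seq_len × head_dim

    std::printf("Forward : ~%.1f MiB\n", forward_bytes  / (1024.0 * 1024.0));
    std::printf("Backward: ~%.1f MiB\n", backward_bytes / (1024.0 * 1024.0));
    return 0;
}
```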
### Unsupported Head Dimension

**Error Code:** `FlashAttentionError::UNSUPPORTED_HEAD_DIM`

**Valid Values:** 32, 64, 128

**Solutions:**

1. **Check Your Configuration:**

   ```cpp
   if (head_dim != 32 && head_dim != 64 && head_dim != 128) {
       // Not supported
   }
   ```

2. **Workarounds** (see the padding sketch after this list):
   - Pad to the nearest supported dimension
   - Use multiple calls for heads of different sizes
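A minimal sketch of the padding workaround, assuming you control how the host-side Q/K/V buffers are laid out. `pad_head_dim` and `next_supported_head_dim` are hypothetical helpers written for this guide, not part of the CuFlash-Attn API.

```cpp
#include <vector>
#include <cstddef>

// Hypothetical helper: round head_dim up to the nearest supported value.
inline int next_supported_head_dim(int head_dim) {
    if (head_dim <= 32) return 32;
    if (head_dim <= 64) return 64;
    return 128;  // anything above 64 is padded to 128
}

// Zero-pad the last dimension of a [rows, head_dim] buffer to padded_dim.
// Zero columns add nothing to the QK^T dot products, so attention weights
// are unchanged; padded output columns can simply be discarded.
std::vector<float> pad_head_dim(const std::vector<float>& src,
                                std::size_t rows, int head_dim, int padded_dim) {
    std::vector<float> dst(rows * static_cast<std::size_t>(padded_dim), 0.0f);
    for (std::size_t r = 0; r < rows; ++r)
        for (int d = 0; d < head_dim; ++d)
            dst[r * padded_dim + d] = src[r * head_dim + d];
    return dst;
}
```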
### Invalid Dimension Parameters

**Error Code:** `FlashAttentionError::INVALID_DIMENSION`

**Cause:** batch_size, num_heads, seq_len, or head_dim ≤ 0

**Solution:** Verify all dimension parameters are positive integers.
### Null Pointer

**Error Code:** `FlashAttentionError::NULL_POINTER`

**Cause:** One or more input/output pointers are `nullptr`

**Solution:** Verify all tensor pointers are properly allocated:

```cpp
cudaMalloc(&d_Q, batch_size * num_heads * seq_len * head_dim * sizeof(float));
// ... allocate all required tensors
```

### CUDA Runtime Errors
**Error Code:** `FlashAttentionError::CUDA_ERROR`

**Common Causes:**

1. **Invalid Memory Access:**
   - Pointers not allocated on device memory
   - Memory corruption from other operations

2. **Kernel Launch Failure:**
   - Too many threads or blocks for the GPU
   - Resource conflicts

**Debug Steps:**

```bash
# Enable CUDA error checking
export CUDA_LAUNCH_BLOCKING=1

# Run with cuda-memcheck
cuda-memcheck ./your_program

# Use compute-sanitizer (CUDA 11+)
compute-sanitizer ./your_program
```
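In your own host code, a small error-checking wrapper around CUDA runtime calls makes the failing call obvious. This is a common CUDA pattern, not something CuFlash-Attn provides:

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Abort with file/line information when a CUDA runtime call fails.
#define CHECK_CUDA(call)                                                      \
    do {                                                                      \
        cudaError_t err_ = (call);                                            \
        if (err_ != cudaSuccess) {                                            \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",                 \
                         cudaGetErrorString(err_), __FILE__, __LINE__);       \
            std::exit(EXIT_FAILURE);                                          \
        }                                                                     \
    } while (0)

// Example usage:
//   CHECK_CUDA(cudaMalloc(&d_Q, bytes));
//   CHECK_CUDA(cudaDeviceSynchronize());  // surfaces asynchronous kernel errors
```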
## Performance Issues

### Slower Than Expected

**Diagnostic Steps:**

1. **Verify GPU Utilization:**

   ```bash
   nvidia-smi -l 1   # Monitor GPU usage
   ```

2. **Check Architecture Match:**
   - Ensure the binary is compiled for your GPU architecture
   - Rebuild with the correct `CMAKE_CUDA_ARCHITECTURES`

3. **Enable Fast Math** (if precision allows):

   ```bash
   cmake --preset release-fast-math
   cmake --build --preset release-fast-math
   ```

4. **Profile Kernel Execution** (a host-side timing sketch follows this list):

   ```bash
   # Use Nsight Compute
   ncu ./your_program

   # Use Nsight Systems
   nsys profile ./your_program
   ```
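For a quick first measurement without a profiler, CUDA events can time a region of GPU work. The sketch below is generic CUDA runtime usage; `run_attention` is a placeholder for whatever call you want to benchmark, not a CuFlash-Attn function.

```cpp
#include <cuda_runtime.h>

// Time a region of GPU work with CUDA events; returns milliseconds.
float time_gpu_region(void (*run_attention)()) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    run_attention();             // placeholder for the call being measured
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait for the GPU work to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```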
### High Memory Usage

**Note:** This is expected for `head_dim=128` due to larger shared memory requirements.

**Optimization Options:**

1. **Use FP16 Instead of FP32** (see the conversion sketch below):
   - Halves memory usage
   - Minimal accuracy impact for most applications

2. **Reduce Block Size** (if customized):
   - Smaller blocks → less shared memory per block
   - May impact performance
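If your data starts in FP32, converting it to FP16 on the host before upload is straightforward. The sketch below uses the standard `__float2half` conversion from `cuda_fp16.h` (compile with `nvcc` on a reasonably recent toolkit) and is not tied to any CuFlash-Attn API.

```cpp
#include <cuda_fp16.h>
#include <vector>
#include <cstddef>

// Convert an FP32 buffer to FP16 on the host, halving its size in bytes.
std::vector<__half> to_half(const std::vector<float>& src) {
    std::vector<__half> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = __float2half(src[i]);  // round-to-nearest conversion
    return dst;
}
```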
## Numerical Accuracy

### Results Differ from PyTorch

**Expected Differences:**

| Aspect | Expected Behavior |
|---|---|
| Small numerical differences | ±1e-5 for FP32, ±1e-3 for FP16 |
| FP16 accumulation | Higher variance due to rounding |

**Diagnostic Steps:**

1. **Check the Scale Factor:**

   ```cpp
   float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));  // Should match PyTorch default
   ```

2. **Verify Causal Masking:**
   - Ensure the `causal` parameter matches between implementations

3. **Use the Same Data Type** (a tolerance-comparison sketch follows this list):
   - Compare FP32 to FP32
   - Compare FP16 to FP16
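To check whether an observed difference is within the expected ranges above, compare element-wise against a reference output on the host. The helper below is written for this guide; the tolerances mirror the table (≈1e-5 for FP32, ≈1e-3 for FP16), and you would adjust them for your data type.

```cpp
#include <cmath>
#include <cstddef>

// Return true if every element of `out` is within `tol` of the reference.
bool all_close(const float* out, const float* ref, std::size_t n, float tol) {
    for (std::size_t i = 0; i < n; ++i)
        if (std::fabs(out[i] - ref[i]) > tol)
            return false;
    return true;
}
```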
### INF/NaN in Output

**Causes:**

1. **Input Contains INF/NaN** (a C++ counterpart of this check follows the list):

   ```python
   import torch
   assert not torch.isnan(Q).any()
   assert not torch.isinf(Q).any()
   ```

2. **Extreme QK Values:**
   - Enable causal masking if the sequence is autoregressive
   - Check the scale factor calculation

3. **FP16 Overflow:**
   - Use FP32 for problematic inputs
   - Enable gradient clipping in training
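If you are calling the C++ API directly rather than going through PyTorch, the same sanity check can be run on a host copy of the input buffer. This helper is a sketch written for this guide, not library code.

```cpp
#include <cmath>
#include <cstddef>

// Host-side counterpart of the Python check above: scan a buffer for NaN/Inf.
bool has_nan_or_inf(const float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        if (std::isnan(data[i]) || std::isinf(data[i]))
            return true;
    return false;
}
```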
### Gradient Mismatch in Backward Pass

**Common Causes:**

1. **Missing Logsumexp (L):**
   - You must pass L from the forward pass to the backward pass
   - L must come from the same forward call

2. **Incorrect Gradient Flow:**
   - dO should be the gradients w.r.t. O (the forward output)
   - dQ, dK, dV are output parameters
## Error Code Reference

| Error Code | Value | Meaning | Resolution |
|---|---|---|---|
| SUCCESS | 0 | Operation successful | N/A |
| INVALID_DIMENSION | 1 | Dimension ≤ 0 | Check all dimension parameters |
| DIMENSION_MISMATCH | 2 | Reserved | Not currently used |
| NULL_POINTER | 3 | Null pointer passed | Verify all allocations |
| CUDA_ERROR | 4 | CUDA runtime error | Check CUDA context and memory |
| OUT_OF_MEMORY | 5 | GPU OOM | Reduce problem size or free memory |
| UNSUPPORTED_HEAD_DIM | 6 | head_dim not in {32, 64, 128} | Use a supported dimension |
| UNSUPPORTED_DTYPE | 7 | Data type not supported | Use float or half |

```cpp
// Comprehensive error handling example
auto err = cuflash::flash_attention_forward(...);
switch (err) {
case cuflash::FlashAttentionError::SUCCESS:
break;
case cuflash::FlashAttentionError::OUT_OF_MEMORY:
std::cerr << "GPU out of memory. Try reducing batch size.\n";
break;
case cuflash::FlashAttentionError::UNSUPPORTED_HEAD_DIM:
std::cerr << "head_dim must be 32, 64, or 128.\n";
break;
default:
std::cerr << "Error: " << cuflash::get_error_string(err) << "\n";
}
```

## Getting Help
### Before Asking

- **Check the Error Code:** Use `get_error_string()` for a detailed message
- **Verify Your Setup:** Run `nvidia-smi` and `nvcc --version`
- **Test Basic Functionality:** Run the built-in tests with `ctest`
### Reporting Issues
When reporting issues, include:
1. **System Information:**

   ```bash
   nvidia-smi
   nvcc --version
   cmake --version
   ```

2. **Error Output:** Full error message and stack trace

3. **Minimal Reproduction:** A small code snippet that triggers the issue

4. **Build Configuration:** The CMake cache or preset used