# Troubleshooting Guide

Solutions to common issues when using LLM-Speed.

## Table of Contents

- [Installation Issues](#installation-issues)
- [Runtime Errors](#runtime-errors)
- [Performance Issues](#performance-issues)
- [Numerical Issues](#numerical-issues)
- [Getting Help](#getting-help)
- [Resources](#resources)

## Installation Issues
### Baseline Environment Not Prepared

**Symptom:**

```
ImportError / ModuleNotFoundError during test collection
```

**Reason:** Validation started before the documented local Python environment was prepared.

**Fix:**

```bash
python3 -m venv .venv
. .venv/bin/activate
pip install -U pip setuptools wheel
pip install -r requirements.txt pytest hypothesis ruff pre-commit
```
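Before re-running the tests, a quick sanity check that the venv is active and the test dependencies resolved can save a round trip (a sketch; the module names mirror the `pip install` line above):

```python
# Sanity check: the interpreter should live inside .venv/ and the test
# dependencies installed above should resolve.
import importlib.util
import sys

print(f"Interpreter: {sys.executable}")  # expect a path under .venv/
for mod in ("pytest", "hypothesis", "ruff"):
    found = importlib.util.find_spec(mod) is not None
    print(f"{mod}: {'OK' if found else 'MISSING'}")
```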
### CUDA Not Found

**Error:**

```
RuntimeError: CUDA not available. Please check your CUDA installation.
```

**Solutions:**

1. **Verify CUDA installation:**
```bash
nvcc --version
nvidia-smi
```
2. **Check PyTorch CUDA support:**
```python
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
```
3. **Reinstall PyTorch with the correct CUDA version:**
```bash
# For CUDA 11.8
pip install torch --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121
```
### Build Errors
**Error:**

```
error: command 'gcc' failed with exit status 1
```
**Solutions:**
1. **Check GCC version:**
```bash
gcc --version  # Need GCC 9.0+
```
2. **Set CUDA architecture flags** (see the sketch after this list for deriving the value from your GPU):
```bash
# For a specific GPU architecture
CUDA_ARCHS="80" pip install -e .  # A100

# For multiple architectures
CUDA_ARCHS="75;80;86" pip install -e .
```
3. **Common fixes:**
```bash
# Clear build cache
rm -rf build/
rm -rf *.egg-info
# Rebuild with verbose output
pip install -e . --verbose
```
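If you are unsure which architecture value to pass, the compute capability PyTorch reports maps directly onto it (a small sketch; assumes a working PyTorch CUDA install):

```python
# Derive a CUDA_ARCHS value from the installed GPU:
# capability (8, 0) -> "80" for A100, (8, 6) -> "86", (7, 5) -> "75", ...
import torch

major, minor = torch.cuda.get_device_capability()
print(f'CUDA_ARCHS="{major}{minor}" pip install -e .')
```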
### Import Errors

**Error:**

```
ImportError: No module named 'cuda_llm_ops'
```

**Solutions:**

1. **Verify installation:**
```bash
pip list | grep cuda
python -c "import cuda_llm_ops; print(cuda_llm_ops.__version__)"
```
2. **Check the Python path:**
```python
import sys
print(sys.path)
```
3. **Reinstall:**
```bash
pip uninstall cuda_llm_ops
pip install -e .
```
## Runtime Errors

### CUDA Out of Memory

**Error:**

```
RuntimeError: CUDA out of memory. Tried to allocate X GB
```

**Solutions:**

1. **Use FlashAttention (O(N) memory) instead of naive attention** (a rough size estimate follows after this list):
```python
# Bad - may OOM for long sequences
from cuda_llm_ops import naive_attention
output = naive_attention(q, k, v)  # O(N²) memory

# Good - memory efficient
from cuda_llm_ops import flash_attention
output = flash_attention(q, k, v)  # O(N) memory
```
2. **Reduce batch size or sequence length:**
```python
# Check memory before the operation
print(torch.cuda.memory_summary())

# Try a smaller batch
batch_size = 2  # Instead of 8
```
3. **Clear the cache:**
```python
torch.cuda.empty_cache()
```
4. **Use mixed precision:**
```python
# Use FP16 instead of FP32
q = q.half()
```
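To see why the naive kernel hits OOM so much sooner than FlashAttention, estimate the size of the full score matrix it materializes (a back-of-the-envelope sketch; the shapes are illustrative):

```python
# Rough estimate of the [batch, heads, seq_len, seq_len] score matrix that
# naive attention materializes (FlashAttention never builds it in full).
batch, heads, seq_len = 8, 16, 8192
bytes_per_elem = 2  # FP16

scores_gib = batch * heads * seq_len * seq_len * bytes_per_elem / 1024**3
print(f"Naive attention score matrix: {scores_gib:.0f} GiB")  # ~16 GiB
```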
### Shared Memory Limit

**Error:**

```
RuntimeError: naive_attention: seq_len=4096 requires 16404 bytes shared memory,
but device max is 49152 bytes.
```

**Solution:** Use `flash_attention` or `tiled_attention` for long sequences:

```python
if seq_len > 2048:
    output = flash_attention(q, k, v)  # No shared memory limit
else:
    output = tiled_attention(q, k, v)
```
### Tensor Shape Mismatch

**Error:**

```
RuntimeError: K and V must have same shape
```

**Solution:** Ensure Q, K, V have identical shapes:

```python
print(f"Q shape: {q.shape}")
print(f"K shape: {k.shape}")
print(f"V shape: {v.shape}")

# All should be: [batch, heads, seq_len, head_dim]
assert q.shape == k.shape == v.shape
```
### Wrong Device

**Error:**

```
RuntimeError: Q must be on CUDA device
```

**Solution:** Move tensors to the GPU:

```python
# Check device
print(f"Q device: {q.device}")

# Move to CUDA if needed
q = q.cuda()

# Or during creation
q = torch.randn(..., device='cuda', dtype=torch.float16)
```
### Non-Contiguous Tensors

**Error:**

```
RuntimeError: Q must be contiguous
```

**Solution:**

```python
# Make contiguous
q = q.contiguous()

# Or during transpose
def safe_transpose(tensor, dim0, dim1):
    """Transpose and make contiguous."""
    return tensor.transpose(dim0, dim1).contiguous()
```
### Unsupported Data Type

**Error:**

```
RuntimeError: Only float32 and float16 are supported
```

**Solution:**

```python
# Convert to a supported dtype
q = q.half()   # FP16
# or
q = q.float()  # FP32
```
### Wrong Dimensions

**Error:**

```
RuntimeError: Q must be 4D tensor [batch, heads, seq_len, head_dim]
```

**Solution:**

```python
# Expected: [batch, heads, seq_len, head_dim]
print(f"Current shape: {q.shape}")
print(f"Dimensions: {q.dim()}")

# Reshape if needed
q = q.view(batch, heads, seq_len, head_dim)
```
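Activations often arrive as `[batch, seq_len, hidden]` from a transformer layer; a common way to get the expected 4D layout is to split the hidden dimension into heads and transpose (a sketch; assumes `hidden == heads * head_dim`):

```python
import torch

batch, seq_len, heads, head_dim = 2, 1024, 16, 64
x = torch.randn(batch, seq_len, heads * head_dim,
                device='cuda', dtype=torch.float16)

# [batch, seq_len, hidden] -> [batch, heads, seq_len, head_dim]
q = x.view(batch, seq_len, heads, head_dim).transpose(1, 2).contiguous()
print(q.shape)  # torch.Size([2, 16, 1024, 64])
```

The trailing `.contiguous()` also avoids the non-contiguous tensor error described above.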
### INT8 Tensor Core Not Available

**Error:**

```
RuntimeError: INT8 Tensor Core requires Turing+ architecture (SM 7.2+)
```

**Solution:** Check the GPU compute capability:

```python
import torch

capability = torch.cuda.get_device_capability()
print(f"Compute capability: {capability}")

if capability[0] > 7 or (capability[0] == 7 and capability[1] >= 2):
    # Turing or better
    from cuda_llm_ops import tensor_core_gemm_int8
    c = tensor_core_gemm_int8(a_int8, b_int8)
else:
    # Fall back to FP16
    from cuda_llm_ops import tensor_core_gemm
    c = tensor_core_gemm(a_fp16, b_fp16)
```
## Performance Issues

### Slow Execution

**Symptoms:**

- Operations take much longer than expected
- GPU utilization is low in `nvidia-smi`

**Solutions:**

1. **Use the optimal kernel for the sequence length:**
```python
seq_len = q.size(2)

if seq_len >= 512:
    output = flash_attention(q, k, v)   # Best for long sequences
elif seq_len >= 128:
    output = tiled_attention(q, k, v)   # Good for medium sequences
else:
    output = naive_attention(q, k, v)   # Okay for short sequences
```
2. **Check alignment:**
```python
def check_alignment(M, N, K):
    for dim, name in [(M, 'M'), (N, 'N'), (K, 'K')]:
        if dim % 16 != 0:
            print(f"Warning: {name}={dim} not aligned to 16")

check_alignment(1024, 512, 1024)
```
3. **Ensure warmup** (a full timing sketch follows after this list):
```python
# The GPU needs warmup for consistent timing
for _ in range(10):
    _ = flash_attention(q, k, v)
torch.cuda.synchronize()
```
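Once warmup is done, CUDA events measure on-device time and give more reliable numbers than wall-clock timing (a sketch; the shapes are illustrative):

```python
import torch
from cuda_llm_ops import flash_attention

q = torch.randn(8, 16, 2048, 64, device='cuda', dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Warmup so one-time setup costs do not distort the measurement
for _ in range(10):
    _ = flash_attention(q, k, v)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(100):
    _ = flash_attention(q, k, v)
end.record()
torch.cuda.synchronize()

print(f"Average latency: {start.elapsed_time(end) / 100:.3f} ms")
```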
### Low GPU Utilization

**Symptoms:**

- `nvidia-smi` shows utilization < 50%
- CPU bottleneck

**Solutions:**

1. **Increase the batch size:**
```python
# Too small
batch = 1
q = torch.randn(1, heads, seq_len, head_dim, device='cuda')

# Better
batch = 8
q = torch.randn(8, heads, seq_len, head_dim, device='cuda')
```
2. **Remove CPU-GPU synchronization:**
```python
# Bad - forces the CPU to wait
result = flash_attention(q, k, v)
torch.cuda.synchronize()
print(result.cpu())

# Better - batch operations, synchronize once
results = []
for _ in range(100):
    results.append(flash_attention(q, k, v))
torch.cuda.synchronize()
```
### Tensor Core Not Used

**Symptoms:**

- Performance significantly below cuBLAS
- Nsight Compute shows no Tensor Core usage

**Solutions:**

1. **Use the Tensor Core variant:**
```python
# Uses regular CUDA cores
c = gemm(a, b)

# Uses Tensor Cores
c = tensor_core_gemm(a, b)
```
2. **Ensure FP16 input:**
```python
# Must be FP16 for Tensor Cores
c = tensor_core_gemm(a.half(), b.half())
```
3. **Check alignment** (a padding sketch follows after this list):
```python
# All dimensions should be multiples of 16
M, K, N = 1024, 512, 1024  # Good: all divisible by 16
```
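If your problem sizes are not multiples of 16, one workaround is to zero-pad up to the next multiple and slice the result back afterwards (a hypothetical helper, not part of cuda_llm_ops):

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(x: torch.Tensor, multiple: int = 16) -> torch.Tensor:
    """Zero-pad the last two dims up to a multiple of `multiple`."""
    rows, cols = x.shape[-2], x.shape[-1]
    pad_rows = (-rows) % multiple
    pad_cols = (-cols) % multiple
    # F.pad pads the last dimension first: (left, right, top, bottom)
    return F.pad(x, (0, pad_cols, 0, pad_rows))

a = torch.randn(1000, 500, device='cuda', dtype=torch.float16)
print(pad_to_multiple(a).shape)  # torch.Size([1008, 512])
```

Remember to slice the padded rows and columns off the GEMM output afterwards.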
## Numerical Issues

### Precision Loss in FP16

**Symptoms:**

- Results differ significantly from the FP32 reference
- NaN or Inf values

**Solutions:**

1. **Use Tensor Core GEMM for FP32 accumulation:**
```python
# FP16 computation with FP32 accumulation
c = tensor_core_gemm(a_fp16, b_fp16)  # Returns FP32
```
2. **Scale values for FP16** (a short overflow demonstration follows after this list):
```python
# FP16 has a limited range [-65504, 65504]
# Scale down large values
scale = 1.0 / 256.0
q = q * scale
output = flash_attention(q, k, v)
output = output / scale
```
3. **Use gradient scaling (for training):**
```python
from torch.cuda.amp import GradScaler

scaler = GradScaler()

with torch.cuda.amp.autocast():
    output = flash_attention(q, k, v)

scaler.scale(loss).backward()
```
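The limited FP16 range in solution 2 is easy to demonstrate: values beyond ~65504 overflow to `inf`, which then propagates through later operations as `inf`/NaN:

```python
import torch

# 60000 fits in FP16; 70000 is beyond the representable range
x = torch.tensor([60000.0, 70000.0])
print(x.half())      # tensor([60000., inf], dtype=torch.float16)
print(x.half() * 2)  # both elements overflow to inf
```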
### Output Does Not Match PyTorch
**Symptoms:**
- Custom kernel output differs from `torch.nn.functional.scaled_dot_product_attention`
**Solutions:**
1. **Check tolerances:**
```python
torch.testing.assert_close(
    output_custom,
    output_torch,
    rtol=1e-3,  # Relative tolerance
    atol=1e-3   # Absolute tolerance
)
```
2. **Expected differences:** FP16 carries only ~3-4 decimal digits of precision, so small differences between implementations are normal. A comparison sketch against the PyTorch reference follows below.
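A quick way to check both the tolerance and the expected FP16 gap is to run the custom kernel and the PyTorch reference on the same inputs (a sketch; it assumes `flash_attention` applies the default 1/sqrt(head_dim) scaling and no causal mask, matching the `scaled_dot_product_attention` defaults):

```python
import torch
import torch.nn.functional as F
from cuda_llm_ops import flash_attention

q = torch.randn(2, 8, 512, 64, device='cuda', dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

out_custom = flash_attention(q, k, v)
out_ref = F.scaled_dot_product_attention(q, k, v)

# Max absolute difference; values around 1e-3 are typical in FP16
print((out_custom - out_ref).abs().max().item())
torch.testing.assert_close(out_custom, out_ref, rtol=1e-3, atol=1e-3)
```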
## Getting Help

### Diagnostic Script

Run this to collect system information:

```python
#!/usr/bin/env python3
import sys
import torch
import cuda_llm_ops

print("=" * 60)
print("System Information")
print("=" * 60)
print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")

if torch.cuda.is_available():
    print(f"Device count: {torch.cuda.device_count()}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")
    print(f"Compute capability: {torch.cuda.get_device_capability()}")

print(f"cuda_llm_ops version: {cuda_llm_ops.__version__}")

print("=" * 60)
print("Quick Test")
print("=" * 60)

try:
    q = torch.randn(2, 4, 64, 32, device='cuda', dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    output = cuda_llm_ops.flash_attention(q, k, v)
    print("✓ FlashAttention test passed")
except Exception as e:
    print(f"✗ FlashAttention test failed: {e}")

try:
    a = torch.randn(512, 512, device='cuda', dtype=torch.float16)
    b = torch.randn(512, 512, device='cuda', dtype=torch.float16)
    c = cuda_llm_ops.gemm(a, b)
    print("✓ GEMM test passed")
except Exception as e:
    print(f"✗ GEMM test failed: {e}")
```
### Submit an Issue

When reporting issues, please include:

- System information (from the script above)
- Minimal reproduction code
- Expected vs. actual behavior
- Full error message with stack trace

Example issue template:

## Environment
- GPU: NVIDIA A100
- CUDA: 12.1
- Python: 3.10
- PyTorch: 2.1.0
- cuda_llm_ops: 0.3.0

## Issue Description
FlashAttention fails with OOM on 8K sequence length

## Reproduction Code
```python
import torch
from cuda_llm_ops import flash_attention

q = torch.randn(2, 16, 8192, 64, device='cuda', dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
output = flash_attention(q, k, v)  # OOM here
```

## Error Message
```
RuntimeError: CUDA out of memory. Tried to allocate 4.00 GB
```
## Resources

- GitHub Issues: https://github.com/LessUp/llm-speed/issues
- Documentation: https://lessup.github.io/llm-speed/
- Discussions: https://github.com/LessUp/llm-speed/discussions