LLM-Speed
A focused CUDA kernel library implementing the FlashAttention forward pass, Tensor Core GEMM acceleration, and seamless PyTorch integration. Designed for efficient LLM inference on modern GPUs.
Key Features
Optimized CUDA kernels for modern LLM inference with memory-efficient algorithms and hardware acceleration
FlashAttention
O(N) memory complexity via the online softmax algorithm. Supports causal masking for autoregressive models.
Tensor Core GEMM
Hardware-accelerated matrix multiplication using the WMMA API. FP16 input with FP32 accumulation.
PyTorch Integration
Seamless integration with PyTorch via pybind11. Native CUDA tensor support.
Double Buffering
Compute/memory overlap with pipelined execution. Async copy for Ampere+ architectures.
Bank Conflict Free
Carefully designed shared memory layouts with padding to eliminate bank conflicts.
Property Testing
Comprehensive tests with Hypothesis for correctness verification across edge cases.
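The tests themselves live in the repository; to give a flavour of the approach, a property test that checks flash_attention against a naive PyTorch reference might look roughly like this (shape ranges, tolerances, and the exact test layout here are illustrative, not the project's actual suite):

```python
import torch
from hypothesis import given, settings, strategies as st
from cuda_llm_ops import flash_attention

def reference_attention(q, k, v, is_causal):
    # Naive O(N^2) attention in FP32 as the ground truth.
    scores = (q.float() @ k.float().transpose(-2, -1)) * q.shape[-1] ** -0.5
    if is_causal:
        n = q.shape[-2]
        mask = torch.triu(torch.ones(n, n, device=q.device), diagonal=1).bool()
        scores = scores.masked_fill(mask, float('-inf'))
    return torch.softmax(scores, dim=-1) @ v.float()

@settings(deadline=None, max_examples=25)
@given(
    seq_len=st.integers(min_value=1, max_value=512),
    head_dim=st.sampled_from([32, 64, 128]),   # illustrative head sizes
    is_causal=st.booleans(),
)
def test_matches_reference(seq_len, head_dim, is_causal):
    q, k, v = (torch.randn(1, 2, seq_len, head_dim, device='cuda', dtype=torch.float16)
               for _ in range(3))
    out = flash_attention(q, k, v, is_causal=is_causal)
    ref = reference_attention(q, k, v, is_causal)
    assert torch.allclose(out.float(), ref, atol=1e-2, rtol=1e-2)
```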
Memory-Efficient Design
FlashAttention implements online softmax with O(N) memory complexity instead of O(N²) for standard attention.
| Sequence Length | Standard Attention | FlashAttention | Memory Savings |
|---|---|---|---|
| 1024 | 4 MB (full attention matrix) | 0.25 MB (streaming) | 16× |
| 4096 | 64 MB (full attention matrix) | 1 MB (streaming) | 64× |
| 8192 | 256 MB (full attention matrix) | 2 MB (streaming) | 128× |
Figures are per attention head, assuming head dimension 64, FP32 accumulation, and batch size 1: at N = 4096, the full attention matrix costs 4096² × 4 B ≈ 64 MB per head, while the streaming approach keeps only about N × 64 × 4 B ≈ 1 MB of running state. Exact savings depend on hardware and kernel implementation.
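For intuition, here is a minimal single-head sketch of the online softmax idea in plain PyTorch (FP32, no masking, illustrative only; the real kernel tiles Q/K/V through shared memory in CUDA). The key point is that the full N × N score matrix is never materialized: only per-row running max, running sum, and a partial output are kept.

```python
import torch

def streaming_attention(q, k, v, block_size=128):
    # Single-head attention computed block-by-block over K/V with online softmax.
    # Peak extra memory is the (N, block_size) score tile, never the full N x N matrix.
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float('-inf'))
    row_sum = torch.zeros(n, 1)
    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale                      # (N, block_size)
        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)
        correction = torch.exp(row_max - new_max)           # rescale old statistics
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(1024, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
print(torch.allclose(streaming_attention(q, k, v), ref, atol=1e-4))  # expected: True
```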
Quick Example
Get started with just a few lines of code
FlashAttention:
import torch
from cuda_llm_ops import flash_attention
# Create inputs
batch, heads = 2, 8
seq_len, head_dim = 2048, 64
q = torch.randn(batch, heads, seq_len, head_dim, device='cuda', dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
# O(N) memory attention!
output = flash_attention(q, k, v, is_causal=True)
Tensor Core GEMM:
import torch
from cuda_llm_ops import tensor_core_gemm
# Matrix multiplication
a = torch.randn(1024, 512, device='cuda', dtype=torch.float16)
b = torch.randn(512, 1024, device='cuda', dtype=torch.float16)
# Hardware accelerated GEMM
# FP16 input → FP32 output
c = tensor_core_gemm(a, b)
print(c.dtype) # torch.float32
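Because accumulation happens in FP32, the result should track a full-FP32 reference closely. A quick sanity check against PyTorch's own matmul (the tolerances here are illustrative, not a documented guarantee):

```python
import torch
from cuda_llm_ops import tensor_core_gemm

a = torch.randn(1024, 512, device='cuda', dtype=torch.float16)
b = torch.randn(512, 1024, device='cuda', dtype=torch.float16)

# Reference computed entirely in FP32; tensor_core_gemm takes FP16 inputs
# but accumulates in FP32, so the two should agree to well within 1e-2.
c = tensor_core_gemm(a, b)
reference = a.float() @ b.float()
print(torch.allclose(c, reference, atol=1e-2, rtol=1e-2))
```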
GPU Architecture Support
Optimized for Ampere (A100, RTX 30) and newer, with forward compatibility for Hopper and future architectures.
| Architecture | Tensor Core | Status |
|---|---|---|
| Ampere / Ada (A100, RTX 30/40) | WMMA with FP16, BF16, TF32 | ✅ Primary target |
| Hopper (H100) | WMMA with FP16, BF16, FP8 | ✅ Supported |
| Volta (V100) | WMMA with FP16 | ⚠️ Limited |
| Turing (T4, RTX 20) | WMMA with FP16, INT8 | ⚠️ Limited |
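To check which tier your GPU falls into, you can query its compute capability from PyTorch. The thresholds below are standard NVIDIA compute-capability values, not logic taken from this library:

```python
import torch

# 7.0 = Volta, 7.5 = Turing, 8.0/8.6 = Ampere, 8.9 = Ada, 9.0 = Hopper
major, minor = torch.cuda.get_device_capability()
if major >= 8:
    status = "primary target / supported"
elif major == 7:
    status = "limited support"
else:
    status = "no WMMA Tensor Core support"
print(f"{torch.cuda.get_device_name()}: sm_{major}{minor} -> {status}")
```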
Documentation
Comprehensive guides in English and Chinese
Quick Start
Get up and running in 5 minutes with installation and basic usage examples.
API Reference
Complete API documentation with parameters, examples, and error handling.
Architecture
Technical deep dive into CUDA kernels, optimization strategies, and implementation details.
Performance Guide
Optimization tips, benchmarking tools, and best practices for maximum performance.
Start using LLM-Speed
Three clear paths to get value from this project:
Get Started
Install and run your first FlashAttention or Tensor Core GEMM example in 5 minutes.
Understand Architecture
Explore kernel design, memory layout optimization, and Tensor Core utilization patterns.
Benchmark Locally
Run performance benchmarks on your GPU. See memory usage and speedups with provided tools.
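For a quick, rough timing before reaching for the provided benchmark tools, a minimal sketch using CUDA events looks like this (shapes and iteration counts are arbitrary; this is not the project's benchmarking harness):

```python
import torch
from cuda_llm_ops import flash_attention

q = torch.randn(1, 8, 4096, 64, device='cuda', dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Warm up so one-time initialization and launch overhead is excluded.
for _ in range(10):
    flash_attention(q, k, v, is_causal=True)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    flash_attention(q, k, v, is_causal=True)
end.record()
torch.cuda.synchronize()
print(f"mean latency: {start.elapsed_time(end) / 100:.3f} ms")
```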