LLM-Speed

v0.3.0 — CUDA kernels for LLM inference

A focused CUDA kernel library implementing the FlashAttention forward pass, Tensor Core GEMM acceleration, and seamless PyTorch integration. Designed for efficient LLM inference on modern GPUs.


Key Features

Optimized CUDA kernels for modern LLM inference with memory-efficient algorithms and hardware acceleration

FlashAttention — O(N) memory complexity with an online softmax algorithm. Supports causal masking for autoregressive models.

🔢 Tensor Core GEMM — Hardware-accelerated matrix multiplication using the WMMA API. FP16 inputs with FP32 accumulation.

🐍 PyTorch Integration — Seamless integration with PyTorch via pybind11. Native CUDA tensor support.

🔄 Double Buffering — Compute/memory overlap with pipelined execution. Async copy on Ampere and newer architectures.

🏦 Bank-Conflict-Free — Carefully designed shared memory layouts with padding to eliminate bank conflicts.

📊 Property Testing — Comprehensive tests with Hypothesis for correctness verification across edge cases.
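The online softmax behind the FlashAttention feature can be illustrated in a few lines of NumPy. This is a sketch of the algorithmic idea (a single streaming pass with a running max and rescaled partial sums), not the CUDA kernel itself; the function name is ours:

```python
import numpy as np

def online_softmax_weighted_sum(scores, values):
    """Compute softmax(scores) @ values in one streaming pass,
    never materializing the full probability vector — the online
    softmax idea used by FlashAttention (illustrative sketch)."""
    m = -np.inf                       # running max, for numerical stability
    l = 0.0                           # running sum of exp(score - m)
    acc = np.zeros_like(values[0], dtype=np.float64)
    for s, v in zip(scores, values):
        m_new = max(m, s)
        # rescale previously accumulated results to the new max
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        l = l * scale + np.exp(s - m_new)
        acc = acc * scale + np.exp(s - m_new) * v
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
scores = rng.normal(size=128)
values = rng.normal(size=(128, 64))

streamed = online_softmax_weighted_sum(scores, values)
# reference: materialize the full softmax, then do the weighted sum
p = np.exp(scores - scores.max()); p /= p.sum()
reference = p @ values
print(np.allclose(streamed, reference))  # True
```

Because each step only rescales the running accumulator, peak extra state is O(head_dim) per query instead of O(N), which is what makes the O(N) total memory possible.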

Memory-Efficient Design

FlashAttention implements online softmax with O(N) memory complexity instead of O(N²) for standard attention.

| Sequence Length | Standard Attention | FlashAttention | Memory Savings |
|---|---|---|---|
| 1024 | 4 MB (full attention matrix) | 0.25 MB (streaming) | 16× |
| 4096 | 64 MB (full attention matrix) | 1 MB (streaming) | 64× |
| 8192 | 256 MB (full attention matrix) | 2 MB (streaming) | 128× |

Assumes a single attention head, FP32 attention scores, and batch size 1. Exact savings depend on hardware and kernel implementation.
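The table's figures follow from the matrix shapes. Here is the back-of-envelope model (our assumptions: FP32 scores, a single head, batch 1, head_dim = 64; the streaming term models the O(N · head_dim) output/statistics state, not the kernel's exact working set):

```python
# Rough memory model behind the table above.
# Assumptions (ours): FP32 scores, single head, batch 1, head_dim = 64.
BYTES_FP32 = 4
HEAD_DIM = 64

def standard_attn_mb(seq_len):
    # standard attention materializes the full N x N score matrix
    return seq_len ** 2 * BYTES_FP32 / 2**20

def flash_attn_mb(seq_len):
    # FlashAttention streams over K/V blocks, keeping O(N * head_dim) state
    return seq_len * HEAD_DIM * BYTES_FP32 / 2**20

for n in (1024, 4096, 8192):
    ratio = standard_attn_mb(n) / flash_attn_mb(n)
    print(f"{n}: {standard_attn_mb(n):.0f} MB vs {flash_attn_mb(n):.2f} MB -> {ratio:.0f}x")
# 1024: 4 MB vs 0.25 MB -> 16x
# 4096: 64 MB vs 1.00 MB -> 64x
# 8192: 256 MB vs 2.00 MB -> 128x
```

Note the savings ratio is simply seq_len / head_dim, so longer sequences benefit more.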

Quick Example

Get started with just a few lines of code

flash_attention.py
import torch
from cuda_llm_ops import flash_attention

# Create inputs
batch, heads = 2, 8
seq_len, head_dim = 2048, 64

q = torch.randn(batch, heads, seq_len, head_dim,
                device='cuda', dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# O(N) memory attention!
output = flash_attention(q, k, v, is_causal=True)
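For reference, a causal `flash_attention` call is expected to match (up to FP16 tolerance) the plain O(N²) attention below. This is an illustrative NumPy sketch of the semantics, not library code; `reference_causal_attention` is our own name:

```python
import numpy as np

def reference_causal_attention(q, k, v):
    """Plain O(N^2) causal attention for a single head; a causal
    flash_attention call should match this numerically while
    using O(N) memory (illustrative reference, not library code)."""
    n, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    # causal mask: position i may only attend to positions <= i
    scores[np.triu_indices(n, k=1)] = -np.inf
    p = np.exp(scores - scores.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4))
k = rng.normal(size=(8, 4))
v = rng.normal(size=(8, 4))
out = reference_causal_attention(q, k, v)

# position 0 can only attend to itself, so its output is exactly v[0]
print(np.allclose(out[0], v[0]))  # True
```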

tensor_core_gemm.py
import torch
from cuda_llm_ops import tensor_core_gemm

# Matrix multiplication
a = torch.randn(1024, 512, device='cuda',
                dtype=torch.float16)
b = torch.randn(512, 1024, device='cuda',
                dtype=torch.float16)

# Hardware accelerated GEMM
# FP16 input → FP32 output
c = tensor_core_gemm(a, b)
print(c.dtype)  # torch.float32
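Why accumulate in FP32? FP16 has only ~11 bits of mantissa, so a long running sum in FP16 stalls once the total dwarfs each addend. A small NumPy demonstration of the effect (not library code):

```python
import numpy as np

# Summing 4096 copies of 0.1 (true total ~409.5).
terms = np.full(4096, 0.1, dtype=np.float16)

acc16 = np.float16(0.0)
for t in terms:                      # FP16 accumulator, as a naive kernel might use
    acc16 = np.float16(acc16 + t)    # rounding error: the sum stalls early

acc32 = terms.astype(np.float32).sum()  # FP32 accumulation stays accurate

print(float(acc16), float(acc32))
```

This is the reason the GEMM kernel takes FP16 inputs (compact, Tensor Core friendly) but returns an FP32 result.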

GPU Architecture Support

Optimized for Ampere (A100, RTX 30) and newer. Forward compatibility with Hopper and future architectures.

| Architecture | Tensor Core Support | Status |
|---|---|---|
| Ampere (A100, RTX 30) | WMMA with FP16, BF16, TF32 | ✅ Primary target |
| Hopper (H100) | WMMA with FP16, BF16, FP8 | ✅ Supported |
| Volta (V100) | WMMA with FP16 | ⚠️ Limited |
| Turing (T4, RTX 20) | WMMA with FP16, INT8 | ⚠️ Limited |
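To check where a given GPU falls in this table, you can map its CUDA compute capability (e.g. from `torch.cuda.get_device_capability()`) to a support tier. The helper below is hypothetical (not part of cuda_llm_ops); the compute capability values themselves are NVIDIA's published numbers:

```python
# Hypothetical helper (not part of cuda_llm_ops): map a CUDA compute
# capability, e.g. torch.cuda.get_device_capability(), to the tiers above.
def support_tier(major, minor):
    cc = major * 10 + minor
    if cc >= 90:                # Hopper (sm_90) and newer
        return "supported"
    if cc >= 80:                # Ampere (sm_80 / sm_86), Ada (sm_89)
        return "primary"
    if cc in (70, 75):          # Volta (sm_70), Turing (sm_75)
        return "limited"
    return "unsupported"        # pre-Volta GPUs have no Tensor Cores

print(support_tier(8, 0))  # A100 -> primary
```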

Documentation

Comprehensive guides in English and Chinese

Start using LLM-Speed

Three clear paths to get value from this project: