LLM-Speed

v0.3.0 — CUDA kernels for LLM inference

A focused CUDA kernel library implementing the FlashAttention forward pass, Tensor Core GEMM acceleration, and seamless PyTorch integration. Designed for efficient LLM inference on modern GPUs.


Key Features

Optimized CUDA kernels for modern LLM inference with memory-efficient algorithms and hardware acceleration

FlashAttention — O(N) memory complexity with an online softmax algorithm. Supports causal masking for autoregressive models.

🔢 Tensor Core GEMM — Hardware-accelerated matrix multiplication using the WMMA API. FP16 inputs with FP32 accumulation.

🐍 PyTorch Integration — Seamless integration with PyTorch via pybind11. Native CUDA tensor support.

🔄 Double Buffering — Compute/memory overlap with pipelined execution. Async copy on Ampere and newer architectures.

🏦 Bank-Conflict-Free — Carefully designed shared memory layouts with padding to eliminate bank conflicts.

📊 Property Testing — Comprehensive tests with Hypothesis for correctness verification across edge cases.
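The online softmax behind the FlashAttention feature can be illustrated in a few lines of NumPy. This is a sketch of the algorithmic idea (a single streaming pass with a running max and rescaled partial sums), not the CUDA kernel itself; the function name is ours:

```python
import numpy as np

def online_softmax_weighted_sum(scores, values):
    """Compute softmax(scores) @ values in one streaming pass,
    never materializing the full probability vector — the online
    softmax idea used by FlashAttention (illustrative sketch)."""
    m = -np.inf                       # running max, for numerical stability
    l = 0.0                           # running sum of exp(score - m)
    acc = np.zeros_like(values[0], dtype=np.float64)
    for s, v in zip(scores, values):
        m_new = max(m, s)
        # rescale previously accumulated results to the new max
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        l = l * scale + np.exp(s - m_new)
        acc = acc * scale + np.exp(s - m_new) * v
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
scores = rng.normal(size=128)
values = rng.normal(size=(128, 64))

streamed = online_softmax_weighted_sum(scores, values)
# reference: materialize the full softmax, then do the weighted sum
p = np.exp(scores - scores.max()); p /= p.sum()
reference = p @ values
print(np.allclose(streamed, reference))  # True
```

Because each step only rescales the running accumulator, peak extra state is O(head_dim) per query instead of O(N), which is what makes the O(N) total memory possible.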

Memory-Efficient Design

FlashAttention implements online softmax with O(N) memory complexity instead of O(N²) for standard attention.

| Sequence Length | Standard Attention | FlashAttention | Memory Savings |
|---|---|---|---|
| 1024 | 4 MB (full attention matrix) | 0.25 MB (streaming) | 16× |
| 4096 | 64 MB (full attention matrix) | 1 MB (streaming) | 64× |
| 8192 | 256 MB (full attention matrix) | 2 MB (streaming) | 128× |

Assumes a single attention head, FP32 attention scores, and batch size 1. Exact savings depend on hardware and kernel implementation.
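The table's figures follow from the matrix shapes. Here is the back-of-envelope model (our assumptions: FP32 scores, a single head, batch 1, head_dim = 64; the streaming term models the O(N · head_dim) output/statistics state, not the kernel's exact working set):

```python
# Rough memory model behind the table above.
# Assumptions (ours): FP32 scores, single head, batch 1, head_dim = 64.
BYTES_FP32 = 4
HEAD_DIM = 64

def standard_attn_mb(seq_len):
    # standard attention materializes the full N x N score matrix
    return seq_len ** 2 * BYTES_FP32 / 2**20

def flash_attn_mb(seq_len):
    # FlashAttention streams over K/V blocks, keeping O(N * head_dim) state
    return seq_len * HEAD_DIM * BYTES_FP32 / 2**20

for n in (1024, 4096, 8192):
    ratio = standard_attn_mb(n) / flash_attn_mb(n)
    print(f"{n}: {standard_attn_mb(n):.0f} MB vs {flash_attn_mb(n):.2f} MB -> {ratio:.0f}x")
# 1024: 4 MB vs 0.25 MB -> 16x
# 4096: 64 MB vs 1.00 MB -> 64x
# 8192: 256 MB vs 2.00 MB -> 128x
```

Note the savings ratio is simply seq_len / head_dim, so longer sequences benefit more.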

Quick Example

Get started with just a few lines of code

flash_attention.py
import torch
from cuda_llm_ops import flash_attention

# Create inputs
batch, heads = 2, 8
seq_len, head_dim = 2048, 64

q = torch.randn(batch, heads, seq_len, head_dim,
                device='cuda', dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# O(N) memory attention!
output = flash_attention(q, k, v, is_causal=True)
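For reference, a causal `flash_attention` call is expected to match (up to FP16 tolerance) the plain O(N²) attention below. This is an illustrative NumPy sketch of the semantics, not library code; `reference_causal_attention` is our own name:

```python
import numpy as np

def reference_causal_attention(q, k, v):
    """Plain O(N^2) causal attention for a single head; a causal
    flash_attention call should match this numerically while
    using O(N) memory (illustrative reference, not library code)."""
    n, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    # causal mask: position i may only attend to positions <= i
    scores[np.triu_indices(n, k=1)] = -np.inf
    p = np.exp(scores - scores.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4))
k = rng.normal(size=(8, 4))
v = rng.normal(size=(8, 4))
out = reference_causal_attention(q, k, v)

# position 0 can only attend to itself, so its output is exactly v[0]
print(np.allclose(out[0], v[0]))  # True
```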

tensor_core_gemm.py
import torch
from cuda_llm_ops import tensor_core_gemm

# Matrix multiplication
a = torch.randn(1024, 512, device='cuda',
                dtype=torch.float16)
b = torch.randn(512, 1024, device='cuda',
                dtype=torch.float16)

# Hardware accelerated GEMM
# FP16 input → FP32 output
c = tensor_core_gemm(a, b)
print(c.dtype)  # torch.float32
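Why accumulate in FP32? FP16 has only ~11 bits of mantissa, so a long running sum in FP16 stalls once the total dwarfs each addend. A small NumPy demonstration of the effect (not library code):

```python
import numpy as np

# Summing 4096 copies of 0.1 (true total ~409.5).
terms = np.full(4096, 0.1, dtype=np.float16)

acc16 = np.float16(0.0)
for t in terms:                      # FP16 accumulator, as a naive kernel might use
    acc16 = np.float16(acc16 + t)    # rounding error: the sum stalls early

acc32 = terms.astype(np.float32).sum()  # FP32 accumulation stays accurate

print(float(acc16), float(acc32))
```

This is the reason the GEMM kernel takes FP16 inputs (compact, Tensor Core friendly) but returns an FP32 result.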

GPU Architecture Support

Optimized for Ampere (A100, RTX 30) and newer. Forward compatibility with Hopper and future architectures.

| Architecture | Tensor Core Support | Status |
|---|---|---|
| Ampere (A100, RTX 30) | WMMA with FP16, BF16, TF32 | ✅ Primary target |
| Hopper (H100) | WMMA with FP16, BF16, FP8 | ✅ Supported |
| Volta (V100) | WMMA with FP16 | ⚠️ Limited |
| Turing (T4, RTX 20) | WMMA with FP16, INT8 | ⚠️ Limited |
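To check where a given GPU falls in this table, you can map its CUDA compute capability (e.g. from `torch.cuda.get_device_capability()`) to a support tier. The helper below is hypothetical (not part of cuda_llm_ops); the compute capability values themselves are NVIDIA's published numbers:

```python
# Hypothetical helper (not part of cuda_llm_ops): map a CUDA compute
# capability, e.g. torch.cuda.get_device_capability(), to the tiers above.
def support_tier(major, minor):
    cc = major * 10 + minor
    if cc >= 90:                # Hopper (sm_90) and newer
        return "supported"
    if cc >= 80:                # Ampere (sm_80 / sm_86), Ada (sm_89)
        return "primary"
    if cc in (70, 75):          # Volta (sm_70), Turing (sm_75)
        return "limited"
    return "unsupported"        # pre-Volta GPUs have no Tensor Cores

print(support_tier(8, 0))  # A100 -> primary
```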

Documentation

Comprehensive guides in English and Chinese

Start using LLM-Speed

Three clear paths to get value from this project: