CUDA-native Transformer Inference

A focused C++/CUDA inference engine built around W8A16 quantization, explicit KV cache management, and hand-tuned CUDA kernels. It keeps a small runtime surface, a predictable architecture, and an OpenSpec-governed public workflow that stays aligned with the code.

~50% Memory Reduction | KV Cache Incremental Decoding | OpenSpec Governed Repository
example.cpp
#include <tiny_llm/inference_engine.h>

// Configure model
ModelConfig config;
config.vocab_size = 32000;
config.hidden_dim = 4096;
config.num_layers = 32;

// Load with W8A16 weights
auto engine = InferenceEngine::load(
    "model.bin", config).value();

// Generate with KV cache
GenerationConfig gen;
gen.max_new_tokens = 256;
gen.temperature = 0.7f;

auto output = engine.generate(prompt, gen);

Features

W8A16 Quantization

INT8 weights paired with FP16 activations halve weight storage relative to FP16 (1 byte per weight instead of 2) while maintaining inference quality.

Stable
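
The headline number is simple arithmetic: an INT8 weight takes 1 byte where an FP16 weight takes 2. The CUDA sketch below shows the on-the-fly dequantization idea inside a GEMV; the kernel name, signature, and per-row scale layout are illustrative assumptions, not the engine's actual API.

#include <cuda_fp16.h>
#include <cstdint>

// Hypothetical W8A16 GEMV: INT8 weights are widened to float in
// registers and rescaled per row, so the memory saving costs no extra
// arithmetic passes over global memory.
__global__ void w8a16_gemv(const int8_t* __restrict__ W,    // [rows, cols] INT8 weights
                           const __half* __restrict__ x,     // [cols] FP16 activations
                           const float* __restrict__ scale,  // [rows] per-row dequant scales
                           __half* __restrict__ y,           // [rows] FP16 output
                           int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float acc = 0.0f;
    for (int c = 0; c < cols; ++c)
        acc += static_cast<float>(W[row * cols + c]) * __half2float(x[c]);
    y[row] = __float2half(acc * scale[row]);  // apply the scale once per row
}
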
💾 Efficient KV Cache

Key-value cache with O(1) per-token appends during incremental decoding and dynamic allocation, so past attention states are reused instead of recomputed each step.

Stable
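
A minimal host-side sketch of the mechanism, assuming a single head and a flat float layout (the struct and its names are illustrative, not the engine's cache): each decode step appends exactly one K/V row, so the update is O(1) and attention simply reads the rows cached so far.

#include <algorithm>
#include <cstddef>
#include <vector>

struct KVCacheSketch {
    std::vector<float> k, v;                  // [capacity, head_dim], flattened
    std::size_t head_dim, capacity, len = 0;

    KVCacheSketch(std::size_t cap, std::size_t dim)
        : k(cap * dim), v(cap * dim), head_dim(dim), capacity(cap) {}

    // Append this step's key/value vectors; a real cache would grow or
    // page when len reaches capacity (the "dynamic allocation" part).
    void append(const float* k_t, const float* v_t) {
        std::copy(k_t, k_t + head_dim, k.begin() + len * head_dim);
        std::copy(v_t, v_t + head_dim, v.begin() + len * head_dim);
        ++len;                                // attention now reads rows [0, len)
    }
};
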
🔧 Optimized CUDA Kernels

Hand-tuned kernels that combine shared-memory tiling with warp-level primitives, optimized for inference workloads.

Stable
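
As an illustration of the style (not the engine's actual kernels), the sketch below computes a block-wide sum by combining warp shuffle intrinsics with shared-memory staging; production kernels apply the same ingredients to tiled matrix multiplies.

// Butterfly sum across the 32 lanes of a warp, entirely in registers:
// no shared memory and no __syncthreads() needed inside a warp.
__inline__ __device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}

// Block-level sum: per-warp partials are staged through shared memory,
// then the first warp reduces them and lane 0 accumulates globally.
__global__ void block_reduce_sum(const float* __restrict__ in,
                                 float* __restrict__ out, int n) {
    __shared__ float warp_sums[32];           // one slot per warp
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (i < n) ? in[i] : 0.0f;
    val = warp_reduce_sum(val);
    if ((threadIdx.x & 31) == 0)
        warp_sums[threadIdx.x >> 5] = val;
    __syncthreads();
    if (threadIdx.x < 32) {
        int num_warps = (blockDim.x + 31) >> 5;
        val = (threadIdx.x < num_warps) ? warp_sums[threadIdx.x] : 0.0f;
        val = warp_reduce_sum(val);
        if (threadIdx.x == 0) atomicAdd(out, val);
    }
}
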
🎲 Advanced Sampling

Greedy, temperature, top-k, and top-p sampling implemented as reusable engine utilities.

Stable
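
As a compact sketch of how temperature and top-k compose over raw logits (names and signatures are illustrative, not the engine's utilities; assumes k is at most the vocabulary size):

#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

int sample_top_k(std::vector<float> logits, float temperature, int k,
                 std::mt19937& rng) {
    for (float& l : logits) l /= temperature;   // <1 sharpens, >1 flattens
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    // Keep only the k highest-logit token ids.
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });
    std::vector<float> weights(k);
    float max_logit = logits[idx[0]];
    for (int i = 0; i < k; ++i)                 // stable exp; the distribution normalizes
        weights[i] = std::exp(logits[idx[i]] - max_logit);
    std::discrete_distribution<int> pick(weights.begin(), weights.end());
    return idx[pick(rng)];                      // token id in the full vocabulary
}

Greedy decoding is the k = 1 special case, and top-p replaces the fixed k with a cumulative-probability cutoff over the sorted distribution.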

Quick Start

Requirements

  • NVIDIA GPU: Compute Capability 7.0+ (Volta or newer)
  • CUDA Toolkit: 11.0 or higher
  • CMake: 3.18 or higher
  • C++ Compiler: GCC 9+ or Clang 10+

Installation

class="highlight">
1
2
3
4
5
6
7
8
9
10
# Clone repository
git clone https://github.com/LessUp/tiny-llm.git
cd tiny-llm

# Build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON
cmake --build build -j$(nproc)

# Run tests
ctest --test-dir build --output-on-failure --timeout 300

Documentation

  • 🚀 Quick Start: Get up and running in minutes
  • 🏗️ Architecture: System design and components
  • 📖 API Reference: Complete API documentation
  • 🔧 Developer Guide: Development and contribution
  • Benchmarks: Performance metrics and profiling
  • 🔍 Troubleshooting: Common issues and solutions


Language Support

Documentation is available in multiple languages.


Engineering Highlights

  • Quantization Path: W8A16 cuts weight memory roughly in half
  • Kernel Path: hand-tuned, CUDA-native kernels
  • Repository Workflow: OpenSpec governance plus targeted validation

Contributing

Tiny-LLM accepts focused contributions. Start with the OpenSpec-aware Contributing Guide before making broad changes.


License

Distributed under the MIT License. See LICENSE for more information.

