CUDA-native Transformer Inference

A focused C++/CUDA inference engine built around W8A16 quantization, explicit KV cache management, and hand-tuned CUDA kernels. It keeps a small runtime surface, a predictable architecture, and an OpenSpec-governed public workflow that stays aligned with the code.

~50% Memory Reduction | KV Cache Incremental Decoding | OpenSpec Governed Repository
example.cpp
#include <tiny_llm/inference_engine.h>

// Configure model
ModelConfig config;
config.vocab_size = 32000;
config.hidden_dim = 4096;
config.num_layers = 32;

// Load with W8A16 weights
auto engine = InferenceEngine::load(
    "model.bin", config).value();

// Generate with KV cache
GenerationConfig gen;
gen.max_new_tokens = 256;
gen.temperature = 0.7f;

auto output = engine.generate(prompt, gen);

Features

W8A16 Quantization

INT8 weights paired with FP16 activations halve weight storage relative to FP16 (1 byte per weight instead of 2) while maintaining inference quality.

Stable
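
The headline number is simple arithmetic: an INT8 weight takes 1 byte where an FP16 weight takes 2. The CUDA sketch below shows the on-the-fly dequantization idea inside a GEMV; the kernel name, signature, and per-row scale layout are illustrative assumptions, not the engine's actual API.

#include <cuda_fp16.h>
#include <cstdint>

// Hypothetical W8A16 GEMV: INT8 weights are widened to float in
// registers and rescaled per row, so the memory saving costs no extra
// arithmetic passes over global memory.
__global__ void w8a16_gemv(const int8_t* __restrict__ W,    // [rows, cols] INT8 weights
                           const __half* __restrict__ x,     // [cols] FP16 activations
                           const float* __restrict__ scale,  // [rows] per-row dequant scales
                           __half* __restrict__ y,           // [rows] FP16 output
                           int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float acc = 0.0f;
    for (int c = 0; c < cols; ++c)
        acc += static_cast<float>(W[row * cols + c]) * __half2float(x[c]);
    y[row] = __float2half(acc * scale[row]);  // apply the scale once per row
}
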
💾 Efficient KV Cache

Key-value cache with O(1) per-token appends during incremental decoding and dynamic allocation, so past attention states are reused instead of recomputed each step.

Stable
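
A minimal host-side sketch of the mechanism, assuming a single head and a flat float layout (the struct and its names are illustrative, not the engine's cache): each decode step appends exactly one K/V row, so the update is O(1) and attention simply reads the rows cached so far.

#include <algorithm>
#include <cstddef>
#include <vector>

struct KVCacheSketch {
    std::vector<float> k, v;                  // [capacity, head_dim], flattened
    std::size_t head_dim, capacity, len = 0;

    KVCacheSketch(std::size_t cap, std::size_t dim)
        : k(cap * dim), v(cap * dim), head_dim(dim), capacity(cap) {}

    // Append this step's key/value vectors; a real cache would grow or
    // page when len reaches capacity (the "dynamic allocation" part).
    void append(const float* k_t, const float* v_t) {
        std::copy(k_t, k_t + head_dim, k.begin() + len * head_dim);
        std::copy(v_t, v_t + head_dim, v.begin() + len * head_dim);
        ++len;                                // attention now reads rows [0, len)
    }
};
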
🔧 Optimized CUDA Kernels

Hand-tuned kernels that combine shared-memory tiling with warp-level primitives, optimized for inference workloads.

Stable
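
As an illustration of the style (not the engine's actual kernels), the sketch below computes a block-wide sum by combining warp shuffle intrinsics with shared-memory staging; production kernels apply the same ingredients to tiled matrix multiplies.

// Butterfly sum across the 32 lanes of a warp, entirely in registers:
// no shared memory and no __syncthreads() needed inside a warp.
__inline__ __device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}

// Block-level sum: per-warp partials are staged through shared memory,
// then the first warp reduces them and lane 0 accumulates globally.
__global__ void block_reduce_sum(const float* __restrict__ in,
                                 float* __restrict__ out, int n) {
    __shared__ float warp_sums[32];           // one slot per warp
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (i < n) ? in[i] : 0.0f;
    val = warp_reduce_sum(val);
    if ((threadIdx.x & 31) == 0)
        warp_sums[threadIdx.x >> 5] = val;
    __syncthreads();
    if (threadIdx.x < 32) {
        int num_warps = (blockDim.x + 31) >> 5;
        val = (threadIdx.x < num_warps) ? warp_sums[threadIdx.x] : 0.0f;
        val = warp_reduce_sum(val);
        if (threadIdx.x == 0) atomicAdd(out, val);
    }
}
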
🎲 Advanced Sampling

Greedy, temperature, top-k, and top-p sampling implemented as reusable engine utilities.

Stable
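
As a compact sketch of how temperature and top-k compose over raw logits (names and signatures are illustrative, not the engine's utilities; assumes k is at most the vocabulary size):

#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

int sample_top_k(std::vector<float> logits, float temperature, int k,
                 std::mt19937& rng) {
    for (float& l : logits) l /= temperature;   // <1 sharpens, >1 flattens
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    // Keep only the k highest-logit token ids.
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });
    std::vector<float> weights(k);
    float max_logit = logits[idx[0]];
    for (int i = 0; i < k; ++i)                 // stable exp; the distribution normalizes
        weights[i] = std::exp(logits[idx[i]] - max_logit);
    std::discrete_distribution<int> pick(weights.begin(), weights.end());
    return idx[pick(rng)];                      // token id in the full vocabulary
}

Greedy decoding is the k = 1 special case, and top-p replaces the fixed k with a cumulative-probability cutoff over the sorted distribution.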

Quick Start

Requirements

  • NVIDIA GPU: Compute Capability 7.0+ (Volta or newer)
  • CUDA Toolkit: 11.0 or higher
  • CMake: 3.18 or higher
  • C++ Compiler: GCC 9+ or Clang 10+

Installation

class="highlight">
1
2
3
4
5
6
7
8
9
10
# Clone repository
git clone https://github.com/LessUp/tiny-llm.git
cd tiny-llm

# Build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON
cmake --build build -j$(nproc)

# Run tests
ctest --test-dir build --output-on-failure --timeout 300

Documentation

  • 🚀 Quick Start: Get up and running in minutes
  • 🏗️ Architecture: System design and components
  • 📖 API Reference: Complete API documentation
  • 🔧 Developer Guide: Development and contribution
  • Benchmarks: Performance metrics and profiling
  • 🔍 Troubleshooting: Common issues and solutions


Language Support

Documentation is available in multiple languages.


Engineering Highlights

  • Quantization Path: W8A16 cuts weight memory roughly in half
  • Kernel Path: hand-tuned, CUDA-native kernels
  • Repository Workflow: OpenSpec governance plus targeted validation

Contributing

Tiny-LLM accepts focused contributions. Start with the OpenSpec-aware Contributing Guide before making broad changes.


License

Distributed under the MIT License. See LICENSE for more information.

