CUDA-native
Transformer Inference
A focused C++/CUDA inference engine built around W8A16 quantization, explicit KV cache management, and hand-tuned CUDA kernels. Small runtime surface, predictable architecture, and a public workflow that stays aligned with the code.
#include <tiny_llm/inference_engine.h>
// Configure model
ModelConfig config;
config.vocab_size = 32000;
config.hidden_dim = 4096;
config.num_layers = 32;
// Load with W8A16 weights
auto engine = InferenceEngine::load(
"model.bin", config).value();
// Generate with KV cache
GenerationConfig gen;
gen.max_new_tokens = 256;
gen.temperature = 0.7f;
auto output = engine.generate(prompt, gen);
Features
W8A16 Quantization
INT8 weights with FP16 activations cut weight memory by roughly 50% relative to FP16 while maintaining inference quality.
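A minimal sketch of the W8A16 idea, assuming per-row INT8 scales (the kernel name, layout, and scale granularity are illustrative, not tiny-llm's actual kernels): weights stay in INT8 in memory and are widened to FP16 only as they are consumed, so activations remain FP16 throughout.
#include <cuda_fp16.h>
#include <cstdint>
// Illustrative W8A16 dequantization: widen INT8 weights to FP16 using a
// hypothetical per-row scale. A production path would typically fuse this
// into the FP16 GEMM rather than materializing the FP16 matrix.
__global__ void dequantize_w8_to_fp16(const int8_t* __restrict__ w_int8,
                                      const float* __restrict__ row_scale,
                                      __half* __restrict__ w_fp16,
                                      int rows, int cols) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= rows * cols) return;
    int row = idx / cols;
    // Reconstruct the weight: w ~= scale[row] * q
    float w = row_scale[row] * static_cast<float>(w_int8[idx]);
    w_fp16[idx] = __float2half(w);
}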
Efficient KV Cache
Key-value cache with O(1) per-token updates for incremental decoding and dynamic allocation.
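A rough sketch of how O(1) incremental decoding works (the buffer names and layout are assumptions, not the engine's real data structures): each layer preallocates a [max_seq_len x num_heads x head_dim] buffer, and every decode step writes one token's K/V at a fixed offset, so no earlier entries are moved or reallocated.
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cstddef>
// Illustrative per-layer KV cache: append one decoded token's keys/values
// at offset seq_len. The write is a fixed-size device-to-device copy, so
// each decode step costs O(1) regardless of the sequence length so far.
struct LayerKVCache {
    __half* k;            // [max_seq_len, num_heads, head_dim] on the device
    __half* v;            // same shape as k
    int max_seq_len;
    int num_heads;
    int head_dim;
    int seq_len = 0;      // tokens written so far

    void append(const __half* k_token, const __half* v_token,
                cudaStream_t stream) {
        size_t token_elems = static_cast<size_t>(num_heads) * head_dim;
        size_t offset = static_cast<size_t>(seq_len) * token_elems;
        cudaMemcpyAsync(k + offset, k_token, token_elems * sizeof(__half),
                        cudaMemcpyDeviceToDevice, stream);
        cudaMemcpyAsync(v + offset, v_token, token_elems * sizeof(__half),
                        cudaMemcpyDeviceToDevice, stream);
        ++seq_len;
    }
};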
Quick Start
Requirements
- NVIDIA GPU: Compute Capability 7.0+ (Volta or newer); a quick runtime check is sketched after this list
- CUDA Toolkit: 11.0 or higher
- CMake: 3.18 or higher
- C++ Compiler: GCC 9+ or Clang 10+
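If you are unsure whether a GPU meets the Compute Capability 7.0+ requirement, a small standalone check like the one below (illustrative, not part of tiny-llm) queries it through the CUDA runtime API.
#include <cuda_runtime.h>
#include <cstdio>
// Standalone check (not part of tiny-llm): query device 0 and verify it
// reports Compute Capability 7.0 or newer before building the project.
int main() {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::fprintf(stderr, "No CUDA device found\n");
        return 1;
    }
    std::printf("GPU: %s, Compute Capability %d.%d\n",
                prop.name, prop.major, prop.minor);
    return prop.major >= 7 ? 0 : 1;
}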
Installation
# Clone repository
git clone https://github.com/LessUp/tiny-llm.git
cd tiny-llm
# Build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON
cmake --build build -j$(nproc)
# Run tests
ctest --test-dir build --output-on-failure --timeout 300
Documentation
- Quick Start: Get up and running in minutes
- Architecture: System design and components
- API Reference: Complete API documentation
- Developer Guide: Development and contribution
- Benchmarks: Performance metrics and profiling
- Troubleshooting: Common issues and solutions
Language Support
Documentation is available in multiple languages.
Engineering Highlights
- Quantization Path: W8A16 cuts weight memory roughly in half
- Kernel Path: hand-tuned, CUDA-native kernels
- Repository Workflow: OpenSpec plus targeted validation
Contributing
Tiny-LLM welcomes focused contributions. Start with the OpenSpec-aware Contributing Guide before making broad changes.
License
Distributed under the MIT License. See LICENSE for more information.