# Quick Start

Get up and running with the Tiny-LLM inference engine in minutes.
## Prerequisites

### System Requirements
| Component | Minimum | Recommended |
|---|---|---|
| CUDA Toolkit | 11.0 | 12.0+ |
| CMake | 3.18 | 3.25+ |
| C++ Compiler | GCC 9+ / Clang 10+ | GCC 11+ / Clang 14+ |
| GPU Compute Capability | SM 7.0 (Volta) | SM 8.0+ (Ampere+) |
| GPU Memory | 4 GB | 8 GB+ |
### Verify Your GPU

```bash
# Check CUDA version
nvcc --version

# Check GPU compute capability
nvidia-smi --query-gpu=compute_cap --format=csv
# Output should be 7.0 or higher
```
## Installation

### 1. Clone the Repository

```bash
git clone https://github.com/LessUp/tiny-llm.git
cd tiny-llm
```
### 2. Configure the Build

```bash
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
```
#### CMake Options

| Option | Default | Description |
|---|---|---|
| `CMAKE_BUILD_TYPE` | `Release` | Build type: `Debug`/`Release`/`RelWithDebInfo` |
| `BUILD_TESTS` | `ON` | Build the test suite |
| `CUDA_ARCH` | `native` | Target CUDA architectures (e.g., `75;80;86`) |
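For example, to target specific GPU architectures and skip the test suite, the options above can be combined in one configure step (architecture values here are examples; pick the ones matching your GPU's compute capability):

```bash
cmake .. -DCMAKE_BUILD_TYPE=Release -DCUDA_ARCH="75;80;86" -DBUILD_TESTS=OFF
```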
### 3. Build

```bash
make -j$(nproc)
```
### 4. Run the Tests

```bash
ctest --output-on-failure
```
### 5. Run the Demo

```bash
./tiny_llm_demo
```
## Quick Example

### Complete Inference Example
```cpp
#include <tiny_llm/inference_engine.h>

#include <iostream>
#include <vector>

int main() {
    // 1. Configure the model
    ModelConfig config;
    config.vocab_size = 32000;
    config.hidden_dim = 4096;
    config.num_layers = 32;
    config.num_heads = 32;
    config.num_kv_heads = 32;   // GQA: use 8 for modern models
    config.head_dim = 128;
    config.intermediate_dim = 11008;
    config.max_seq_len = 2048;
    config.rope_theta = 10000.0f;

    // 2. Load the model
    auto result = InferenceEngine::load("path/to/model.bin", config);
    if (result.isErr()) {
        std::cerr << "Failed to load model: " << result.error() << std::endl;
        return 1;
    }
    auto engine = std::move(result.value());

    // 3. Configure generation
    GenerationConfig gen_config;
    gen_config.max_new_tokens = 256;
    gen_config.temperature = 0.7f;
    gen_config.top_p = 0.9f;
    gen_config.top_k = 50;
    gen_config.do_sample = true;

    // 4. Generate
    std::vector<int> prompt = {1, 15043, 29892};  // "Hello," tokens
    auto output = engine->generate(prompt, gen_config);

    // 5. Check statistics
    const auto& stats = engine->getStats();
    std::cout << "Generated " << stats.tokens_generated << " tokens\n"
              << "Speed: " << stats.tokens_per_second << " tok/s\n"
              << "Peak memory: " << stats.peak_memory_bytes / 1024 / 1024 << " MB\n";
    return 0;
}
```
### Using the KV Cache Directly

```cpp
#include <tiny_llm/kv_cache.h>

// Create the cache manager
KVCacheConfig cache_config;
cache_config.num_layers = 32;
cache_config.num_heads = 32;
cache_config.head_dim = 128;
cache_config.max_seq_len = 2048;
cache_config.max_batch_size = 1;

KVCacheManager kv_cache(cache_config);

// Allocate a sequence
auto seq_result = kv_cache.allocateSequence(1024);
if (seq_result.isErr()) {
    std::cerr << "Failed to allocate: " << seq_result.error() << std::endl;
    return 1;
}
int seq_id = seq_result.value();

// Use in transformer layers
for (auto& layer : layers) {
    layer.forward(hidden_states, kv_cache, seq_id, position, stream);
}

// After all layers, advance the sequence length
kv_cache.advanceSeqLen(seq_id, 1);

// Release when done
kv_cache.releaseSequence(seq_id);
```
## Model Format

### Custom Binary Format

Tiny-LLM currently uses a custom binary format with the following layout:
```text
┌─────────────────┬─────────────────────────────────────┐
│ Header (256B)   │ magic, version, config              │
├─────────────────┼─────────────────────────────────────┤
│ Token Embedding │ [vocab_size, hidden_dim] FP16       │
├─────────────────┼─────────────────────────────────────┤
│ Layer 0 Weights │ Attention + FFN weights (INT8)      │
│                 │ Scales (FP16)                       │
├─────────────────┼─────────────────────────────────────┤
│ ...             │                                     │
├─────────────────┼─────────────────────────────────────┤
│ Layer N-1       │                                     │
├─────────────────┼─────────────────────────────────────┤
│ Output Norm     │ [hidden_dim] FP16                   │
│ LM Head         │ [hidden_dim, vocab_size] FP16       │
└─────────────────┴─────────────────────────────────────┘
```
### Creating Model Files

See the Developer Guide for instructions on converting models to the Tiny-LLM format.
## Next Steps
- Architecture — Understand system design and components
- API Reference — Complete API documentation
- Benchmarks — Performance characteristics
- Troubleshooting — Common issues and solutions