Quick Start

Name: Tiny-LLM
Author: LessUp

Get up and running with the Tiny-LLM inference engine in minutes.

Prerequisites
Installation
Quick Example
Model Format
Next Steps

Prerequisites

System Requirements

Component	Minimum	Recommended
CUDA Toolkit	11.0	12.0+
CMake	3.18	3.25+
C++ Compiler	GCC 9+ / Clang 10+	GCC 11+ / Clang 14+
GPU Compute Capability	SM 7.0 (Volta)	SM 8.0+ (Ampere+)
GPU Memory	4 GB	8 GB+

Verify Your GPU

# Check CUDA version
nvcc --version

# Check GPU compute capability
nvidia-smi --query-gpu=compute_cap --format=csv
# Output should be 7.0 or higher

Installation

1. Clone Repository

git clone https://github.com/LessUp/tiny-llm.git
cd tiny-llm

2. Configure Build

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release

CMake Options

Option	Default	Description
CMAKE_BUILD_TYPE	Release	Build type: Debug, Release, or RelWithDebInfo
BUILD_TESTS	ON	Build the test suite
CUDA_ARCH	native	Target CUDA architectures (e.g., 75;80;86)
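
The architecture numbers in the CUDA_ARCH example are compute capabilities with the dot removed (7.5 becomes 75). As a rough illustration, the sketch below turns values such as those printed by nvidia-smi into a semicolon-separated list and skips anything below the SM 7.0 minimum from the requirements table. Both function names are hypothetical helpers written for this guide; they are not part of Tiny-LLM.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper (not part of Tiny-LLM): converts a compute capability
// string such as "8.6" into the architecture number 86. Returns std::nullopt
// unless the text has the form "<1-3 digit major>.<1 digit minor>".
std::optional<int> archFromComputeCap(const std::string& cap) {
    const std::size_t dot = cap.find('.');
    if (dot == std::string::npos || dot == 0 || dot > 3 || cap.size() != dot + 2) {
        return std::nullopt;
    }
    for (std::size_t i = 0; i < cap.size(); ++i) {
        if (i != dot && (cap[i] < '0' || cap[i] > '9')) return std::nullopt;
    }
    const int major = std::stoi(cap.substr(0, dot));
    const int minor = cap[dot + 1] - '0';
    return major * 10 + minor;
}

// Builds a candidate value for -DCUDA_ARCH. Entries that are malformed or
// below SM 7.0 (the documented minimum) are skipped.
std::string buildCudaArchList(const std::vector<std::string>& caps) {
    constexpr int kMinArch = 70;
    std::string out;
    for (const auto& cap : caps) {
        const std::optional<int> arch = archFromComputeCap(cap);
        if (!arch || *arch < kMinArch) continue;
        assert(*arch >= kMinArch);  // invariant: only supported targets are emitted
        if (!out.empty()) out += ';';
        out += std::to_string(*arch);
    }
    return out;
}
```

For example, passing the capabilities 7.5, 8.0, 6.1, and 8.6 yields 75;80;86, with 6.1 dropped because it is below the minimum.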
 
 

3. Build

make -j$(nproc)

4. Run Tests

ctest --output-on-failure

5. Run Demo

./tiny_llm_demo

Quick Example

Complete Inference Example

#include <tiny_llm/inference_engine.h>

#include <iostream>
#include <utility>  // std::move
#include <vector>

int main() {
    // 1. Configure model
    ModelConfig config;
    config.vocab_size = 32000;
    config.hidden_dim = 4096;
    config.num_layers = 32;
    config.num_heads = 32;
    config.num_kv_heads = 32;      // equals num_heads (standard MHA); use fewer, e.g. 8, for GQA
    config.head_dim = 128;
    config.intermediate_dim = 11008;
    config.max_seq_len = 2048;
    config.rope_theta = 10000.0f;
    
    // 2. Load model
    auto result = InferenceEngine::load("path/to/model.bin", config);
    if (result.isErr()) {
        std::cerr << "Failed to load model: " << result.error() << std::endl;
        return 1;
    }
    auto engine = std::move(result.value());
    
    // 3. Configure generation
    GenerationConfig gen_config;
    gen_config.max_new_tokens = 256;
    gen_config.temperature = 0.7f;
    gen_config.top_p = 0.9f;
    gen_config.top_k = 50;
    gen_config.do_sample = true;
    
    // 4. Generate
    std::vector<int> prompt = {1, 15043, 29892};  // "Hello," tokens
    auto output = engine->generate(prompt, gen_config);
    
    // 5. Check statistics
    const auto& stats = engine->getStats();
    std::cout << "Generated " << stats.tokens_generated << " tokens\n"
              << "Speed: " << stats.tokens_per_second << " tok/s\n"
              << "Peak memory: " << stats.peak_memory_bytes / 1024 / 1024 << " MB\n";
    
    return 0;
}
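
The temperature, top_k, and top_p fields interact, so it can help to see one common way they are combined. The sketch below is an illustration only and is not taken from Tiny-LLM's sampler: the function name filterCandidates and the exact ordering (temperature, then top-k, then top-p over the unrenormalized probabilities) are assumptions, and real implementations differ in these details.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative only, NOT Tiny-LLM's sampler. Returns the (token id, probability)
// pairs that survive temperature scaling, top-k, and top-p filtering, sorted by
// descending probability and renormalized so they sum to 1.
std::vector<std::pair<int, float>> filterCandidates(
    const std::vector<float>& logits, float temperature, int top_k, float top_p) {
    assert(!logits.empty() && temperature > 0.0f);

    // Softmax over temperature-scaled logits (max subtracted for stability).
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<std::pair<int, float>> cands;
    cands.reserve(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        const float p = std::exp((logits[i] - max_logit) / temperature);
        cands.emplace_back(static_cast<int>(i), p);
        sum += p;
    }
    for (auto& c : cands) c.second /= sum;

    // Sort by descending probability; ties keep their original token order.
    std::stable_sort(cands.begin(), cands.end(),
                     [](const auto& a, const auto& b) { return a.second > b.second; });

    // Top-k: keep at most top_k candidates (top_k <= 0 disables the limit).
    if (top_k > 0 && static_cast<std::size_t>(top_k) < cands.size()) {
        cands.resize(static_cast<std::size_t>(top_k));
    }

    // Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    float cumulative = 0.0f;
    std::size_t keep = 0;
    while (keep < cands.size()) {
        cumulative += cands[keep].second;
        ++keep;
        if (cumulative >= top_p) break;
    }
    cands.resize(keep);

    // Renormalize the survivors.
    float kept = 0.0f;
    for (const auto& c : cands) kept += c.second;
    for (auto& c : cands) c.second /= kept;
    return cands;
}
```

A sampler would then draw one token from the returned probabilities, for instance with std::discrete_distribution. Lower temperature sharpens the distribution before filtering, so fewer tokens tend to survive the top_p cut.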

Using KV Cache Directly

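#include <tiny_llm/kv_cache.h>

#include <iostream>

// Fragment: intended to run inside a function that returns int. `layers`,
// `hidden_states`, `position`, and `stream` are assumed to be set up by the
// surrounding code.

// Create cache manager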
KVCacheConfig cache_config;
cache_config.num_layers = 32;
cache_config.num_heads = 32;
cache_config.head_dim = 128;
cache_config.max_seq_len = 2048;
cache_config.max_batch_size = 1;

KVCacheManager kv_cache(cache_config);

// Allocate a sequence
auto seq_result = kv_cache.allocateSequence(1024);
if (seq_result.isErr()) {
    std::cerr << "Failed to allocate: " << seq_result.error() << std::endl;
    return 1;
}
int seq_id = seq_result.value();

// Use in transformer layers
for (auto& layer : layers) {
    layer.forward(hidden_states, kv_cache, seq_id, position, stream);
}

// After all layers, advance sequence length
kv_cache.advanceSeqLen(seq_id, 1);

// Release when done
kv_cache.releaseSequence(seq_id);
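
When choosing max_seq_len and max_batch_size, a back-of-the-envelope memory estimate is useful. The sketch below multiplies out the cache dimensions from the configuration above. The struct KVCacheShape and the function estimateKVCacheBytes are hypothetical, and the default of 2 bytes per element assumes FP16 storage, which this guide does not state; the real allocation may also differ because of alignment or paging.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the KVCacheConfig fields shown above.
struct KVCacheShape {
    int num_layers;
    int num_heads;
    int head_dim;
    int max_seq_len;
    int max_batch_size;
};

// Rough estimate: one key and one value vector (factor 2) per layer, head,
// position, and batch slot. bytes_per_element = 2 assumes FP16 storage.
std::uint64_t estimateKVCacheBytes(const KVCacheShape& s,
                                   std::uint64_t bytes_per_element = 2) {
    assert(s.num_layers > 0 && s.num_heads > 0 && s.head_dim > 0 &&
           s.max_seq_len > 0 && s.max_batch_size > 0);
    return 2ULL * static_cast<std::uint64_t>(s.num_layers) *
           static_cast<std::uint64_t>(s.num_heads) *
           static_cast<std::uint64_t>(s.head_dim) *
           static_cast<std::uint64_t>(s.max_seq_len) *
           static_cast<std::uint64_t>(s.max_batch_size) * bytes_per_element;
}
```

Under these assumptions, the configuration above (32 layers, 32 heads, head_dim 128, 2048 positions, batch size 1) comes to 1,073,741,824 bytes, or 1 GiB. With 8 heads instead of 32, the estimate drops to a quarter of that.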

Model Format

Custom Binary Format

Tiny-LLM currently uses a custom binary format with the following layout:

┌─────────────────┬─────────────────────────────────────┐
│ Header (256B)   │ magic, version, config              │
├─────────────────┼─────────────────────────────────────┤
│ Token Embedding │ [vocab_size, hidden_dim] FP16       │
├─────────────────┼─────────────────────────────────────┤
│ Layer 0 Weights │ Attention + FFN weights (INT8)      │
│                 │ Scales (FP16)                       │
├─────────────────┼─────────────────────────────────────┤
│ ...             │                                     │
├─────────────────┼─────────────────────────────────────┤
│ Layer N-1       │                                     │
├─────────────────┼─────────────────────────────────────┤
│ Output Norm     │ [hidden_dim] FP16                   │
│ LM Head         │ [hidden_dim, vocab_size] FP16       │
└─────────────────┴─────────────────────────────────────┘
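
The sections with explicit shapes in this layout have sizes that follow directly from the model configuration, which gives a quick lower bound for sanity-checking a model file. The sketch below computes those sizes; the names are hypothetical, and it assumes the sections are packed without padding, which the layout does not specify. Per-layer sizes are left out because the diagram does not give the individual tensor shapes or the granularity of the scales.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kHeaderBytes = 256;  // "Header (256B)" in the layout
constexpr std::uint64_t kFp16Bytes = 2;      // size of one FP16 value

// Byte sizes of the sections whose shapes appear in the layout diagram.
struct FixedSectionSizes {
    std::uint64_t header;
    std::uint64_t token_embedding;  // [vocab_size, hidden_dim] FP16
    std::uint64_t output_norm;      // [hidden_dim] FP16
    std::uint64_t lm_head;          // [hidden_dim, vocab_size] FP16

    std::uint64_t total() const {
        return header + token_embedding + output_norm + lm_head;
    }
};

// Hypothetical helper: assumes no padding or alignment between sections.
FixedSectionSizes fixedSectionSizes(std::uint64_t vocab_size, std::uint64_t hidden_dim) {
    assert(vocab_size > 0 && hidden_dim > 0);
    FixedSectionSizes s{};
    s.header = kHeaderBytes;
    s.token_embedding = vocab_size * hidden_dim * kFp16Bytes;
    s.output_norm = hidden_dim * kFp16Bytes;
    s.lm_head = hidden_dim * vocab_size * kFp16Bytes;
    return s;
}
```

For the configuration used in the inference example (vocab_size 32000, hidden_dim 4096), the embedding and the LM head are 262,144,000 bytes each, and the fixed sections total 524,296,448 bytes before any layer weights are counted.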

Creating Model Files

See the Developer Guide for instructions on converting models to the Tiny-LLM format.

Next Steps

Architecture: Understand the system design and components
API Reference: Complete API documentation
Benchmarks: Performance characteristics
Troubleshooting: Common issues and solutions
 