v2.0.0 — Major Refactoring

Release Date: March 9, 2026
Full Changelog: v1.0.0 → v2.0.0


⚠️ Breaking Changes

KVCache API Redesign

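Problem: The previous appendKV() implementation depended on layer call order. Layer 0 updated current_len and the other layers compensated for it, so calling layers in a different order could write to the wrong cache positions.

Solution: appendKV() is now stateless. Every layer writes at current_len, and the caller advances the length explicitly, once per step, with advanceSeqLen().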

Before (v1.x)

// Layer 0 would update current_len, other layers compensated
// Could break if layer order changed
kv_cache.appendKV(seq_id, layer_idx, k, v, num_tokens);

After (v2.0+)

// appendKV is stateless - all layers write at current_len
for (int i = 0; i < num_layers; i++) {
    layers[i]->forward(hidden_states, kv_cache, seq_id, position, stream);
}
// Explicitly advance length once after all layers
kv_cache.advanceSeqLen(seq_id, num_tokens);

Migration: Update any code using KVCacheManager directly. See Migration Guide below.
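
To illustrate why the stateless design does not depend on layer order, here is a minimal CPU-only sketch. ToyKVCache, toyStep, seqLen() and at() are hypothetical names used only for illustration; this is not the tiny_llm KVCacheManager, and it omits seq_id, streams, and device memory.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Toy stand-in for a KV cache (assumption: one float per token per layer).
class ToyKVCache {
public:
    ToyKVCache(int num_layers, int max_len)
        : max_len_(max_len),
          slots_(num_layers, std::vector<float>(max_len, 0.0f)) {}

    // Stateless write: every layer writes at current_len_, and the
    // length itself is never modified here.
    void appendKV(int layer_idx, const float* values, int num_tokens) {
        assert(layer_idx >= 0 && layer_idx < static_cast<int>(slots_.size()));
        if (current_len_ + num_tokens > max_len_) {
            throw std::out_of_range("KV cache full");
        }
        for (int t = 0; t < num_tokens; ++t) {
            slots_[layer_idx][current_len_ + t] = values[t];
        }
    }

    // Called once per step, after all layers have written.
    void advanceSeqLen(int num_tokens) {
        if (current_len_ + num_tokens > max_len_) {
            throw std::out_of_range("KV cache full");
        }
        current_len_ += num_tokens;
    }

    int seqLen() const { return current_len_; }
    float at(int layer_idx, int pos) const { return slots_[layer_idx][pos]; }

private:
    int max_len_;
    int current_len_ = 0;
    std::vector<std::vector<float>> slots_;
};

// Runs one single-token step, visiting layers in the given order.
inline void toyStep(ToyKVCache& cache, const std::vector<int>& layer_order,
                    float token_value) {
    for (int layer : layer_order) {
        float v = token_value + static_cast<float>(layer);
        cache.appendKV(layer, &v, 1);
    }
    cache.advanceSeqLen(1);  // advance exactly once, after all layers
}
```

In this sketch, calling toyStep with the layer order {0, 1, 2} or {2, 0, 1} leaves the cache in the same state, because no layer's write position depends on which layer ran first.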


🟢 Added

CI/CD Improvements

  • GitHub Actions workflow for continuous integration
  • Automated clang-format checking
  • Format validation on pull requests

CMake Modernization

| Feature | Before | After |
| --- | --- | --- |
| Version | 1.0.0 | 2.0.0 |
| CUDA Arch | Manual | Auto-detect (native or fallback) |
| Includes | Global | target_include_directories() |
| Target Export | None | tiny_llm::tiny_llm alias |
| Warnings | Basic | -Wall -Wextra (GCC/Clang) |
| IDE Support | Manual | compile_commands.json generation |

New usage:

find_package(tiny_llm)
target_link_libraries(myapp tiny_llm::tiny_llm)

🟡 Changed

Build System

  • Minimum CMake version: 3.18
  • CUDA architecture auto-detection with fallback to common arches
  • Improved compiler warning flags

Test Coverage

  • Added property-based tests with RapidCheck
  • Expanded kernel test coverage
  • Integration tests for end-to-end workflows

🔄 Migration Guide

Updating KVCache Usage

If you’re using KVCacheManager directly in your code:

// v1.x code
void generateStep() {
    for (int i = 0; i < num_layers; i++) {
        // appendKV managed length internally
        kv_cache.appendKV(seq_id, i, k_data[i], v_data[i], 1);
    }
}

// v2.0+ code
void generateStep() {
    for (int i = 0; i < num_layers; i++) {
        // appendKV is stateless
        kv_cache.appendKV(seq_id, i, k_data[i], v_data[i], 1, stream_);
    }
    // Must explicitly advance
    kv_cache.advanceSeqLen(seq_id, 1);
}

The InferenceEngine class handles this automatically for standard use cases.
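
For code that drives the cache directly, one way to avoid forgetting the explicit advance is to wrap the layer loop in a small helper. The sketch below is an assumption, not part of tiny_llm: writeLayersThenAdvance and MockCache are hypothetical names, and the mock only records calls so the helper can be exercised without a GPU.

```cpp
#include <cassert>

// Hypothetical helper: runs the per-layer work, then advances the
// sequence length exactly once. It only requires that Cache provides
// advanceSeqLen(seq_id, num_tokens), as in the v2.0+ example above.
template <typename Cache, typename LayerFn>
void writeLayersThenAdvance(Cache& cache, int seq_id, int num_layers,
                            int num_tokens, LayerFn layer_fn) {
    assert(num_layers >= 0 && num_tokens >= 0);
    for (int i = 0; i < num_layers; ++i) {
        layer_fn(i);  // expected to call cache.appendKV(seq_id, i, ...)
    }
    // Reached only if every layer completed without throwing, so a
    // failed step does not move the write position.
    cache.advanceSeqLen(seq_id, num_tokens);
}

// Minimal mock used to exercise the helper; it only counts calls.
struct MockCache {
    int appended = 0;
    int advance_calls = 0;
    int advanced_by = 0;

    void appendKV(int /*seq_id*/, int /*layer_idx*/) { ++appended; }

    void advanceSeqLen(int /*seq_id*/, int num_tokens) {
        ++advance_calls;
        advanced_by += num_tokens;
    }
};
```

With a real cache, the callback would forward to kv_cache.appendKV(seq_id, i, k_data[i], v_data[i], 1, stream_), and the helper would take the place of the loop plus the trailing advanceSeqLen call.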


📊 Performance

| Metric | v1.0.0 | v2.0.0 | Change |
| --- | --- | --- | --- |
| Build time | 45s | 38s | -15% |
| Test runtime | 2.1s | 1.8s | -14% |
| Memory (KV Cache) | Same | Same | Correctness only |
| Throughput | Same | Same | No impact |

✅ Verification

$ ctest --output-on-failure
100% tests passed, 0 tests failed

$ clang-format --dry-run --Werror src/*.cpp tests/*.cpp kernels/*.cu
$ # No output = no format issues

📚 Documentation

New documentation structure:

  • Multi-language support (EN/ZH)
  • API reference with examples
  • Architecture documentation
  • Contribution guidelines
