
System Requirements

Hardware Requirements

Component         Minimum                              Recommended
GPU               NVIDIA GPU, Compute Capability 7.0+  RTX 30 series or higher
VRAM              4 GB                                 8 GB+
System Memory     8 GB                                 16 GB+
Operating System  Linux / Windows / macOS              Ubuntu 22.04 LTS

Software Requirements

Dependency    Minimum  Recommended
CUDA Toolkit  11.0     12.0+
CMake         3.18     3.25+
GCC           9.0      11.0+
Python        3.8+     3.10+

Verify Environment

# Check CUDA version
nvcc --version

# Check GPU info
nvidia-smi

# Check CMake version
cmake --version

# Check GCC version
gcc --version

Quick Start

1. Clone the Repository

git clone https://github.com/LessUp/mini-inference-engine.git
cd mini-inference-engine

2. Build the Project

This project uses CMake Presets to simplify the build process:

# Debug build (includes tests, enables assertions)
cmake --preset default
cmake --build --preset default

# Release build (optimized performance)
cmake --preset release
cmake --build --preset release

3. Verify Installation

# Run unit tests
ctest --preset default

# Run performance benchmarks
./build-release/benchmark

# Run MNIST demo (optional)
./build-release/mnist_demo

Build the Project

CMake Presets (Recommended)

Preset   Purpose                  Configuration
default  Development & debugging  Debug mode, enables tests
release  Performance testing      Release mode, O3 optimization
ci       Continuous integration   Strict warnings, test coverage
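These presets live in a CMakePresets.json at the repository root. A minimal sketch of what such a file can look like (the preset names and binary directories follow this guide's commands, but the cache variables are illustrative, not the project's exact file):

```json
{
  "version": 3,
  "configurePresets": [
    {
      "name": "default",
      "binaryDir": "${sourceDir}/build",
      "cacheVariables": { "CMAKE_BUILD_TYPE": "Debug", "BUILD_TESTS": "ON" }
    },
    {
      "name": "release",
      "binaryDir": "${sourceDir}/build-release",
      "cacheVariables": { "CMAKE_BUILD_TYPE": "Release" }
    }
  ],
  "buildPresets": [
    { "name": "default", "configurePreset": "default" },
    { "name": "release", "configurePreset": "release" }
  ]
}
```

The `binaryDir` entries explain why tests are run from `./build/` while the benchmark and demo binaries are run from `./build-release/`.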
# List available presets
cmake --list-presets

# Use specific preset
cmake --preset <preset-name>
cmake --build --preset <preset-name>

Manual Build

mkdir build && cd build

# Configure
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON

# Compile (using all available cores)
make -j$(nproc)

# Run tests
ctest --output-on-failure

Build Options

Option                    Description       Default
BUILD_TESTS               Build unit tests  ON
BUILD_BENCHMARKS          Build benchmarks  ON
BUILD_MNIST_DEMO          Build MNIST demo  ON
CMAKE_CUDA_ARCHITECTURES  GPU architecture  Native architecture
cmake .. -DBUILD_TESTS=ON -DBUILD_BENCHMARKS=ON

Run Tests

Run All Tests

ctest --preset default

Run Specific Tests

# Run GEMM-related tests
./build/tests --gtest_filter="GemmTest*"

# Run Tensor tests
./build/tests --gtest_filter="TensorTest*"

# Run specific test case
./build/tests --gtest_filter="GemmTest.NaiveMatMulCorrectness"

Test Coverage

# Generate coverage report (requires gcov/lcov)
cmake --preset ci
cmake --build --preset ci
ctest --preset ci

Your First Program

Basic GEMM Example

Create file first_gemm.cpp:

#include "common.h"
#include "kernels.cuh"
#include <iostream>
#include <vector>

int main() {
    // Set GPU device
    CUDA_CHECK(cudaSetDevice(0));
    
    // Define matrix dimensions
    const int M = 1024, N = 1024, K = 1024;
    
    // Allocate GPU memory
    DeviceMemory d_A(M * K * sizeof(float));
    DeviceMemory d_B(K * N * sizeof(float));
    DeviceMemory d_C(M * N * sizeof(float));
    
    // Create and initialize host data
    std::vector<float> h_A(M * K), h_B(K * N);
    random_init(h_A.data(), h_A.size());
    random_init(h_B.data(), h_B.size());
    
    // Copy to GPU
    d_A.copy_from_host(h_A.data(), M * K * sizeof(float));
    d_B.copy_from_host(h_B.data(), K * N * sizeof(float));
    
    // Execute optimized GEMM
    launch_optimized_gemm(d_A.get(), d_B.get(), d_C.get(), M, N, K);
    
    // Synchronize
    CUDA_CHECK(cudaDeviceSynchronize());
    
    // Get results
    std::vector<float> h_C(M * N);
    d_C.copy_to_host(h_C.data(), M * N * sizeof(float));
    
    std::cout << "✓ GEMM completed! C[0] = " << h_C[0] << std::endl;
    
    return 0;
}

Compile and Run

# Add the file to CMakeLists.txt as an executable target,
# or compile manually (-x cu makes nvcc treat the .cpp as CUDA source,
# which is needed because kernels.cuh contains device code):
nvcc -x cu -o first_gemm first_gemm.cpp \
    -I./include -L./build -lmini_inference \
    -lcudart -lcublas -std=c++17

./first_gemm

Verify Correctness

#include "common.h"

// Add inside main(), after h_C has been copied back from the device:
std::vector<float> h_C_ref(M * N);
cpu_matmul(h_A.data(), h_B.data(), h_C_ref.data(), M, N, K);

float max_error = compare_matrices(h_C.data(), h_C_ref.data(), M * N);
std::cout << "Max error: " << max_error << std::endl;
// Should be < 1e-4 for single-precision GEMM at this size

MNIST Demo

The MNIST demo shows how to use the inference engine for handwritten digit recognition.

Prepare Weights File

# Use Python script to export weights
cd scripts
python export_mnist_weights.py --output ../weights/mnist_model.bin

Run Demo

./build-release/mnist_demo --weights weights/mnist_model.bin

Expected Output

Loading weights from: weights/mnist_model.bin
Network Info:
  Layers: 3
  Input dim: 784
  Output dim: 10

Running inference on batch of 32 samples...
Sample 0: Predicted digit 7, Confidence 92.3%
Sample 1: Predicted digit 2, Confidence 88.7%
...
Average inference time: 0.45 ms

Performance Benchmarking

Run Benchmarks

# Run full benchmark
./build-release/benchmark

# Specify matrix size
./build-release/benchmark --m 2048 --n 2048 --k 2048

# Specify kernel type
./build-release/benchmark --kernel optimized

Expected Performance (RTX 3080, 1024×1024×1024)

Kernel         Time (ms)  GFLOPS  vs cuBLAS
cuBLAS         0.31       6920    100%
Naive          3.10       694     10%
Tiled          1.55       1388    20%
Coalesced      1.03       2088    30%
Double Buffer  0.78       2768    40%
Optimized      0.44       4870    70%
Fused          0.38       5630    81%
Vectorized     0.35       6130    85%

Troubleshooting

Compile Error “Unsupported gpu architecture”

Solution: Modify GPU architecture setting in CMakeLists.txt:

# Check your GPU's compute capability
nvidia-smi --query-gpu=compute_cap --format=csv

# Then set the matching architecture in CMakeLists.txt (pick one):
set(CMAKE_CUDA_ARCHITECTURES 86)  # RTX 30 series
set(CMAKE_CUDA_ARCHITECTURES 89)  # RTX 40 series

Runtime Error “CUDA out of memory”

Solution:

// 1. Reduce the matrix size
const int M = 512, N = 512, K = 512;

// 2. Clear the memory pool cache
MemoryPool::instance().clear_cache();

// 3. Check GPU memory usage from a shell:
//    nvidia-smi

Lower Than Expected Performance

Checklist:

  • GPU power state is P0:

    nvidia-smi -q -d PERFORMANCE | grep "Performance State"

  • No other programs are using the GPU
  • The project was built in Release mode
  • Matrix sizes are powers of 2 (aligned access)

Test Failures

# Run a single test to see the detailed failure output
./build/tests --gtest_filter="GemmTest.NaiveMatMulCorrectness"

# Use the CUDA memory checker
# (cuda-memcheck was removed in CUDA 12; use compute-sanitizer there)
cuda-memcheck ./build/tests --gtest_filter="GemmTest*"

Next Steps

Congratulations! You’ve completed the quick start. Next you can:

  1. 📖 Read Architecture Design to understand system principles
  2. ⚡ Study GEMM Optimization Guide to master optimization techniques
  3. 🔧 Check API Reference for complete interface documentation
  4. 📊 Read Performance Tuning Guide for advanced optimization


*Last Updated: 2025-04-16 Document Version: v1.1.0*


MIT License | A learning project for the CUDA community