
System Requirements

Hardware Requirements

Component         Minimum                              Recommended
GPU               NVIDIA GPU, Compute Capability 7.0+  RTX 30 series or higher
VRAM              4 GB                                 8 GB+
System Memory     8 GB                                 16 GB+
Operating System  Linux / Windows / macOS              Ubuntu 22.04 LTS

Software Requirements

Dependency    Minimum  Recommended
CUDA Toolkit  11.0     12.0+
CMake         3.18     3.25+
GCC           9.0      11.0+
Python        3.8+     3.10+

Verify Environment

# Check CUDA version
nvcc --version

# Check GPU info
nvidia-smi

# Check CMake version
cmake --version

# Check GCC version
gcc --version

Quick Start

1. Clone the Repository

git clone https://github.com/LessUp/mini-inference-engine.git
cd mini-inference-engine

2. Build the Project

This project uses CMake Presets to simplify the build process:

# Debug build (includes tests, enables assertions)
cmake --preset default
cmake --build --preset default

# Release build (optimized performance)
cmake --preset release
cmake --build --preset release

3. Verify Installation

# Run unit tests
ctest --preset default

# Run performance benchmarks
./build-release/benchmark

# Run MNIST demo (optional)
./build-release/mnist_demo

Build the Project

CMake Presets (Recommended)

Preset   Purpose                  Configuration
default  Development & debugging  Debug mode, enables tests
release  Performance testing      Release mode, O3 optimization
ci       Continuous integration   Strict warnings, test coverage
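These presets live in a CMakePresets.json at the repository root. A minimal sketch of what such a file can look like (the preset names and binary directories follow this guide's commands, but the cache variables are illustrative, not the project's exact file):

```json
{
  "version": 3,
  "configurePresets": [
    {
      "name": "default",
      "binaryDir": "${sourceDir}/build",
      "cacheVariables": { "CMAKE_BUILD_TYPE": "Debug", "BUILD_TESTS": "ON" }
    },
    {
      "name": "release",
      "binaryDir": "${sourceDir}/build-release",
      "cacheVariables": { "CMAKE_BUILD_TYPE": "Release" }
    }
  ],
  "buildPresets": [
    { "name": "default", "configurePreset": "default" },
    { "name": "release", "configurePreset": "release" }
  ]
}
```

The `binaryDir` entries explain why tests are run from `./build/` while the benchmark and demo binaries are run from `./build-release/`.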
# List available presets
cmake --list-presets

# Use specific preset
cmake --preset <preset-name>
cmake --build --preset <preset-name>

Manual Build

mkdir build && cd build

# Configure
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON

# Compile (using all available cores)
make -j$(nproc)

# Run tests
ctest --output-on-failure

Build Options

Option                    Description       Default
BUILD_TESTS               Build unit tests  ON
BUILD_BENCHMARKS          Build benchmarks  ON
BUILD_MNIST_DEMO          Build MNIST demo  ON
CMAKE_CUDA_ARCHITECTURES  GPU architecture  Native architecture
cmake .. -DBUILD_TESTS=ON -DBUILD_BENCHMARKS=ON

Run Tests

Run All Tests

ctest --preset default

Run Specific Tests

# Run GEMM-related tests
./build/tests --gtest_filter="GemmTest*"

# Run Tensor tests
./build/tests --gtest_filter="TensorTest*"

# Run specific test case
./build/tests --gtest_filter="GemmTest.NaiveMatMulCorrectness"

Test Coverage

# Generate coverage report (requires gcov/lcov)
cmake --preset ci
cmake --build --preset ci
ctest --preset ci

Your First Program

Basic GEMM Example

Create file first_gemm.cpp:

#include "common.h"
#include "kernels.cuh"
#include <iostream>
#include <vector>

int main() {
    // Set GPU device
    CUDA_CHECK(cudaSetDevice(0));
    
    // Define matrix dimensions
    const int M = 1024, N = 1024, K = 1024;
    
    // Allocate GPU memory
    DeviceMemory d_A(M * K * sizeof(float));
    DeviceMemory d_B(K * N * sizeof(float));
    DeviceMemory d_C(M * N * sizeof(float));
    
    // Create and initialize host data
    std::vector<float> h_A(M * K), h_B(K * N);
    random_init(h_A.data(), h_A.size());
    random_init(h_B.data(), h_B.size());
    
    // Copy to GPU
    d_A.copy_from_host(h_A.data(), M * K * sizeof(float));
    d_B.copy_from_host(h_B.data(), K * N * sizeof(float));
    
    // Execute optimized GEMM
    launch_optimized_gemm(d_A.get(), d_B.get(), d_C.get(), M, N, K);
    
    // Synchronize
    CUDA_CHECK(cudaDeviceSynchronize());
    
    // Get results
    std::vector<float> h_C(M * N);
    d_C.copy_to_host(h_C.data(), M * N * sizeof(float));
    
    std::cout << "✓ GEMM completed! C[0] = " << h_C[0] << std::endl;
    
    return 0;
}

Compile and Run

# Add the file to CMakeLists.txt as an executable target,
# or compile manually (-x cu makes nvcc treat the .cpp as CUDA source,
# which is needed because kernels.cuh contains device code):
nvcc -x cu -o first_gemm first_gemm.cpp \
    -I./include -L./build -lmini_inference \
    -lcudart -lcublas -std=c++17

./first_gemm

Verify Correctness

#include "common.h"

// Add inside main(), after h_C has been copied back from the device:
std::vector<float> h_C_ref(M * N);
cpu_matmul(h_A.data(), h_B.data(), h_C_ref.data(), M, N, K);

float max_error = compare_matrices(h_C.data(), h_C_ref.data(), M * N);
std::cout << "Max error: " << max_error << std::endl;
// Should be < 1e-4 for single-precision GEMM at this size

MNIST Demo

The MNIST demo shows how to use the inference engine for handwritten digit recognition.

Prepare Weights File

# Use Python script to export weights
cd scripts
python export_mnist_weights.py --output ../weights/mnist_model.bin

Run Demo

./build-release/mnist_demo --weights weights/mnist_model.bin

Expected Output

Loading weights from: weights/mnist_model.bin
Network Info:
  Layers: 3
  Input dim: 784
  Output dim: 10

Running inference on batch of 32 samples...
Sample 0: Predicted digit 7, Confidence 92.3%
Sample 1: Predicted digit 2, Confidence 88.7%
...
Average inference time: 0.45 ms

Performance Benchmarking

Run Benchmarks

# Run full benchmark
./build-release/benchmark

# Specify matrix size
./build-release/benchmark --m 2048 --n 2048 --k 2048

# Specify kernel type
./build-release/benchmark --kernel optimized

Expected Performance (RTX 3080, 1024×1024×1024)

Kernel         Time (ms)  GFLOPS  vs cuBLAS
cuBLAS         0.31       6920    100%
Naive          3.10       694     10%
Tiled          1.55       1388    20%
Coalesced      1.03       2088    30%
Double Buffer  0.78       2768    40%
Optimized      0.44       4870    70%
Fused          0.38       5630    81%
Vectorized     0.35       6130    85%

Troubleshooting

Compile Error “Unsupported gpu architecture”

Solution: Modify GPU architecture setting in CMakeLists.txt:

# Check your GPU's compute capability
nvidia-smi --query-gpu=compute_cap --format=csv

# Then set the matching architecture in CMakeLists.txt (pick one):
set(CMAKE_CUDA_ARCHITECTURES 86)  # RTX 30 series
set(CMAKE_CUDA_ARCHITECTURES 89)  # RTX 40 series

Runtime Error “CUDA out of memory”

Solution:

// 1. Reduce the matrix size
const int M = 512, N = 512, K = 512;

// 2. Clear the memory pool cache
MemoryPool::instance().clear_cache();

// 3. Check GPU memory usage from a shell:
//    nvidia-smi

Lower Than Expected Performance

Checklist:

  • GPU power state is P0:

    nvidia-smi -q -d PERFORMANCE | grep "Performance State"

  • No other programs are using the GPU
  • The project was built in Release mode
  • Matrix sizes are powers of 2 (aligned access)

Test Failures

# Run a single test to see the detailed failure output
./build/tests --gtest_filter="GemmTest.NaiveMatMulCorrectness"

# Use the CUDA memory checker
# (cuda-memcheck was removed in CUDA 12; use compute-sanitizer there)
cuda-memcheck ./build/tests --gtest_filter="GemmTest*"

Next Steps

Congratulations! You’ve completed the quick start. Next you can:

  1. 📖 Read Architecture Design to understand system principles
  2. ⚡ Study GEMM Optimization Guide to master optimization techniques
  3. 🔧 Check API Reference for complete interface documentation
  4. 📊 Read Performance Tuning Guide for advanced optimization


*Last Updated: 2025-04-16 Document Version: v1.1.0*


MIT License | A learning project for the CUDA community