System Architecture Overview

Layered Architecture

Mini-Inference Engine adopts a cleanly layered architecture:

┌─────────────────────────────────────────────────────────────────────────┐
│                           Application Layer                              │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐         │
│  │ Benchmark  │  │ MNIST Demo │  │   Tests    │  │  User App  │         │
│  └────────────┘  └────────────┘  └────────────┘  └────────────┘         │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                           Engine Layer                                   │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                      InferenceEngine                               │  │
│  │  - Weight Loading    - Forward Pass    - Layer Management         │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐         │
│  │   Tensor   │  │ AutoTuner  │  │  Profiler  │  │   Config   │         │
│  └────────────┘  └────────────┘  └────────────┘  └────────────┘         │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                           Kernel Layer                                   │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐       │
│  │   Naive  │ │   Tiled  │ │Coalesced │ │  Double  │ │Optimized │       │
│  │ MatMul   │ │   GEMM   │ │   GEMM   │ │  Buffer  │ │   GEMM   │       │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘       │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐                     │
│  │   Fused  │ │Vectorized│ │   Half   │ │ Batched  │                     │
│  │   GEMM   │ │   GEMM   │ │   GEMM   │ │   GEMM   │                     │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘                     │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                         Infrastructure Layer                             │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐         │
│  │ MemoryPool │  │StreamMgr   │  │   Logger   │  │Quantization│         │
│  └────────────┘  └────────────┘  └────────────┘  └────────────┘         │
│  ┌────────────┐  ┌────────────┐                                          │
│  │DeviceMemory│  │  GpuTimer  │                                          │
│  └────────────┘  └────────────┘                                          │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                            CUDA Runtime                                  │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐                         │
│  │   cudart   │  │   cuBLAS   │  │   Streams  │                         │
│  └────────────┘  └────────────┘  └────────────┘                         │
└─────────────────────────────────────────────────────────────────────────┘

Layer Responsibilities

Layer           Core Responsibilities                                  Key Components
──────────────────────────────────────────────────────────────────────────────────────
Application     User interaction, performance testing, demos           Benchmark, MNIST Demo, Tests
Engine          Inference execution, tensor management, optimization   InferenceEngine, Tensor, AutoTuner
Kernel          Core computation, GEMM optimization                    Various GEMM kernels
Infrastructure  Resource management, utilities                         MemoryPool, Logger, Config
CUDA Runtime    GPU low-level interface                                CUDA Driver, cuBLAS

Core Components

1. InferenceEngine

The core class of the inference engine, responsible for managing neural network forward propagation.

Class Design

class InferenceEngine {
public:
    // Lifecycle management
    void init(int device_id = 0);
    void cleanup();
    
    // Weight management
    bool load_weights(const std::string& path);
    bool save_weights(const std::string& path) const;
    void add_layer(int in_features, int out_features, bool has_bias,
                   const float* weights_data, const float* bias_data = nullptr);
    
    // Inference execution
    void forward(const float* input, float* output, int batch_size);
    void forward_with_timing(const float* input, float* output, int batch_size,
                             std::vector<float>& layer_times_ms);
    
    // Query interfaces
    size_t num_layers() const { return layers_.size(); }
    int input_dim() const;
    int output_dim() const;
    bool is_initialized() const { return initialized_; }

private:
    std::vector<LayerWeights> layers_;
    cublasHandle_t cublas_handle_;
    cudaStream_t stream_;
    DeviceMemory temp_buffers_[2];  // Double buffer design
    bool initialized_ = false;
};

Design Decisions

Decision                    Rationale
──────────────────────────────────────────────────────────────────────
RAII resource management    Prevents resource leaks; ensures exception safety
Deleted copy constructor    GPU resources should not be implicitly shared
Move semantics              Allow ownership transfer without copying
Double buffer design        Avoids read/write conflicts; supports alternating computation
Fused kernels               Reduce memory bandwidth consumption

Weight File Format

Weight File Layout:
═════════════════════════════════════════════════════════════
Offset  Size        Description
═════════════════════════════════════════════════════════════
0       32 bytes    Header
  ├─ 0   4 bytes    Magic number (0x4D494E49 = "MINI")
  ├─ 4   4 bytes    Version (1)
  ├─ 8   4 bytes    Number of layers
  └─ 12  20 bytes   Reserved
═════════════════════════════════════════════════════════════
32      Variable    Layer Data (repeated for each layer)
  ├─ 0   4 bytes    Layer type (0 = Linear)
  ├─ 4   4 bytes    Input dimension (in_features)
  ├─ 8   4 bytes    Output dimension (out_features)
  ├─ 12  4 bytes    Has bias flag (0 or 1)
  ├─ 16  in×out×4   Weight data (row-major)
  └─ ... out×4      Bias data (if has_bias)
═════════════════════════════════════════════════════════════
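The fixed-size header can be mirrored by a packed struct on the host side. A minimal sketch, assuming a little-endian host; `WeightFileHeader` and `validate_header` are illustrative names, not necessarily the engine's actual identifiers:

```cpp
#include <cstdint>

// On-disk header layout from the table above (32 bytes total).
#pragma pack(push, 1)
struct WeightFileHeader {
    uint32_t magic;        // 0x4D494E49 ("MINI")
    uint32_t version;      // 1
    uint32_t num_layers;
    uint8_t  reserved[20];
};
#pragma pack(pop)
static_assert(sizeof(WeightFileHeader) == 32, "header must be exactly 32 bytes");

constexpr uint32_t kMagic   = 0x4D494E49;
constexpr uint32_t kVersion = 1;

// Validate a header blob read from the start of a weight file.
inline bool validate_header(const WeightFileHeader& h) {
    return h.magic == kMagic && h.version == kVersion && h.num_layers > 0;
}
```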

2. Tensor

N-dimensional tensor class providing GPU storage and basic operations.

Class Design

class Tensor {
public:
    // Construction and lifecycle
    Tensor() = default;
    explicit Tensor(const std::vector<int>& shape);
    Tensor(const std::vector<int>& shape, const float* data);
    
    // Disable copy, enable move
    Tensor(const Tensor&) = delete;
    Tensor& operator=(const Tensor&) = delete;
    Tensor(Tensor&&) noexcept = default;
    Tensor& operator=(Tensor&&) noexcept = default;
    
    // Shape operations
    void reshape(const std::vector<int>& new_shape);
    Tensor clone() const;
    
    // Data access
    float* data();
    const float* data() const;
    std::vector<float> to_host() const;
    void copy_from_host(const float* host_data);
    void copy_to_host(float* host_data) const;
    
    // Fill operations
    void fill(float value);
    void zero();
    
    // Shape queries
    const std::vector<int>& shape() const { return shape_; }
    const std::vector<int>& strides() const { return strides_; }
    size_t size() const { return size_; }
    int ndim() const { return static_cast<int>(shape_.size()); }
    int dim(int i) const { return shape_[i]; }
    
    // View conversion
    MatrixDesc as_matrix() const;

private:
    std::vector<int> shape_;
    std::vector<int> strides_;
    size_t size_ = 0;
    PooledMemory data_;  // Use memory pool
};

Memory Layout

Row-Major Storage:
═════════════════════════════════════════════════════════════
Shape: [2, 3, 4]
Strides: [12, 4, 1]

Index calculation:
element(i, j, k) = i * 12 + j * 4 + k * 1

Memory layout:
┌──────────────────────────────────────────────────────┐
│ [0,0,0] [0,0,1] [0,0,2] [0,0,3]                    │  Block 0,0
│ [0,1,0] [0,1,1] [0,1,2] [0,1,3]                    │  Block 0,1
│ [0,2,0] [0,2,1] [0,2,2] [0,2,3]                    │  Block 0,2
│ [1,0,0] [1,0,1] [1,0,2] [1,0,3]                    │  Block 1,0
│ [1,1,0] [1,1,1] [1,1,2] [1,1,3]                    │  Block 1,1
│ [1,2,0] [1,2,1] [1,2,2] [1,2,3]                    │  Block 1,2
└──────────────────────────────────────────────────────┘
═════════════════════════════════════════════════════════════
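The stride and offset rules above translate directly into code. A host-side sketch (helper names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Row-major strides: stride[i] is the product of all dimensions after i.
std::vector<int> row_major_strides(const std::vector<int>& shape) {
    std::vector<int> strides(shape.size(), 1);
    for (int i = static_cast<int>(shape.size()) - 2; i >= 0; --i)
        strides[i] = strides[i + 1] * shape[i + 1];
    return strides;
}

// Flat offset of a multi-dimensional index, e.g. element(i, j, k) above.
size_t flat_index(const std::vector<int>& idx, const std::vector<int>& strides) {
    size_t off = 0;
    for (size_t d = 0; d < idx.size(); ++d)
        off += static_cast<size_t>(idx[d]) * strides[d];
    return off;
}
```

For shape [2, 3, 4] this reproduces strides [12, 4, 1] from the diagram.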

Design Decisions

Decision                        Rationale
──────────────────────────────────────────────────────────────
Memory pool (PooledMemory)      Reduces cudaMalloc overhead
Deleted copy constructor        Avoids accidental large device copies
Row-major storage               Compatible with C/C++ and CUDA conventions
Lazy stride calculation         Strides computed only when needed

3. MemoryPool

GPU memory pool that reduces cudaMalloc calls through caching.

Class Design

class MemoryPool {
public:
    static MemoryPool& instance();  // Singleton pattern
    
    // Allocation and deallocation
    void* allocate(size_t size);
    void deallocate(void* ptr);
    
    // Cache management
    void clear_cache();    // Release cache, keep active allocations
    void clear_all();      // Clear cache and reset statistics
    void release_all();    // Release all memory (only for shutdown)
    
    // Status queries
    bool owns(void* ptr) const;
    bool is_cached(void* ptr) const;
    size_t live_allocated_bytes() const;
    size_t cached_block_count() const;
    size_t active_block_count() const;
    
    // Statistics
    struct Stats {
        size_t total_allocated;   // Total allocated bytes
        size_t cached_size;       // Cache size
        size_t cache_hits;        // Cache hit count
        size_t cache_misses;      // Cache miss count
    };
    Stats get_stats() const;
    void print_stats() const;

private:
    MemoryPool() = default;
    mutable std::mutex mutex_;
    std::map<void*, size_t> allocated_blocks_;
    std::multimap<size_t, void*> free_blocks_;
    static constexpr size_t ALIGNMENT = 256;
};

Allocation Strategy

Memory Allocation Flow:
═════════════════════════════════════════════════════════════
                    ┌─────────────────┐
                    │ allocate(size)  │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │ Align to 256B   │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
              ┌────▶│ Check free_blocks│
              │     └────────┬────────┘
              │              │
              │     ┌────────▼────────┐
              │     │ Found suitable? │
              │     └────────┬────────┘
              │              │
         ┌────┴────┐    ┌────▼────┐
         │   Yes   │    │   No    │
         └────┬────┘    └────┬────┘
              │              │
    ┌─────────▼────┐  ┌──────▼──────┐
    │ Return cached│  │ cudaMalloc  │
    │ cache_hits++ │  │ cache_misses++│
    └──────────────┘  └─────────────┘
═════════════════════════════════════════════════════════════
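The flow above can be sketched host-side: `std::malloc` stands in for `cudaMalloc`, and `PoolSketch` is an illustrative name, not the real class. The `multimap` keyed by size gives the best-fit lookup via `lower_bound`:

```cpp
#include <cstddef>
#include <cstdlib>
#include <map>

// Host-side sketch of the caching strategy; std::malloc stands in for cudaMalloc.
class PoolSketch {
public:
    static constexpr size_t ALIGNMENT = 256;

    void* allocate(size_t size) {
        size = (size + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;  // round up to 256 B
        auto it = free_blocks_.lower_bound(size);  // best fit: smallest block >= size
        void* ptr;
        if (it != free_blocks_.end()) {
            ptr = it->second;
            size = it->first;          // reuse the cached block at its original size
            free_blocks_.erase(it);
            ++cache_hits_;
        } else {
            ptr = std::malloc(size);   // cudaMalloc in the real pool
            ++cache_misses_;
        }
        allocated_[ptr] = size;
        return ptr;
    }

    void deallocate(void* ptr) {       // return to cache instead of freeing
        auto it = allocated_.find(ptr);
        free_blocks_.emplace(it->second, ptr);
        allocated_.erase(it);
    }

    size_t cache_hits() const { return cache_hits_; }
    size_t cache_misses() const { return cache_misses_; }

private:
    std::map<void*, size_t> allocated_;
    std::multimap<size_t, void*> free_blocks_;
    size_t cache_hits_ = 0, cache_misses_ = 0;
};
```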

Performance Optimizations

Optimization            Effect
──────────────────────────────────────────────
256-byte alignment      Enables vectorized loads
Best-fit strategy       Reduces memory fragmentation
Multimap free list      O(log n) block lookup
Mutex protection        Thread-safe allocation

4. StreamManager

CUDA stream manager supporting concurrent execution.

Class Design

class StreamManager {
public:
    static StreamManager& instance();
    
    // Initialization
    void init(int num_streams = 4);
    
    // Get stream
    cudaStream_t get_stream();           // Round-robin
    cudaStream_t get_stream(int index);  // By index
    
    // Synchronization
    void sync_all();
    void sync(int index);
    
    // Cleanup
    void cleanup();
    
    int num_streams() const { return static_cast<int>(streams_.size()); }

private:
    std::vector<cudaStream_t> streams_;
    int current_stream_ = 0;
    mutable std::mutex mutex_;
};

Concurrent Execution Patterns

Single Stream (Serial):
═════════════════════════════════════════════════════════════
Stream 0: ├─GEMM 0─┼─GEMM 1─┼─GEMM 2─┼─GEMM 3─┤
Time:     0        T        2T       3T       4T

Multi-Stream (Parallel):
═════════════════════════════════════════════════════════════
Stream 0: ├─GEMM 0─┼─────────┼─GEMM 2─┼─────────┤
Stream 1: ├─────────┼─GEMM 1─┼─────────┼─GEMM 3─┤
Time:     0        T        2T       3T       4T

Overlapped Execution:
═════════════════════════════════════════════════════════════
Stream 0: ├─GEMM 0─┼─GEMM 2─┤
Stream 1: ├─GEMM 1─┼─GEMM 3─┤
Time:     0        T        2T

Speedup: 2x
═════════════════════════════════════════════════════════════
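The round-robin hand-out behind `get_stream()` can be illustrated host-side; plain `int` slots stand in for `cudaStream_t` handles, and the class name is illustrative:

```cpp
#include <mutex>

// Round-robin hand-out of stream slots; int indices stand in for cudaStream_t.
class RoundRobin {
public:
    explicit RoundRobin(int n) : n_(n) {}

    int next() {
        std::lock_guard<std::mutex> lock(mutex_);  // thread-safe rotation
        int idx = current_;
        current_ = (current_ + 1) % n_;
        return idx;
    }

private:
    int n_;
    int current_ = 0;
    std::mutex mutex_;
};
```

Successive calls cycle 0, 1, 2, 3, 0, ... for four streams, so independent GEMMs naturally land on different streams.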

GEMM Kernel Architecture

Optimization Hierarchy

                    ┌─────────────────┐
                    │   Vectorized    │ Level 7
                    │   (float4)      │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │     Fused       │ Level 6
                    │ (+ Bias + ReLU) │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │    Register     │ Level 5
                    │    Blocking     │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │ Double Buffer   │ Level 4
                    │   (2x shared)   │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │   Coalesced     │ Level 3
                    │ Memory Access   │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │     Tiled       │ Level 2
                    │ (Shared Memory) │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │     Naive       │ Level 1
                    │  (Global Mem)   │
                    └─────────────────┘
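The core idea of Level 2 (tiling) applies on the CPU as well: process the matrices in small blocks so the working set stays in fast memory (cache here, shared memory on the GPU). A cache-blocked sketch, not the GPU kernel itself:

```cpp
#include <algorithm>
#include <vector>

// CPU illustration of Level 2 tiling: compute C = A * B in TILE x TILE blocks
// so each tile of A and B stays hot in cache while it is reused.
constexpr int TILE = 16;

void tiled_matmul(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int M, int N, int K) {
    C.assign(static_cast<size_t>(M) * N, 0.0f);
    for (int i0 = 0; i0 < M; i0 += TILE)
        for (int k0 = 0; k0 < K; k0 += TILE)
            for (int j0 = 0; j0 < N; j0 += TILE)
                // Accumulate the (i0, j0) tile of C from tiles of A and B.
                for (int i = i0; i < std::min(i0 + TILE, M); ++i)
                    for (int k = k0; k < std::min(k0 + TILE, K); ++k) {
                        float a = A[i * K + k];
                        for (int j = j0; j < std::min(j0 + TILE, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```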

Template Parameter Design

template<
    int BM,    // Block tile M dimension
    int BN,    // Block tile N dimension
    int BK,    // Block tile K dimension (per iteration)
    int TM,    // Thread tile M dimension
    int TN     // Thread tile N dimension
>
__global__ void optimized_gemm(const float* A, const float* B, float* C,
                                int M, int N, int K);

Parameter Constraints

Constraint Validation:
═════════════════════════════════════════════════════════════
1. Thread count constraint:
   threads = (BM / TM) × (BN / TN) ≤ 1024
   
2. Shared memory constraint:
   shared = (BM × BK + BK × BN) × sizeof(float) ≤ 48KB
   
3. Register constraint:
   regs ≈ TM × TN + TM + TN + overhead ≤ 255
═════════════════════════════════════════════════════════════

Recommended Configurations:
═════════════════════════════════════════════════════════════
Config       BM   BN   BK   TM   TN   Threads   Shared
─────────────────────────────────────────────────────────
Small         64   64    8    4    4     256     4KB
Medium       128  128    8    8    8     256     8KB
Large        128  256   16    8    8     512    24KB
═════════════════════════════════════════════════════════════
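The three constraints can be checked mechanically before instantiating a kernel. A host-side sketch (struct and function names are illustrative):

```cpp
#include <cstddef>

// One (BM, BN, BK, TM, TN) instantiation of the optimized GEMM template.
struct GemmConfig { int BM, BN, BK, TM, TN; };

int config_threads(const GemmConfig& c) { return (c.BM / c.TM) * (c.BN / c.TN); }

size_t config_shared(const GemmConfig& c) {
    return static_cast<size_t>(c.BM * c.BK + c.BK * c.BN) * sizeof(float);
}

// Check divisibility plus the thread-count and shared-memory limits above.
// (The register constraint is only an estimate, so it is not enforced here.)
bool config_valid(const GemmConfig& c) {
    bool divisible = c.BM % c.TM == 0 && c.BN % c.TN == 0;
    return divisible && config_threads(c) <= 1024
                     && config_shared(c) <= 48 * 1024;
}
```

Plugging in the recommended configurations reproduces the table: Small gives 256 threads and 4 KB shared, Large gives 512 threads and 24 KB shared.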

Data Flow Design

Inference Data Flow

Inference Execution Flow:
═════════════════════════════════════════════════════════════

Host                    Device
─────                   ──────
   │
   │  cudaMemcpyH2D
   ├──────────────────────▶  Input Buffer
   │                              │
   │                              ▼
   │                         ┌─────────────┐
   │                         │  Layer 0    │
   │                         │ GEMM+ReLU   │
   │                         └──────┬──────┘
   │                                │
   │                                ▼
   │                         ┌─────────────┐
   │                         │  Layer 1    │
   │                         │ GEMM+ReLU   │
   │                         └──────┬──────┘
   │                                │
   │                                ▼
   │                         ┌─────────────┐
   │                         │  Layer N    │
   │                         │ GEMM        │
   │                         └──────┬──────┘
   │                                │
   │  cudaMemcpyD2H                 ▼
   │◀──────────────────────  Output Buffer
   │
   ▼
Result
═════════════════════════════════════════════════════════════
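The layer-to-layer hand-off can be sketched on the CPU: layer i reads one temp buffer and writes the other, mirroring the engine's double-buffer design. `Layer` and `forward` here are simplified host-side stand-ins (batch size 1, no bias), with ReLU applied to every layer except the last, as in the diagram:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical host-side layer: y = W * x, W stored row-major [out x in].
struct Layer {
    int in, out;
    std::vector<float> W;
};

// Ping-pong forward pass: layer l reads buffers[l % 2], writes buffers[(l+1) % 2].
std::vector<float> forward(const std::vector<Layer>& layers, std::vector<float> x) {
    std::vector<float> buffers[2];
    buffers[0] = std::move(x);
    for (size_t l = 0; l < layers.size(); ++l) {
        const Layer& L = layers[l];
        std::vector<float>& in  = buffers[l % 2];
        std::vector<float>& out = buffers[(l + 1) % 2];
        out.assign(L.out, 0.0f);
        for (int o = 0; o < L.out; ++o) {
            for (int i = 0; i < L.in; ++i) out[o] += L.W[o * L.in + i] * in[i];
            if (l + 1 < layers.size())                     // ReLU on hidden layers only
                out[o] = std::max(out[o], 0.0f);
        }
    }
    return buffers[layers.size() % 2];
}
```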

Weight Loading Flow

Weight File Loading:
═════════════════════════════════════════════════════════════
File                    Host                    Device
─────                   ─────                   ──────
   │
   │  fread
   ├──────────────────▶  Header
   │
   │                     │ Validate magic & version
   │                     ▼
   │  fread
   ├──────────────────▶  Layer Meta
   │                     │ Validate dimensions
   │                     ▼
   │  fread              │
   ├──────────────────▶  Weight Data
   │                     │ cudaMemcpyH2D
   │                     ├──────────────────────▶  d_weights
   │                     │
   │  fread (if bias)    │
   ├──────────────────▶  Bias Data
                         │ cudaMemcpyH2D
                         ├──────────────────────▶  d_bias
                         │
                         ▼
                   Ready for Inference
═════════════════════════════════════════════════════════════
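The per-layer portion of the flow can be sketched as a host-side read/write pair. These are hypothetical helpers: the real loader would follow each `fread` with a `cudaMemcpyH2D` to `d_weights`/`d_bias`, which is omitted here so the sketch stays host-only:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Host-side mirror of one layer record from the weight file layout.
struct HostLayer {
    uint32_t type, in_features, out_features, has_bias;
    std::vector<float> weights, bias;
};

bool read_layer(std::FILE* f, HostLayer& L) {
    uint32_t meta[4];
    if (std::fread(meta, sizeof(uint32_t), 4, f) != 4) return false;   // Layer Meta
    L.type = meta[0]; L.in_features = meta[1];
    L.out_features = meta[2]; L.has_bias = meta[3];
    L.weights.resize(static_cast<size_t>(L.in_features) * L.out_features);
    if (std::fread(L.weights.data(), sizeof(float), L.weights.size(), f)
            != L.weights.size()) return false;                          // Weight Data
    if (L.has_bias) {                                                   // Bias Data
        L.bias.resize(L.out_features);
        if (std::fread(L.bias.data(), sizeof(float), L.bias.size(), f)
                != L.bias.size()) return false;
    }
    return true;
}

void write_layer(std::FILE* f, const HostLayer& L) {
    uint32_t meta[4] = {L.type, L.in_features, L.out_features, L.has_bias};
    std::fwrite(meta, sizeof(uint32_t), 4, f);
    std::fwrite(L.weights.data(), sizeof(float), L.weights.size(), f);
    if (L.has_bias) std::fwrite(L.bias.data(), sizeof(float), L.bias.size(), f);
}
```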

Memory Management

Memory Strategy Comparison

Strategy            Implementation                Use Case                    Performance
────────────────────────────────────────────────────────────────────────────────────────
Direct allocation   cudaMalloc/cudaFree           Simple programs             Low
RAII wrapper        DeviceMemory                  General applications        Medium
Memory pool         MemoryPool                    High-frequency allocation   High
Pre-allocation      Pre-allocated large blocks    Fixed workloads             Highest
// Temporary memory: Use DeviceMemory RAII
{
    DeviceMemory temp(M * N * sizeof(float));
    // Use temp...
}  // Auto-release

// Frequent allocation: Use MemoryPool
auto& pool = MemoryPool::instance();
void* ptr = pool.allocate(size);
// Use ptr...
pool.deallocate(ptr);  // Return to cache

// Tensor data: Use PooledMemory
Tensor t({batch, channels, height, width});  // Internal PooledMemory

Error Handling

Exception Hierarchy

std::exception
    │
    ├── CudaException          // CUDA runtime errors
    │   └── cudaError_t error_
    │
    ├── std::runtime_error     // cuBLAS errors, etc.
    │
    └── std::invalid_argument  // Parameter validation errors
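A minimal, host-compilable sketch of the `CudaException` node in this hierarchy. Here `int` stands in for `cudaError_t` so the example builds without the CUDA headers; the message format is illustrative:

```cpp
#include <exception>
#include <string>

// Sketch of CudaException: carries the error code plus file/line context.
class CudaException : public std::exception {
public:
    CudaException(int error, const char* file, int line)
        : error_(error),
          msg_(std::string("CUDA error ") + std::to_string(error) +
               " at " + file + ":" + std::to_string(line)) {}

    const char* what() const noexcept override { return msg_.c_str(); }
    int error() const { return error_; }

private:
    int error_;        // cudaError_t in the real class
    std::string msg_;
};
```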

CUDA Error Checking Macros

#define CUDA_CHECK(call)                                      \
    do {                                                      \
        cudaError_t err = call;                               \
        if (err != cudaSuccess) {                             \
            throw CudaException(err, __FILE__, __LINE__);     \
        }                                                     \
    } while (0)

#define CUBLAS_CHECK(call)                                    \
    do {                                                      \
        cublasStatus_t status = call;                         \
        if (status != CUBLAS_STATUS_SUCCESS) {                \
            throw std::runtime_error("cuBLAS error");         \
        }                                                     \
    } while (0)

Usage Example

try {
    CUDA_CHECK(cudaMalloc(&ptr, size));
    launch_kernel<<<grid, block>>>(...);
    CUDA_CHECK(cudaGetLastError());
} catch (const CudaException& e) {
    std::cerr << "CUDA Error: " << e.what() << std::endl;
    std::cerr << "Error Code: " << cudaGetErrorName(e.error()) << std::endl;
} catch (const std::exception& e) {
    std::cerr << "Error: " << e.what() << std::endl;
}

Extension Guide

Adding a New GEMM Kernel

  1. Declare kernel (include/kernels.cuh):
void launch_my_gemm(const float* A, const float* B, float* C,
                    int M, int N, int K, cudaStream_t stream = 0);
  2. Implement kernel (src/my_gemm.cu):
__global__ void my_gemm_kernel(...) { ... }

void launch_my_gemm(...) {
    // Configure grid/block
    // Launch kernel
    // Error checking
}
  3. Add enum (include/common.h):
enum class GemmKernelType {
    // ... existing types ...
    MY_KERNEL
};
  4. Register in AutoTuner (include/autotuner.h):
case GemmKernelType::MY_KERNEL:
    launch_my_gemm(A, B, C, M, N, K, stream);
    break;

Adding a New Activation Function

  1. Extend FusionConfig (include/common.h):
struct FusionConfig {
    bool add_bias = false;
    bool apply_relu = false;
    bool apply_gelu = false;       // New
    bool apply_sigmoid = false;    // New
};
  2. Extend fused_gemm.cu:
template<..., bool APPLY_GELU, bool APPLY_SIGMOID>
__global__ void fused_gemm_bias_activation(...) {
    // ...
    if constexpr (APPLY_GELU) {
        val = gelu(val);
    }
    if constexpr (APPLY_SIGMOID) {
        val = sigmoid(val);
    }
    // ...
}
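Host reference versions of the activations used above (the device versions would be equivalent `__device__ inline` functions; the GELU tanh approximation shown is one common choice, not necessarily the engine's):

```cpp
#include <cmath>

// Host reference implementations of the fused activation functions.
inline float sigmoid_ref(float x) { return 1.0f / (1.0f + std::exp(-x)); }

inline float gelu_ref(float x) {   // tanh approximation of GELU
    const float c = 0.7978845608f; // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(c * (x + 0.044715f * x * x * x)));
}

inline float relu_ref(float x) { return x > 0.0f ? x : 0.0f; }
```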


*Last Updated: 2025-04-16 Document Version: v1.1.0*


MIT License | A learning project for the CUDA community