Architecture

Language: English 简体中文

System Architecture Overview
Core Components
GEMM Kernel Architecture
Data Flow Design
Memory Management
Error Handling
Extension Guide

System Architecture Overview

Layered Architecture

Mini-Inference Engine adopts a clear layered architecture design:

┌─────────────────────────────────────────────────────────────────────────┐
│                           Application Layer                              │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐         │
│  │ Benchmark  │  │ MNIST Demo │  │   Tests    │  │  User App  │         │
│  └────────────┘  └────────────┘  └────────────┘  └────────────┘         │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                           Engine Layer                                   │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                      InferenceEngine                               │  │
│  │  - Weight Loading    - Forward Pass    - Layer Management         │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐         │
│  │   Tensor   │  │ AutoTuner  │  │  Profiler  │  │   Config   │         │
│  └────────────┘  └────────────┘  └────────────┘  └────────────┘         │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                           Kernel Layer                                   │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐       │
│  │   Naive  │ │   Tiled  │ │Coalesced │ │  Double  │ │Optimized │       │
│  │ MatMul   │ │   GEMM   │ │   GEMM   │ │  Buffer  │ │   GEMM   │       │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘       │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐                     │
│  │   Fused  │ │Vectorized│ │   Half   │ │ Batched  │                     │
│  │   GEMM   │ │   GEMM   │ │   GEMM   │ │   GEMM   │                     │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘                     │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                         Infrastructure Layer                             │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐         │
│  │ MemoryPool │  │StreamMgr   │  │   Logger   │  │Quantization│         │
│  └────────────┘  └────────────┘  └────────────┘  └────────────┘         │
│  ┌────────────┐  ┌────────────┐                                          │
│  │DeviceMemory│  │  GpuTimer  │                                          │
│  └────────────┘  └────────────┘                                          │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                            CUDA Runtime                                  │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐                         │
│  │   cudart   │  │   cuBLAS   │  │   Streams  │                         │
│  └────────────┘  └────────────┘  └────────────┘                         │
└─────────────────────────────────────────────────────────────────────────┘

Layer Responsibilities

Layer	Core Responsibilities	Key Components
Application	User interaction, performance testing, demos	Benchmark, MNIST Demo, Tests
Engine	Inference execution, tensor management, optimization	InferenceEngine, Tensor, AutoTuner
Kernel	Core computation, GEMM optimization	Various GEMM Kernels
Infrastructure	Resource management, utilities	MemoryPool, Logger, Config
CUDA Runtime	GPU low-level interface	CUDA Driver, cuBLAS

Core Components

1. InferenceEngine

The core class of the inference engine, responsible for managing neural network forward propagation.

Class Design

class InferenceEngine {
public:
    // Lifecycle management
    void init(int device_id = 0);
    void cleanup();
    
    // Weight management
    bool load_weights(const std::string& path);
    bool save_weights(const std::string& path) const;
    void add_layer(int in_features, int out_features, bool has_bias,
                   const float* weights_data, const float* bias_data = nullptr);
    
    // Inference execution
    void forward(const float* input, float* output, int batch_size);
    void forward_with_timing(const float* input, float* output, int batch_size,
                             std::vector<float>& layer_times_ms);
    
    // Query interfaces
    size_t num_layers() const { return layers_.size(); }
    int input_dim() const;
    int output_dim() const;
    bool is_initialized() const { return initialized_; }

private:
    std::vector<LayerWeights> layers_;
    cublasHandle_t cublas_handle_;
    cudaStream_t stream_;
    DeviceMemory temp_buffers_[2];  // Double buffer design
    bool initialized_ = false;
};

Design Decisions

Decision	Rationale
RAII resource management	Prevent resource leaks, ensure exception safety
Disable copy constructor	GPU resources should not be shared
Support move semantics	Allow ownership transfer
Double buffer design	Avoid read/write conflicts, support alternating computation
Fused kernel	Reduce memory bandwidth consumption

Weight File Format

Weight File Layout:
═════════════════════════════════════════════════════════════
Offset  Size        Description
═════════════════════════════════════════════════════════════
0       32 bytes    Header
  ├─ 0   4 bytes    Magic number (0x4D494E49 = "MINI")
  ├─ 4   4 bytes    Version (1)
  ├─ 8   4 bytes    Number of layers
  └─ 12  20 bytes   Reserved
═════════════════════════════════════════════════════════════
32      Variable    Layer Data (repeated for each layer)
  ├─ 0   4 bytes    Layer type (0 = Linear)
  ├─ 4   4 bytes    Input dimension (in_features)
  ├─ 8   4 bytes    Output dimension (out_features)
  ├─ 12  4 bytes    Has bias flag (0 or 1)
  ├─ 16  in×out×4   Weight data (row-major)
  └─ ... out×4      Bias data (if has_bias)
═════════════════════════════════════════════════════════════

2. Tensor

N-dimensional tensor class providing GPU storage and basic operations.

Class Design

class Tensor {
public:
    // Construction and lifecycle
    Tensor() = default;
    explicit Tensor(const std::vector<int>& shape);
    Tensor(const std::vector<int>& shape, const float* data);
    
    // Disable copy, enable move
    Tensor(const Tensor&) = delete;
    Tensor& operator=(const Tensor&) = delete;
    Tensor(Tensor&&) noexcept = default;
    Tensor& operator=(Tensor&&) noexcept = default;
    
    // Shape operations
    void reshape(const std::vector<int>& new_shape);
    Tensor clone() const;
    
    // Data access
    float* data();
    const float* data() const;
    std::vector<float> to_host() const;
    void copy_from_host(const float* host_data);
    void copy_to_host(float* host_data) const;
    
    // Fill operations
    void fill(float value);
    void zero();
    
    // Shape queries
    const std::vector<int>& shape() const { return shape_; }
    const std::vector<int>& strides() const { return strides_; }
    size_t size() const { return size_; }
    int ndim() const { return static_cast<int>(shape_.size()); }
    int dim(int i) const { return shape_[i]; }
    
    // View conversion
    MatrixDesc as_matrix() const;

private:
    std::vector<int> shape_;
    std::vector<int> strides_;
    size_t size_ = 0;
    PooledMemory data_;  // Use memory pool
};

Memory Layout

Row-Major Storage:
═════════════════════════════════════════════════════════════
Shape: [2, 3, 4]
Strides: [12, 4, 1]

Index calculation:
element(i, j, k) = i * 12 + j * 4 + k * 1

Memory layout:
┌──────────────────────────────────────────────────────┐
│ [0,0,0] [0,0,1] [0,0,2] [0,0,3]                    │  Block 0,0
│ [0,1,0] [0,1,1] [0,1,2] [0,1,3]                    │  Block 0,1
│ [0,2,0] [0,2,1] [0,2,2] [0,2,3]                    │  Block 0,2
│ [1,0,0] [1,0,1] [1,0,2] [1,0,3]                    │  Block 1,0
│ [1,1,0] [1,1,1] [1,1,2] [1,1,3]                    │  Block 1,1
│ [1,2,0] [1,2,1] [1,2,2] [1,2,3]                    │  Block 1,2
└──────────────────────────────────────────────────────┘
═════════════════════════════════════════════════════════════

Design Decisions

Decision	Rationale
Use memory pool (PooledMemory)	Reduce cudaMalloc overhead
Disable copy constructor	Avoid accidental large memory copies
Row-major storage	Compatible with C/C++ and CUDA
Lazy stride calculation	Compute only when needed

3. MemoryPool

GPU memory pool that reduces cudaMalloc calls through caching.

Class Design

class MemoryPool {
public:
    static MemoryPool& instance();  // Singleton pattern
    
    // Allocation and deallocation
    void* allocate(size_t size);
    void deallocate(void* ptr);
    
    // Cache management
    void clear_cache();    // Release cache, keep active allocations
    void clear_all();      // Clear cache and reset statistics
    void release_all();    // Release all memory (only for shutdown)
    
    // Status queries
    bool owns(void* ptr) const;
    bool is_cached(void* ptr) const;
    size_t live_allocated_bytes() const;
    size_t cached_block_count() const;
    size_t active_block_count() const;
    
    // Statistics
    struct Stats {
        size_t total_allocated;   // Total allocated bytes
        size_t cached_size;       // Cache size
        size_t cache_hits;        // Cache hit count
        size_t cache_misses;      // Cache miss count
    };
    Stats get_stats() const;
    void print_stats() const;

private:
    MemoryPool() = default;
    mutable std::mutex mutex_;
    std::map<void*, size_t> allocated_blocks_;
    std::multimap<size_t, void*> free_blocks_;
    static constexpr size_t ALIGNMENT = 256;
};

Allocation Strategy

Memory Allocation Flow:
═════════════════════════════════════════════════════════════
                    ┌─────────────────┐
                    │ allocate(size)  │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │ Align to 256B   │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
              ┌────▶│ Check free_blocks│
              │     └────────┬────────┘
              │              │
              │     ┌────────▼────────┐
              │     │ Found suitable? │
              │     └────────┬────────┘
              │              │
         ┌────┴────┐    ┌────▼────┐
         │   Yes   │    │   No    │
         └────┬────┘    └────┬────┘
              │              │
    ┌─────────▼────┐  ┌──────▼──────┐
    │ Return cached│  │ cudaMalloc  │
    │ cache_hits++ │  │ cache_misses++│
    └──────────────┘  └─────────────┘
═════════════════════════════════════════════════════════════

Performance Optimizations

Optimization	Effect
256-byte alignment	Enable vectorized loads
Best-fit strategy	Reduce memory fragmentation
Multi-map storage	O(log n) lookup
Thread-safety (mutex)	Multi-thread safe

4. StreamManager

CUDA stream manager supporting concurrent execution.

Class Design

class StreamManager {
public:
    static StreamManager& instance();
    
    // Initialization
    void init(int num_streams = 4);
    
    // Get stream
    cudaStream_t get_stream();           // Round-robin
    cudaStream_t get_stream(int index);  // By index
    
    // Synchronization
    void sync_all();
    void sync(int index);
    
    // Cleanup
    void cleanup();
    
    int num_streams() const { return static_cast<int>(streams_.size()); }

private:
    std::vector<cudaStream_t> streams_;
    int current_stream_ = 0;
    mutable std::mutex mutex_;
};

Concurrent Execution Patterns

Single Stream (Serial):
═════════════════════════════════════════════════════════════
Stream 0: ├─GEMM 0─┼─GEMM 1─┼─GEMM 2─┼─GEMM 3─┤
Time:     0        T        2T       3T       4T

Multi-Stream (Parallel):
═════════════════════════════════════════════════════════════
Stream 0: ├─GEMM 0─┼─────────┼─GEMM 2─┼─────────┤
Stream 1: ├─────────┼─GEMM 1─┼─────────┼─GEMM 3─┤
Time:     0        T        2T       3T       4T

Overlapped Execution:
═════════════════════════════════════════════════════════════
Stream 0: ├─GEMM 0─┼─GEMM 2─┤
Stream 1: ├─GEMM 1─┼─GEMM 3─┤
Time:     0        T        2T

Speedup: 2x
═════════════════════════════════════════════════════════════

GEMM Kernel Architecture

Optimization Hierarchy

                    ┌─────────────────┐
                    │   Vectorized    │ Level 7
                    │   (float4)      │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │     Fused       │ Level 6
                    │ (+ Bias + ReLU) │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │    Register     │ Level 5
                    │    Blocking     │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │ Double Buffer   │ Level 4
                    │   (2x shared)   │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │   Coalesced     │ Level 3
                    │ Memory Access   │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │     Tiled       │ Level 2
                    │ (Shared Memory) │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │     Naive       │ Level 1
                    │  (Global Mem)   │
                    └─────────────────┘

Template Parameter Design

template<
    int BM,    // Block tile M dimension
    int BN,    // Block tile N dimension
    int BK,    // Block tile K dimension (per iteration)
    int TM,    // Thread tile M dimension
    int TN     // Thread tile N dimension
>
__global__ void optimized_gemm(const float* A, const float* B, float* C,
                                int M, int N, int K);

Parameter Constraints

Constraint Validation:
═════════════════════════════════════════════════════════════
1. Thread count constraint:
   threads = (BM / TM) × (BN / TN) ≤ 1024
   
2. Shared memory constraint:
   shared = (BM × BK + BK × BN) × sizeof(float) ≤ 48KB
   
3. Register constraint:
   regs ≈ TM × TN + TM + TN + overhead ≤ 255
═════════════════════════════════════════════════════════════

Recommended Configurations:
═════════════════════════════════════════════════════════════
Config       BM   BN   BK   TM   TN   Threads   Shared
─────────────────────────────────────────────────────────
Small         64   64    8    4    4     256     4KB
Medium       128  128    8    8    8     256     8KB
Large        128  256   16    8    8     512    24KB
═════════════════════════════════════════════════════════════

Data Flow Design

Inference Data Flow

Inference Execution Flow:
═════════════════════════════════════════════════════════════

Host                    Device
─────                   ──────
   │
   │  cudaMemcpyH2D
   ├──────────────────────▶  Input Buffer
   │                              │
   │                              ▼
   │                         ┌─────────────┐
   │                         │  Layer 0    │
   │                         │ GEMM+ReLU   │
   │                         └──────┬──────┘
   │                                │
   │                                ▼
   │                         ┌─────────────┐
   │                         │  Layer 1    │
   │                         │ GEMM+ReLU   │
   │                         └──────┬──────┘
   │                                │
   │                                ▼
   │                         ┌─────────────┐
   │                         │  Layer N    │
   │                         │ GEMM        │
   │                         └──────┬──────┘
   │                                │
   │  cudaMemcpyD2H                 ▼
   │◀──────────────────────  Output Buffer
   │
   ▼
Result
═════════════════════════════════════════════════════════════

Weight Loading Flow

Weight File Loading:
═════════════════════════════════════════════════════════════
File                    Host                    Device
─────                   ─────                   ──────
   │
   │  fread
   ├──────────────────▶  Header
   │
   │                     │ Validate magic & version
   │                     ▼
   │  fread
   ├──────────────────▶  Layer Meta
   │                     │ Validate dimensions
   │                     ▼
   │  fread              │
   ├──────────────────▶  Weight Data
   │                     │ cudaMemcpyH2D
   │                     ├──────────────────────▶  d_weights
   │                     │
   │  fread (if bias)    │
   ├──────────────────▶  Bias Data
                         │ cudaMemcpyH2D
                         ├──────────────────────▶  d_bias
                         │
                         ▼
                   Ready for Inference
═════════════════════════════════════════════════════════════

Memory Management

Memory Strategy Comparison

Strategy	Implementation	Use Case	Performance
Direct allocation	`cudaMalloc`/`cudaFree`	Simple programs	Low
RAII wrapper	`DeviceMemory`	General applications	Medium
Memory pool	`MemoryPool`	High-frequency allocation	High
Pre-allocation	Pre-allocate large blocks	Fixed workloads	Highest

Recommended Usage

// Temporary memory: Use DeviceMemory RAII
{
    DeviceMemory temp(M * N * sizeof(float));
    // Use temp...
}  // Auto-release

// Frequent allocation: Use MemoryPool
auto& pool = MemoryPool::instance();
void* ptr = pool.allocate(size);
// Use ptr...
pool.deallocate(ptr);  // Return to cache

// Tensor data: Use PooledMemory
Tensor t({batch, channels, height, width});  // Internal PooledMemory

Error Handling

Exception Hierarchy

std::exception
    │
    ├── CudaException          // CUDA runtime errors
    │   └── cudaError_t error_
    │
    ├── std::runtime_error     // cuBLAS errors, etc.
    │
    └── std::invalid_argument  // Parameter validation errors

CUDA Error Checking Macros

#define CUDA_CHECK(call)                                      \
    do {                                                      \
        cudaError_t err = call;                               \
        if (err != cudaSuccess) {                             \
            throw CudaException(err, __FILE__, __LINE__);     \
        }                                                     \
    } while (0)

#define CUBLAS_CHECK(call)                                    \
    do {                                                      \
        cublasStatus_t status = call;                         \
        if (status != CUBLAS_STATUS_SUCCESS) {                \
            throw std::runtime_error("cuBLAS error");         \
        }                                                     \
    } while (0)

Usage Example

try {
    CUDA_CHECK(cudaMalloc(&ptr, size));
    launch_kernel<<<grid, block>>>(...);
    CUDA_CHECK(cudaGetLastError());
} catch (const CudaException& e) {
    std::cerr << "CUDA Error: " << e.what() << std::endl;
    std::cerr << "Error Code: " << cudaGetErrorName(e.error()) << std::endl;
} catch (const std::exception& e) {
    std::cerr << "Error: " << e.what() << std::endl;
}

Extension Guide

Adding a New GEMM Kernel

Declare kernel (include/kernels.cuh):

void launch_my_gemm(const float* A, const float* B, float* C,
                    int M, int N, int K, cudaStream_t stream = 0);

Implement kernel (src/my_gemm.cu):

__global__ void my_gemm_kernel(...) { ... }

void launch_my_gemm(...) {
    // Configure grid/block
    // Launch kernel
    // Error checking
}

Add enum (include/common.h):

enum class GemmKernelType {
    // ... existing types ...
    MY_KERNEL
};

Register in AutoTuner (include/autotuner.h):

case GemmKernelType::MY_KERNEL:
    launch_my_gemm(A, B, C, M, N, K, stream);
    break;

Adding a New Activation Function

Extend FusionConfig (include/common.h):

struct FusionConfig {
    bool add_bias = false;
    bool apply_relu = false;
    bool apply_gelu = false;       // New
    bool apply_sigmoid = false;    // New
};

Extend fused_gemm.cu:

template<..., bool APPLY_GELU, bool APPLY_SIGMOID>
__global__ void fused_gemm_bias_activation(...) {
    // ...
    if constexpr (APPLY_GELU) {
        val = gelu(val);
    }
    if constexpr (APPLY_SIGMOID) {
        val = sigmoid(val);
    }
    // ...
}

*Last Updated: 2025-04-16

Document Version: v1.1.0*

Table of Contents

System Architecture Overview

Layered Architecture

Layer Responsibilities

Core Components

1. InferenceEngine

Class Design

Design Decisions

Weight File Format

2. Tensor

Class Design

Memory Layout

Design Decisions

3. MemoryPool

Class Design

Allocation Strategy

Performance Optimizations

4. StreamManager

Class Design

Concurrent Execution Patterns

GEMM Kernel Architecture

Optimization Hierarchy

Template Parameter Design

Parameter Constraints

Data Flow Design

Inference Data Flow

Weight Loading Flow

Memory Management

Memory Strategy Comparison

Recommended Usage

Error Handling

Exception Hierarchy

CUDA Error Checking Macros

Usage Example

Extension Guide

Adding a New GEMM Kernel

Adding a New Activation Function

Related Links