Language: English 简体中文
Table of Contents
- System Architecture Overview
- Core Components
- GEMM Kernel Architecture
- Data Flow Design
- Memory Management
- Error Handling
- Extension Guide
System Architecture Overview
Layered Architecture
Mini-Inference Engine adopts a clear layered architecture design:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
┌─────────────────────────────────────────────────────────────────────────┐
│ Application Layer │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Benchmark │ │ MNIST Demo │ │ Tests │ │ User App │ │
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Engine Layer │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ InferenceEngine │ │
│ │ - Weight Loading - Forward Pass - Layer Management │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Tensor │ │ AutoTuner │ │ Profiler │ │ Config │ │
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Kernel Layer │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Naive │ │ Tiled │ │Coalesced │ │ Double │ │Optimized │ │
│ │ MatMul │ │ GEMM │ │ GEMM │ │ Buffer │ │ GEMM │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Fused │ │Vectorized│ │ Half │ │ Batched │ │
│ │ GEMM │ │ GEMM │ │ GEMM │ │ GEMM │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Infrastructure Layer │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ MemoryPool │ │StreamMgr │ │ Logger │ │Quantization│ │
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │
│ ┌────────────┐ ┌────────────┐ │
│ │DeviceMemory│ │ GpuTimer │ │
│ └────────────┘ └────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ CUDA Runtime │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ cudart │ │ cuBLAS │ │ Streams │ │
│ └────────────┘ └────────────┘ └────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Layer Responsibilities
| Layer | Core Responsibilities | Key Components |
|---|---|---|
| Application | User interaction, performance testing, demos | Benchmark, MNIST Demo, Tests |
| Engine | Inference execution, tensor management, optimization | InferenceEngine, Tensor, AutoTuner |
| Kernel | Core computation, GEMM optimization | Various GEMM Kernels |
| Infrastructure | Resource management, utilities | MemoryPool, Logger, Config |
| CUDA Runtime | GPU low-level interface | CUDA Driver, cuBLAS |
Core Components
1. InferenceEngine
The core class of the inference engine, responsible for managing neural network forward propagation.
Class Design
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
class InferenceEngine {
public:
// Lifecycle management
void init(int device_id = 0);
void cleanup();
// Weight management
bool load_weights(const std::string& path);
bool save_weights(const std::string& path) const;
void add_layer(int in_features, int out_features, bool has_bias,
const float* weights_data, const float* bias_data = nullptr);
// Inference execution
void forward(const float* input, float* output, int batch_size);
void forward_with_timing(const float* input, float* output, int batch_size,
std::vector<float>& layer_times_ms);
// Query interfaces
size_t num_layers() const { return layers_.size(); }
int input_dim() const;
int output_dim() const;
bool is_initialized() const { return initialized_; }
private:
std::vector<LayerWeights> layers_;
cublasHandle_t cublas_handle_;
cudaStream_t stream_;
DeviceMemory temp_buffers_[2]; // Double buffer design
bool initialized_ = false;
};
Design Decisions
| Decision | Rationale |
|---|---|
| RAII resource management | Prevent resource leaks, ensure exception safety |
| Disable copy constructor | GPU resources should not be shared |
| Support move semantics | Allow ownership transfer |
| Double buffer design | Avoid read/write conflicts, support alternating computation |
| Fused kernel | Reduce memory bandwidth consumption |
Weight File Format
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Weight File Layout:
═════════════════════════════════════════════════════════════
Offset Size Description
═════════════════════════════════════════════════════════════
0 32 bytes Header
├─ 0 4 bytes Magic number (0x4D494E49 = "MINI")
├─ 4 4 bytes Version (1)
├─ 8 4 bytes Number of layers
└─ 12 20 bytes Reserved
═════════════════════════════════════════════════════════════
32 Variable Layer Data (repeated for each layer)
├─ 0 4 bytes Layer type (0 = Linear)
├─ 4 4 bytes Input dimension (in_features)
├─ 8 4 bytes Output dimension (out_features)
├─ 12 4 bytes Has bias flag (0 or 1)
├─ 16 in×out×4 Weight data (row-major)
└─ ... out×4 Bias data (if has_bias)
═════════════════════════════════════════════════════════════
2. Tensor
N-dimensional tensor class providing GPU storage and basic operations.
Class Design
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
class Tensor {
public:
// Construction and lifecycle
Tensor() = default;
explicit Tensor(const std::vector<int>& shape);
Tensor(const std::vector<int>& shape, const float* data);
// Disable copy, enable move
Tensor(const Tensor&) = delete;
Tensor& operator=(const Tensor&) = delete;
Tensor(Tensor&&) noexcept = default;
Tensor& operator=(Tensor&&) noexcept = default;
// Shape operations
void reshape(const std::vector<int>& new_shape);
Tensor clone() const;
// Data access
float* data();
const float* data() const;
std::vector<float> to_host() const;
void copy_from_host(const float* host_data);
void copy_to_host(float* host_data) const;
// Fill operations
void fill(float value);
void zero();
// Shape queries
const std::vector<int>& shape() const { return shape_; }
const std::vector<int>& strides() const { return strides_; }
size_t size() const { return size_; }
int ndim() const { return static_cast<int>(shape_.size()); }
int dim(int i) const { return shape_[i]; }
// View conversion
MatrixDesc as_matrix() const;
private:
std::vector<int> shape_;
std::vector<int> strides_;
size_t size_ = 0;
PooledMemory data_; // Use memory pool
};
Memory Layout
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Row-Major Storage:
═════════════════════════════════════════════════════════════
Shape: [2, 3, 4]
Strides: [12, 4, 1]
Index calculation:
element(i, j, k) = i * 12 + j * 4 + k * 1
Memory layout:
┌──────────────────────────────────────────────────────┐
│ [0,0,0] [0,0,1] [0,0,2] [0,0,3] │ Block 0,0
│ [0,1,0] [0,1,1] [0,1,2] [0,1,3] │ Block 0,1
│ [0,2,0] [0,2,1] [0,2,2] [0,2,3] │ Block 0,2
│ [1,0,0] [1,0,1] [1,0,2] [1,0,3] │ Block 1,0
│ [1,1,0] [1,1,1] [1,1,2] [1,1,3] │ Block 1,1
│ [1,2,0] [1,2,1] [1,2,2] [1,2,3] │ Block 1,2
└──────────────────────────────────────────────────────┘
═════════════════════════════════════════════════════════════
Design Decisions
| Decision | Rationale |
|---|---|
| Use memory pool (PooledMemory) | Reduce cudaMalloc overhead |
| Disable copy constructor | Avoid accidental large memory copies |
| Row-major storage | Compatible with C/C++ and CUDA |
| Lazy stride calculation | Compute only when needed |
3. MemoryPool
GPU memory pool that reduces cudaMalloc calls through caching.
Class Design
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
class MemoryPool {
public:
static MemoryPool& instance(); // Singleton pattern
// Allocation and deallocation
void* allocate(size_t size);
void deallocate(void* ptr);
// Cache management
void clear_cache(); // Release cache, keep active allocations
void clear_all(); // Clear cache and reset statistics
void release_all(); // Release all memory (only for shutdown)
// Status queries
bool owns(void* ptr) const;
bool is_cached(void* ptr) const;
size_t live_allocated_bytes() const;
size_t cached_block_count() const;
size_t active_block_count() const;
// Statistics
struct Stats {
size_t total_allocated; // Total allocated bytes
size_t cached_size; // Cache size
size_t cache_hits; // Cache hit count
size_t cache_misses; // Cache miss count
};
Stats get_stats() const;
void print_stats() const;
private:
MemoryPool() = default;
mutable std::mutex mutex_;
std::map<void*, size_t> allocated_blocks_;
std::multimap<size_t, void*> free_blocks_;
static constexpr size_t ALIGNMENT = 256;
};
Allocation Strategy
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
Memory Allocation Flow:
═════════════════════════════════════════════════════════════
┌─────────────────┐
│ allocate(size) │
└────────┬────────┘
│
┌────────▼────────┐
│ Align to 256B │
└────────┬────────┘
│
┌────────▼────────┐
┌────▶│ Check free_blocks│
│ └────────┬────────┘
│ │
│ ┌────────▼────────┐
│ │ Found suitable? │
│ └────────┬────────┘
│ │
┌────┴────┐ ┌────▼────┐
│ Yes │ │ No │
└────┬────┘ └────┬────┘
│ │
┌─────────▼────┐ ┌──────▼──────┐
│ Return cached│ │ cudaMalloc │
│ cache_hits++ │ │ cache_misses++│
└──────────────┘ └─────────────┘
═════════════════════════════════════════════════════════════
Performance Optimizations
| Optimization | Effect |
|---|---|
| 256-byte alignment | Enable vectorized loads |
| Best-fit strategy | Reduce memory fragmentation |
| Multi-map storage | O(log n) lookup |
| Thread-safety (mutex) | Multi-thread safe |
4. StreamManager
CUDA stream manager supporting concurrent execution.
Class Design
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
class StreamManager {
public:
static StreamManager& instance();
// Initialization
void init(int num_streams = 4);
// Get stream
cudaStream_t get_stream(); // Round-robin
cudaStream_t get_stream(int index); // By index
// Synchronization
void sync_all();
void sync(int index);
// Cleanup
void cleanup();
int num_streams() const { return static_cast<int>(streams_.size()); }
private:
std::vector<cudaStream_t> streams_;
int current_stream_ = 0;
mutable std::mutex mutex_;
};
Concurrent Execution Patterns
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Single Stream (Serial):
═════════════════════════════════════════════════════════════
Stream 0: ├─GEMM 0─┼─GEMM 1─┼─GEMM 2─┼─GEMM 3─┤
Time: 0 T 2T 3T 4T
Multi-Stream (Parallel):
═════════════════════════════════════════════════════════════
Stream 0: ├─GEMM 0─┼─────────┼─GEMM 2─┼─────────┤
Stream 1: ├─────────┼─GEMM 1─┼─────────┼─GEMM 3─┤
Time: 0 T 2T 3T 4T
Overlapped Execution:
═════════════════════════════════════════════════════════════
Stream 0: ├─GEMM 0─┼─GEMM 2─┤
Stream 1: ├─GEMM 1─┼─GEMM 3─┤
Time: 0 T 2T
Speedup: 2x
═════════════════════════════════════════════════════════════
GEMM Kernel Architecture
Optimization Hierarchy
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
┌─────────────────┐
│ Vectorized │ Level 7
│ (float4) │
└────────┬────────┘
│
┌────────▼────────┐
│ Fused │ Level 6
│ (+ Bias + ReLU) │
└────────┬────────┘
│
┌────────▼────────┐
│ Register │ Level 5
│ Blocking │
└────────┬────────┘
│
┌────────▼────────┐
│ Double Buffer │ Level 4
│ (2x shared) │
└────────┬────────┘
│
┌────────▼────────┐
│ Coalesced │ Level 3
│ Memory Access │
└────────┬────────┘
│
┌────────▼────────┐
│ Tiled │ Level 2
│ (Shared Memory) │
└────────┬────────┘
│
┌────────▼────────┐
│ Naive │ Level 1
│ (Global Mem) │
└─────────────────┘
Template Parameter Design
1
2
3
4
5
6
7
8
9
template<
int BM, // Block tile M dimension
int BN, // Block tile N dimension
int BK, // Block tile K dimension (per iteration)
int TM, // Thread tile M dimension
int TN // Thread tile N dimension
>
__global__ void optimized_gemm(const float* A, const float* B, float* C,
int M, int N, int K);
Parameter Constraints
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Constraint Validation:
═════════════════════════════════════════════════════════════
1. Thread count constraint:
threads = (BM / TM) × (BN / TN) ≤ 1024
2. Shared memory constraint:
shared = (BM × BK + BK × BN) × sizeof(float) ≤ 48KB
3. Register constraint:
regs ≈ TM × TN + TM + TN + overhead ≤ 255
═════════════════════════════════════════════════════════════
Recommended Configurations:
═════════════════════════════════════════════════════════════
Config BM BN BK TM TN Threads Shared
─────────────────────────────────────────────────────────
Small 64 64 8 4 4 256 4KB
Medium 128 128 8 8 8 256 8KB
Large 128 256 16 8 8 512 24KB
═════════════════════════════════════════════════════════════
Data Flow Design
Inference Data Flow
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
Inference Execution Flow:
═════════════════════════════════════════════════════════════
Host Device
───── ──────
│
│ cudaMemcpyH2D
├──────────────────────▶ Input Buffer
│ │
│ ▼
│ ┌─────────────┐
│ │ Layer 0 │
│ │ GEMM+ReLU │
│ └──────┬──────┘
│ │
│ ▼
│ ┌─────────────┐
│ │ Layer 1 │
│ │ GEMM+ReLU │
│ └──────┬──────┘
│ │
│ ▼
│ ┌─────────────┐
│ │ Layer N │
│ │ GEMM │
│ └──────┬──────┘
│ │
│ cudaMemcpyD2H ▼
│◀────────────────────── Output Buffer
│
▼
Result
═════════════════════════════════════════════════════════════
Weight Loading Flow
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
Weight File Loading:
═════════════════════════════════════════════════════════════
File Host Device
───── ───── ──────
│
│ fread
├──────────────────▶ Header
│
│ │ Validate magic & version
│ ▼
│ fread
├──────────────────▶ Layer Meta
│ │ Validate dimensions
│ ▼
│ fread │
├──────────────────▶ Weight Data
│ │ cudaMemcpyH2D
│ ├──────────────────────▶ d_weights
│ │
│ fread (if bias) │
├──────────────────▶ Bias Data
│ cudaMemcpyH2D
├──────────────────────▶ d_bias
│
▼
Ready for Inference
═════════════════════════════════════════════════════════════
Memory Management
Memory Strategy Comparison
| Strategy | Implementation | Use Case | Performance |
|---|---|---|---|
| Direct allocation | cudaMalloc/cudaFree |
Simple programs | Low |
| RAII wrapper | DeviceMemory |
General applications | Medium |
| Memory pool | MemoryPool |
High-frequency allocation | High |
| Pre-allocation | Pre-allocate large blocks | Fixed workloads | Highest |
Recommended Usage
1
2
3
4
5
6
7
8
9
10
11
12
13
14
// Temporary memory: Use DeviceMemory RAII
{
DeviceMemory temp(M * N * sizeof(float));
// Use temp...
} // Auto-release
// Frequent allocation: Use MemoryPool
auto& pool = MemoryPool::instance();
void* ptr = pool.allocate(size);
// Use ptr...
pool.deallocate(ptr); // Return to cache
// Tensor data: Use PooledMemory
Tensor t({batch, channels, height, width}); // Internal PooledMemory
Error Handling
Exception Hierarchy
1
2
3
4
5
6
7
8
std::exception
│
├── CudaException // CUDA runtime errors
│ └── cudaError_t error_
│
├── std::runtime_error // cuBLAS errors, etc.
│
└── std::invalid_argument // Parameter validation errors
CUDA Error Checking Macros
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
#define CUDA_CHECK(call) \
do { \
cudaError_t err = call; \
if (err != cudaSuccess) { \
throw CudaException(err, __FILE__, __LINE__); \
} \
} while (0)
#define CUBLAS_CHECK(call) \
do { \
cublasStatus_t status = call; \
if (status != CUBLAS_STATUS_SUCCESS) { \
throw std::runtime_error("cuBLAS error"); \
} \
} while (0)
Usage Example
1
2
3
4
5
6
7
8
9
10
try {
CUDA_CHECK(cudaMalloc(&ptr, size));
launch_kernel<<<grid, block>>>(...);
CUDA_CHECK(cudaGetLastError());
} catch (const CudaException& e) {
std::cerr << "CUDA Error: " << e.what() << std::endl;
std::cerr << "Error Code: " << cudaGetErrorName(e.error()) << std::endl;
} catch (const std::exception& e) {
std::cerr << "Error: " << e.what() << std::endl;
}
Extension Guide
Adding a New GEMM Kernel
- Declare kernel (
include/kernels.cuh):
1
2
void launch_my_gemm(const float* A, const float* B, float* C,
int M, int N, int K, cudaStream_t stream = 0);
- Implement kernel (
src/my_gemm.cu):
1
2
3
4
5
6
7
__global__ void my_gemm_kernel(...) { ... }
void launch_my_gemm(...) {
// Configure grid/block
// Launch kernel
// Error checking
}
- Add enum (
include/common.h):
1
2
3
4
enum class GemmKernelType {
// ... existing types ...
MY_KERNEL
};
- Register in AutoTuner (
include/autotuner.h):
1
2
3
case GemmKernelType::MY_KERNEL:
launch_my_gemm(A, B, C, M, N, K, stream);
break;
Adding a New Activation Function
- Extend FusionConfig (
include/common.h):
1
2
3
4
5
6
struct FusionConfig {
bool add_bias = false;
bool apply_relu = false;
bool apply_gelu = false; // New
bool apply_sigmoid = false; // New
};
- Extend fused_gemm.cu:
1
2
3
4
5
6
7
8
9
10
11
template<..., bool APPLY_GELU, bool APPLY_SIGMOID>
__global__ void fused_gemm_bias_activation(...) {
// ...
if constexpr (APPLY_GELU) {
val = gelu(val);
}
if constexpr (APPLY_SIGMOID) {
val = sigmoid(val);
}
// ...
}
Related Links
| *Last Updated: 2025-04-16 | Document Version: v1.1.0* |