架构设计 | CUDA GEMM Tutorial

Language: 简体中文 English

目录 (Table of Contents)

系统架构概览
核心组件
GEMM Kernel 架构
数据流设计
内存管理
错误处理机制
扩展指南

系统架构概览

分层架构

Mini-Inference Engine 采用清晰的分层架构设计：

┌─────────────────────────────────────────────────────────────────────────┐
│                           Application Layer                              │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐         │
│  │ Benchmark  │  │ MNIST Demo │  │   Tests    │  │  User App  │         │
│  └────────────┘  └────────────┘  └────────────┘  └────────────┘         │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                           Engine Layer                                   │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                      InferenceEngine                               │  │
│  │  - 权重加载    - 前向传播    - 层管理                              │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐         │
│  │   Tensor   │  │ AutoTuner  │  │  Profiler  │  │   Config   │         │
│  └────────────┘  └────────────┘  └────────────┘  └────────────┘         │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                           Kernel Layer                                   │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐       │
│  │   Naive  │ │   Tiled  │ │Coalesced │ │  Double  │ │Optimized │       │
│  │ MatMul   │ │   GEMM   │ │   GEMM   │ │  Buffer  │ │   GEMM   │       │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘       │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐                     │
│  │   Fused  │ │Vectorized│ │   Half   │ │ Batched  │                     │
│  │   GEMM   │ │   GEMM   │ │   GEMM   │ │   GEMM   │                     │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘                     │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                         Infrastructure Layer                             │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐         │
│  │ MemoryPool │  │StreamMgr   │  │   Logger   │  │Quantization│         │
│  └────────────┘  └────────────┘  └────────────┘  └────────────┘         │
│  ┌────────────┐  ┌────────────┐                                          │
│  │DeviceMemory│  │  GpuTimer  │                                          │
│  └────────────┘  └────────────┘                                          │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                            CUDA Runtime                                  │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐                         │
│  │   cudart   │  │   cuBLAS   │  │   Streams  │                         │
│  └────────────┘  └────────────┘  └────────────┘                         │
└─────────────────────────────────────────────────────────────────────────┘

各层职责

层级	核心职责	关键组件
Application	用户交互、性能测试、示例程序	Benchmark, MNIST Demo, Tests
Engine	推理执行、张量管理、性能优化	InferenceEngine, Tensor, AutoTuner
Kernel	核心计算、GEMM 优化实现	各类 GEMM Kernel
Infrastructure	资源管理、工具支持	MemoryPool, Logger, Config
CUDA Runtime	GPU 底层接口	CUDA Driver, cuBLAS

核心组件

1. InferenceEngine（推理引擎）

推理引擎是系统的核心类，负责管理神经网络的前向传播。

类设计

class InferenceEngine {
public:
    // 生命周期管理
    void init(int device_id = 0);
    void cleanup();
    
    // 权重管理
    bool load_weights(const std::string& path);
    bool save_weights(const std::string& path) const;
    void add_layer(int in_features, int out_features, bool has_bias,
                   const float* weights_data, const float* bias_data = nullptr);
    
    // 推理执行
    void forward(const float* input, float* output, int batch_size);
    void forward_with_timing(const float* input, float* output, int batch_size,
                             std::vector<float>& layer_times_ms);
    
    // 查询接口
    size_t num_layers() const { return layers_.size(); }
    int input_dim() const;
    int output_dim() const;
    bool is_initialized() const { return initialized_; }

private:
    std::vector<LayerWeights> layers_;
    cublasHandle_t cublas_handle_;
    cudaStream_t stream_;
    DeviceMemory temp_buffers_[2];  // 双缓冲设计
    bool initialized_ = false;
};

设计决策

决策	理由
RAII 资源管理	避免资源泄漏，确保异常安全
禁用拷贝构造	GPU 资源不应共享
支持移动语义	允许所有权转移
双缓冲设计	避免读写冲突，支持交替计算
融合 kernel	减少内存带宽消耗

权重文件格式

权重文件布局 (Weight File Layout):
═════════════════════════════════════════════════════════════
Offset  Size        Description
═════════════════════════════════════════════════════════════
0       32 bytes    文件头 (Header)
  ├─ 0   4 bytes    魔数 (0x4D494E49 = "MINI")
  ├─ 4   4 bytes    版本号 (1)
  ├─ 8   4 bytes    层数
  └─ 12  20 bytes   保留区域
═════════════════════════════════════════════════════════════
32      Variable    层数据 (Layer Data，重复层数次)
  ├─ 0   4 bytes    层类型 (0 = Linear)
  ├─ 4   4 bytes    输入维度 (in_features)
  ├─ 8   4 bytes    输出维度 (out_features)
  ├─ 12  4 bytes    是否有偏置 (0 或 1)
  ├─ 16  in×out×4   权重数据 (行优先存储)
  └─ ... out×4      偏置数据 (如果有)
═════════════════════════════════════════════════════════════

2. Tensor（张量）

N 维张量类，提供 GPU 存储和基本操作。

类设计

class Tensor {
public:
    // 构造与生命周期
    Tensor() = default;
    explicit Tensor(const std::vector<int>& shape);
    Tensor(const std::vector<int>& shape, const float* data);
    
    // 禁用拷贝，允许移动
    Tensor(const Tensor&) = delete;
    Tensor& operator=(const Tensor&) = delete;
    Tensor(Tensor&&) noexcept = default;
    Tensor& operator=(Tensor&&) noexcept = default;
    
    // 形状操作
    void reshape(const std::vector<int>& new_shape);
    Tensor clone() const;
    
    // 数据访问
    float* data();
    const float* data() const;
    std::vector<float> to_host() const;
    void copy_from_host(const float* host_data);
    void copy_to_host(float* host_data) const;
    
    // 填充操作
    void fill(float value);
    void zero();
    
    // 形状查询
    const std::vector<int>& shape() const { return shape_; }
    const std::vector<int>& strides() const { return strides_; }
    size_t size() const { return size_; }
    int ndim() const { return static_cast<int>(shape_.size()); }
    int dim(int i) const { return shape_[i]; }
    
    // 视图转换
    MatrixDesc as_matrix() const;

private:
    std::vector<int> shape_;
    std::vector<int> strides_;
    size_t size_ = 0;
    PooledMemory data_;  // 使用内存池
};

内存布局

行优先存储 (Row-Major):
═════════════════════════════════════════════════════════════
Shape: [2, 3, 4]
Strides: [12, 4, 1]

索引计算:
element(i, j, k) = i * 12 + j * 4 + k * 1

内存布局:
┌──────────────────────────────────────────────────────┐
│ [0,0,0] [0,0,1] [0,0,2] [0,0,3]                    │  Block 0,0
│ [0,1,0] [0,1,1] [0,1,2] [0,1,3]                    │  Block 0,1
│ [0,2,0] [0,2,1] [0,2,2] [0,2,3]                    │  Block 0,2
│ [1,0,0] [1,0,1] [1,0,2] [1,0,3]                    │  Block 1,0
│ [1,1,0] [1,1,1] [1,1,2] [1,1,3]                    │  Block 1,1
│ [1,2,0] [1,2,1] [1,2,2] [1,2,3]                    │  Block 1,2
└──────────────────────────────────────────────────────┘
═════════════════════════════════════════════════════════════

设计决策

决策	理由
使用内存池 (PooledMemory)	减少 cudaMalloc 开销
禁用拷贝构造	避免意外的大内存拷贝
行优先存储	与 C/C++ 和 CUDA 兼容
延迟计算 strides	仅在需要时计算

3. MemoryPool（内存池）

GPU 内存池，通过缓存减少 cudaMalloc 调用。

类设计

class MemoryPool {
public:
    static MemoryPool& instance();  // 单例模式
    
    // 分配与释放
    void* allocate(size_t size);
    void deallocate(void* ptr);
    
    // 缓存管理
    void clear_cache();    // 释放缓存，保留活跃分配
    void clear_all();      // 清理缓存并重置统计
    void release_all();    // 释放所有内存（仅用于关闭时）
    
    // 状态查询
    bool owns(void* ptr) const;
    bool is_cached(void* ptr) const;
    size_t live_allocated_bytes() const;
    size_t cached_block_count() const;
    size_t active_block_count() const;
    
    // 统计
    struct Stats {
        size_t total_allocated;   // 累计分配字节数
        size_t cached_size;       // 缓存大小
        size_t cache_hits;        // 缓存命中次数
        size_t cache_misses;      // 缓存未命中次数
    };
    Stats get_stats() const;
    void print_stats() const;

private:
    MemoryPool() = default;
    mutable std::mutex mutex_;
    std::map<void*, size_t> allocated_blocks_;
    std::multimap<size_t, void*> free_blocks_;
    static constexpr size_t ALIGNMENT = 256;
};

分配策略

内存分配流程:
═════════════════════════════════════════════════════════════
                    ┌─────────────────┐
                    │ allocate(size)  │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │ 对齐到 256 字节  │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
              ┌────▶│ 检查 free_blocks │
              │     └────────┬────────┘
              │              │
              │     ┌────────▼────────┐
              │     │ 找到合适块？     │
              │     └────────┬────────┘
              │              │
         ┌────┴────┐    ┌────▼────┐
         │   是    │    │   否    │
         └────┬────┘    └────┬────┘
              │              │
    ┌─────────▼────┐  ┌──────▼──────┐
    │ 返回缓存块    │  │ cudaMalloc  │
    │ cache_hits++ │  │ cache_misses++│
    └──────────────┘  └─────────────┘
═════════════════════════════════════════════════════════════

性能优化

优化技术	效果
256 字节对齐	支持向量化加载
Best-fit 策略	减少内存碎片
多 map 存储空闲块	O(log n) 查找
线程安全 (mutex)	多线程安全

4. StreamManager（流管理器）

CUDA 流管理器，支持并发执行。

类设计

class StreamManager {
public:
    static StreamManager& instance();
    
    // 初始化
    void init(int num_streams = 4);
    
    // 获取流
    cudaStream_t get_stream();           // 轮询分配
    cudaStream_t get_stream(int index);  // 指定索引
    
    // 同步
    void sync_all();
    void sync(int index);
    
    // 清理
    void cleanup();
    
    int num_streams() const { return static_cast<int>(streams_.size()); }

private:
    std::vector<cudaStream_t> streams_;
    int current_stream_ = 0;
    mutable std::mutex mutex_;
};

并发执行模式

单流执行（串行）:
═════════════════════════════════════════════════════════════
Stream 0: ├─GEMM 0─┼─GEMM 1─┼─GEMM 2─┼─GEMM 3─┤
时间:     0        T        2T       3T       4T

多流执行（并行）:
═════════════════════════════════════════════════════════════
Stream 0: ├─GEMM 0─┼─────────┼─GEMM 2─┼─────────┤
Stream 1: ├─────────┼─GEMM 1─┼─────────┼─GEMM 3─┤
时间:     0        T        2T       3T       4T

重叠执行:
═════════════════════════════════════════════════════════════
Stream 0: ├─GEMM 0─┼─GEMM 2─┤
Stream 1: ├─GEMM 1─┼─GEMM 3─┤
时间:     0        T        2T

加速比: 2x
═════════════════════════════════════════════════════════════

GEMM Kernel 架构

优化层次架构

                    ┌─────────────────┐
                    │   Vectorized    │ Level 7
                    │   (float4)      │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │     Fused       │ Level 6
                    │ (+ Bias + ReLU) │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │    Register     │ Level 5
                    │    Blocking     │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │ Double Buffer   │ Level 4
                    │   (2x shared)   │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │   Coalesced     │ Level 3
                    │ Memory Access   │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │     Tiled       │ Level 2
                    │ (Shared Memory) │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │     Naive       │ Level 1
                    │  (Global Mem)   │
                    └─────────────────┘

模板参数设计

template<
    int BM,    // Block tile M 维度
    int BN,    // Block tile N 维度
    int BK,    // Block tile K 维度（每次迭代）
    int TM,    // Thread tile M 维度
    int TN     // Thread tile N 维度
>
__global__ void optimized_gemm(const float* A, const float* B, float* C,
                                int M, int N, int K);

参数约束

约束验证:
═════════════════════════════════════════════════════════════
1. 线程数约束:
   threads = (BM / TM) × (BN / TN) ≤ 1024
   
2. 共享内存约束:
   shared = (BM × BK + BK × BN) × sizeof(float) ≤ 48KB
   
3. 寄存器约束:
   regs ≈ TM × TN + TM + TN + overhead ≤ 255
═════════════════════════════════════════════════════════════

推荐配置:
═════════════════════════════════════════════════════════════
配置          BM   BN   BK   TM   TN   Threads   Shared
─────────────────────────────────────────────────────────
Small         64   64    8    4    4     256     4KB
Medium       128  128    8    8    8     256     8KB
Large        128  256   16    8    8     512    24KB
═════════════════════════════════════════════════════════════

数据流设计

推理数据流

推理执行流程:
═════════════════════════════════════════════════════════════

Host                    Device
─────                   ──────
   │
   │  cudaMemcpyH2D
   ├──────────────────────▶  Input Buffer
   │                              │
   │                              ▼
   │                         ┌─────────────┐
   │                         │  Layer 0    │
   │                         │ GEMM+ReLU   │
   │                         └──────┬──────┘
   │                                │
   │                                ▼
   │                         ┌─────────────┐
   │                         │  Layer 1    │
   │                         │ GEMM+ReLU   │
   │                         └──────┬──────┘
   │                                │
   │                                ▼
   │                         ┌─────────────┐
   │                         │  Layer N    │
   │                         │ GEMM        │
   │                         └──────┬──────┘
   │                                │
   │  cudaMemcpyD2H                 ▼
   │◀──────────────────────  Output Buffer
   │
   ▼
Result
═════════════════════════════════════════════════════════════

权重加载流程

权重文件加载:
═════════════════════════════════════════════════════════════
File                    Host                    Device
─────                   ─────                   ──────
   │
   │  fread
   ├──────────────────▶  Header
   │
   │                     │ 验证魔数和版本
   │                     ▼
   │  fread
   ├──────────────────▶  Layer Meta
   │                     │ 验证维度
   │                     ▼
   │  fread              │
   ├──────────────────▶  Weight Data
   │                     │ cudaMemcpyH2D
   │                     ├──────────────────────▶  d_weights
   │                     │
   │  fread (if bias)    │
   ├──────────────────▶  Bias Data
                         │ cudaMemcpyH2D
                         ├──────────────────────▶  d_bias
                         │
                         ▼
                   Ready for Inference
═════════════════════════════════════════════════════════════

内存管理

内存策略对比

策略	实现	适用场景	性能
直接分配	`cudaMalloc`/`cudaFree`	简单程序	低
RAII 包装	`DeviceMemory`	一般应用	中
内存池	`MemoryPool`	高频分配	高
预分配	预先分配大块内存	固定负载	最高

错误处理机制

异常层次

std::exception
    │
    ├── CudaException          // CUDA 运行时错误
    │   └── cudaError_t error_
    │
    ├── std::runtime_error     // cuBLAS 错误等
    │
    └── std::invalid_argument  // 参数验证错误

CUDA 错误检查宏

#define CUDA_CHECK(call)                                      \
    do {                                                      \
        cudaError_t err = call;                               \
        if (err != cudaSuccess) {                             \
            throw CudaException(err, __FILE__, __LINE__);     \
        }                                                     \
    } while (0)

#define CUBLAS_CHECK(call)                                    \
    do {                                                      \
        cublasStatus_t status = call;                         \
        if (status != CUBLAS_STATUS_SUCCESS) {                \
            throw std::runtime_error("cuBLAS error");         \
        }                                                     \
    } while (0)

使用示例

try {
    CUDA_CHECK(cudaMalloc(&ptr, size));
    launch_kernel<<<grid, block>>>(...);
    CUDA_CHECK(cudaGetLastError());
} catch (const CudaException& e) {
    std::cerr << "CUDA 错误: " << e.what() << std::endl;
    std::cerr << "错误码: " << cudaGetErrorName(e.error()) << std::endl;
} catch (const std::exception& e) {
    std::cerr << "错误: " << e.what() << std::endl;
}

扩展指南

添加新的 GEMM Kernel

声明 kernel（include/kernels.cuh）：

void launch_my_gemm(const float* A, const float* B, float* C,
                    int M, int N, int K, cudaStream_t stream = 0);

实现 kernel（src/my_gemm.cu）：

__global__ void my_gemm_kernel(...) { ... }

void launch_my_gemm(...) {
    // 配置 grid/block
    // 启动 kernel
    // 错误检查
}

添加枚举（include/common.h）：

enum class GemmKernelType {
    // ... existing types ...
    MY_KERNEL
};

注册到 AutoTuner（include/autotuner.h）：

case GemmKernelType::MY_KERNEL:
    launch_my_gemm(A, B, C, M, N, K, stream);
    break;

添加新的激活函数

扩展 FusionConfig（include/common.h）：

struct FusionConfig {
    bool add_bias = false;
    bool apply_relu = false;
    bool apply_gelu = false;       // 新增
    bool apply_sigmoid = false;    // 新增
};

扩展 fused_gemm.cu：

template<..., bool APPLY_GELU, bool APPLY_SIGMOID>
__global__ void fused_gemm_bias_activation(...) {
    // ...
    if constexpr (APPLY_GELU) {
        val = gelu(val);
    }
    if constexpr (APPLY_SIGMOID) {
        val = sigmoid(val);
    }
    // ...
}

目录 (Table of Contents)

系统架构概览

分层架构

各层职责

核心组件

1. InferenceEngine（推理引擎）

类设计

设计决策

权重文件格式

2. Tensor（张量）

类设计

内存布局

设计决策

3. MemoryPool（内存池）

类设计

分配策略

性能优化

4. StreamManager（流管理器）

类设计

并发执行模式

GEMM Kernel 架构

优化层次架构

模板参数设计

参数约束

数据流设计

推理数据流

权重加载流程

内存管理

内存策略对比

推荐用法

错误处理机制

异常层次

CUDA 错误检查宏

使用示例

扩展指南

添加新的 GEMM Kernel

添加新的激活函数

相关链接