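AGENTS.md - Mini-ImagePipe AI Agent Guide

This file gives AI coding assistants (Claude, Copilot, Codex, and similar tools) in-depth project context and guidance on the project's business logic.

Project Overview

Mini-ImagePipe is a GPU-accelerated image processing pipeline framework built on a DAG task-graph architecture. It is designed for high-throughput video stream processing, supports fully on-GPU pipeline execution, and targets industrial applications such as autonomous-driving perception, medical image processing, and embedded AI.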

Core Architecture

┌─────────────────────────────────────────────────────────────┐
│               Pipeline (end-to-end pipeline)                │
├─────────────────────────────────────────────────────────────┤
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │  Operator 1  │───▶│  Operator 2  │───▶│  Operator N  │  │
│  │  (Resize)    │    │  (Blur)      │    │  (Sobel)     │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
├─────────────────────────────────────────────────────────────┤
│           DAGScheduler (multi-stream scheduling)            │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐          │
│  │Stream 0 │ │Stream 1 │ │Stream 2 │ │Stream 3 │          │
│  └─────────┘ └─────────┘ └─────────┘ └─────────┘          │
├─────────────────────────────────────────────────────────────┤
│             TaskGraph (DAG topology management)             │
│  - Topological sort      - Cycle detection                  │
│  - Dependency tracking                                      │
├─────────────────────────────────────────────────────────────┤
│           MemoryManager (memory pool management)            │
│  - Pinned Memory Pool    - Device Memory Pool               │
│  - Best-fit Allocation   - Block Reuse                      │
└─────────────────────────────────────────────────────────────┘

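Core Components in Detail

1. MemoryManager

Responsibility: manages the page-locked (pinned) host memory pool and the device memory pool

Key features:

Allocates pinned memory with cudaHostAlloc
Asynchronous memory copies coordinated with CUDA streams
Pool-based block reuse to reduce allocation overhead
Falls back to pageable memory when a pinned allocation fails

Source files: src/memory_manager.cu, include/memory_manager.h

Usage pattern: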

MemoryManager mgr;
mgr.initialize(4);  // buffers for 4 streams
void* pinned = mgr.allocatePinned(size);
mgr.freePinned(pinned, size);
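
The best-fit allocation and block reuse listed for the pool can be sketched on the host alone. The following is a minimal illustration, not the project's MemoryManager: the class name BestFitPool and its methods are hypothetical, and std::malloc stands in for the cudaHostAlloc call so the sketch compiles without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>

// Hypothetical host-only stand-in for a pinned-memory pool.
// std::malloc is used where a real pool would call cudaHostAlloc.
class BestFitPool {
public:
    ~BestFitPool() {
        // Cached blocks are released to the system only when the pool dies.
        for (auto& entry : free_) std::free(entry.second);
    }

    // Returns the smallest cached block whose capacity is >= size,
    // or a freshly allocated block if no cached block is large enough.
    void* allocate(std::size_t size, std::size_t* capacity) {
        assert(size > 0 && capacity != nullptr);
        auto it = free_.lower_bound(size);  // best fit: first key >= size
        if (it != free_.end()) {
            void* ptr = it->second;
            *capacity = it->first;
            free_.erase(it);
            ++reuseCount_;
            return ptr;
        }
        *capacity = size;
        return std::malloc(size);
    }

    // Returns a block to the pool so a later allocate() can reuse it.
    void release(void* ptr, std::size_t capacity) {
        if (ptr != nullptr) free_.emplace(capacity, ptr);
    }

    std::size_t reuseCount() const { return reuseCount_; }
    std::size_t cachedBlocks() const { return free_.size(); }

private:
    std::multimap<std::size_t, void*> free_;  // capacity -> cached block
    std::size_t reuseCount_ = 0;
};
```

Keying the free list by capacity makes the best-fit lookup a single lower_bound call. The caller passes the capacity back on release, which mirrors the freePinned(ptr, size) shape of the usage pattern above.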

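2. TaskGraph

Responsibility: manages the DAG topology

Key features:

Topological sort to determine execution order
Cycle detection to prevent deadlock
Dependency tracking
Identification of tasks that can run in parallel

Source files: src/task_graph.cpp, include/task_graph.h

Usage pattern: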

TaskGraph graph;
int n1 = graph.addNode(op1);
int n2 = graph.addNode(op2);
graph.addEdge(n1, n2);  // edge n1 -> n2: n2 depends on n1
auto order = graph.topologicalSort();
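
Topological sorting and cycle detection can share one pass. The sketch below uses Kahn's algorithm as one common way to do this; whether TaskGraph uses Kahn's algorithm or a DFS is not stated in this guide, and the free function topoSort is a hypothetical helper rather than the project's API.

```cpp
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical helper, not the project's TaskGraph API.
// Nodes are numbered 0..numNodes-1; each edge (a, b) means "b depends on a".
// Returns true and fills `order` if the graph is acyclic, false on a cycle.
bool topoSort(int numNodes,
              const std::vector<std::pair<int, int>>& edges,
              std::vector<int>* order) {
    assert(numNodes >= 0 && order != nullptr);
    std::vector<std::vector<int>> successors(numNodes);
    std::vector<int> inDegree(numNodes, 0);
    for (const auto& e : edges) {
        assert(e.first >= 0 && e.first < numNodes);
        assert(e.second >= 0 && e.second < numNodes);
        successors[e.first].push_back(e.second);
        ++inDegree[e.second];
    }

    // Start from nodes that have no unmet dependencies.
    std::queue<int> ready;
    for (int n = 0; n < numNodes; ++n) {
        if (inDegree[n] == 0) ready.push(n);
    }

    order->clear();
    while (!ready.empty()) {
        const int n = ready.front();
        ready.pop();
        order->push_back(n);
        for (int s : successors[n]) {
            if (--inDegree[s] == 0) ready.push(s);
        }
    }
    // Nodes on a cycle never reach in-degree 0, so they are never emitted.
    return static_cast<int>(order->size()) == numNodes;
}
```

Nodes that sit in the ready queue at the same time have no dependency on each other, which is one way to identify the parallel tasks mentioned above.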

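3. DAGScheduler

Responsibility: multi-stream scheduling and execution

Key features:

Concurrent execution across multiple CUDA streams
Event-based synchronization of cross-stream dependencies
Stream assignment policy
Error propagation

Source files: src/scheduler.cu, include/scheduler.h

Usage pattern: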

DAGScheduler scheduler;
scheduler.initialize(4);  // 4 CUDA streams
scheduler.setTaskGraph(&graph);
scheduler.execute();
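
To make stream assignment and cross-stream synchronization concrete, here is a host-only planning sketch. It assumes a simple round-robin policy over the topological order, which is an illustrative choice and not necessarily the policy in scheduler.cu; StreamPlan and planStreams are hypothetical names. In a CUDA implementation, each edge that crosses streams would typically be enforced by recording an event on the producer's stream (cudaEventRecord) and waiting on it from the consumer's stream (cudaStreamWaitEvent).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical planning result, not the project's scheduler state.
struct StreamPlan {
    std::vector<int> streamOf;                     // node id -> stream index
    std::vector<std::pair<int, int>> eventEdges;   // edges that cross streams
};

// Assumes node ids are 0..N-1 and topoOrder is a permutation of them.
StreamPlan planStreams(const std::vector<int>& topoOrder,
                       const std::vector<std::pair<int, int>>& edges,
                       int numStreams) {
    assert(numStreams > 0);
    StreamPlan plan;
    plan.streamOf.assign(topoOrder.size(), 0);

    // Round-robin: the i-th node in topological order goes to stream i % numStreams.
    for (std::size_t i = 0; i < topoOrder.size(); ++i) {
        plan.streamOf[topoOrder[i]] = static_cast<int>(i % numStreams);
    }

    // Work issued to one stream runs in issue order, so only edges whose
    // endpoints sit on different streams need an explicit event.
    for (const auto& e : edges) {
        if (plan.streamOf[e.first] != plan.streamOf[e.second]) {
            plan.eventEdges.push_back(e);
        }
    }
    return plan;
}
```

For the diamond graph 0→1, 0→2, 1→3, 2→3 on two streams, this policy places nodes 0 and 2 on stream 0 and nodes 1 and 3 on stream 1, so only the edges 0→1 and 2→3 need events.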

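4. Pipeline

Responsibility: end-to-end pipeline management

Key features:

Operator connection management
Automatic allocation of intermediate buffers
Batch processing support
Runtime parameter configuration

Known limitations:

Multi-dependency nodes: in a diamond topology, a node with several dependencies consumes only the output of its first dependency. For example, with edges A→B, A→C, B→D, C→D, node D receives only B's output and C's output is ignored. Prefer linear or tree topologies to avoid this.

Source files: src/pipeline.cpp, include/pipeline.h

Usage pattern: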

PipelineConfig config;
config.numStreams = 4;
Pipeline pipeline(config);

int n1 = pipeline.addOperator("Resize", resizeOp);
int n2 = pipeline.addOperator("Blur", blurOp);
pipeline.connect(n1, n2);
pipeline.setInput(n1, d_input, w, h, c);
pipeline.execute();
void* output = pipeline.getOutput(n2);
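
Because a node with several dependencies silently drops all but its first input, it can help to check a topology before building the pipeline. The following is a hedged sketch of such a pre-flight check; nodesWithMultipleInputs is a hypothetical helper that works on plain (from, to) pairs and does not touch the Pipeline API.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical pre-flight check, not part of the Pipeline class.
// Each connection (from, to) mirrors a pipeline.connect(from, to) call.
// Returns the ids of nodes that have more than one incoming connection.
std::vector<int> nodesWithMultipleInputs(
        int numNodes,
        const std::vector<std::pair<int, int>>& connections) {
    assert(numNodes >= 0);
    std::vector<int> inDegree(numNodes, 0);
    for (const auto& c : connections) {
        assert(c.second >= 0 && c.second < numNodes);
        ++inDegree[c.second];
    }

    std::vector<int> flagged;
    for (int n = 0; n < numNodes; ++n) {
        if (inDegree[n] > 1) flagged.push_back(n);
    }
    return flagged;
}
```

For the diamond A→B, A→C, B→D, C→D (ids 0 to 3), the check flags node 3, the node whose second input would be ignored. A linear chain returns an empty list.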

Operator Implementation Patterns

Generic operator interface

class Operator {
public:
    virtual ~Operator() = default;
    virtual bool execute(cudaStream_t stream) = 0;
    virtual void setInput(void* data, int w, int h, int ch) = 0;
    virtual void* getOutput() = 0;
    virtual size_t getOutputSize() const = 0;
};
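
As an illustration of how an implementation fills in this interface, here is a host-only pass-through operator. Everything in it is an assumption made for the sketch: the sketch namespace, the cudaStream_t alias (a stand-in so the code compiles without CUDA headers), the PassThroughOp class, and the choice of 8-bit channels in host memory. A real operator would launch a kernel on the given stream and work on device buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

namespace sketch {

using cudaStream_t = void*;  // stand-in type; the real one comes from CUDA headers

// Interface repeated from the guide so the sketch is self-contained.
class Operator {
public:
    virtual ~Operator() = default;
    virtual bool execute(cudaStream_t stream) = 0;
    virtual void setInput(void* data, int w, int h, int ch) = 0;
    virtual void* getOutput() = 0;
    virtual std::size_t getOutputSize() const = 0;
};

// Hypothetical operator that copies its input unchanged (8-bit channels assumed).
class PassThroughOp : public Operator {
public:
    void setInput(void* data, int w, int h, int ch) override {
        assert(w > 0 && h > 0 && ch > 0);
        input_ = data;
        size_ = static_cast<std::size_t>(w) * h * ch;
    }

    // Returns false instead of throwing when no input was set,
    // matching the bool return of the interface.
    bool execute(cudaStream_t /*stream*/) override {
        if (input_ == nullptr || size_ == 0) return false;
        output_.resize(size_);
        std::memcpy(output_.data(), input_, size_);
        return true;
    }

    void* getOutput() override {
        return output_.empty() ? nullptr : output_.data();
    }

    std::size_t getOutputSize() const override { return size_; }

private:
    void* input_ = nullptr;
    std::size_t size_ = 0;              // bytes per image
    std::vector<unsigned char> output_; // operator-owned output buffer
};

}  // namespace sketch
```

The operator owns its output buffer and reports its size in bytes, which is the information a pipeline needs to wire one operator's output to the next operator's input.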

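Implemented operators

Operator	File	Key techniques
GaussianBlur	`src/operators/gaussian_blur.cu`	Separable filtering, shared memory, halo regions
Sobel	`src/operators/sobel.cu`	Gradient computation, shared-memory optimization
Resize	`src/operators/resize.cu`	Bilinear / nearest-neighbor interpolation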
ColorConvert	`src/operators/color_convert.cu`	RGB↔Gray、BGR↔RGB

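CUDA Optimization Patterns

Shared memory + halo regions (GaussianBlur):

__shared__ float tile[BLOCK_H + 2*HALO][BLOCK_W + 2*HALO];
// Load the tile together with its halo border
// Run the convolution out of shared memory

Separable filtering (GaussianBlur):

// Two passes: horizontal first, then vertical
// For a k×k kernel, per-pixel cost drops from O(k²) to O(2k)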
horizontalPass<<<...>>>(src, temp, kernel);
verticalPass<<<...>>>(temp, dst, kernel);
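
The equivalence behind the two-pass scheme can be checked with a CPU reference. The sketch below runs a horizontal and then a vertical 1D pass and compares the result with a direct 2D convolution that uses the outer product of the same 1D kernel. The function names and the clamp-to-edge border handling are assumptions made for this illustration; the project's kernels may treat borders differently.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Image = std::vector<float>;  // row-major, single channel

// Clamp-to-edge border handling (an assumption for this sketch).
static int clampIndex(int v, int size) {
    return std::max(0, std::min(v, size - 1));
}

// One 1D pass along x (horizontal == true) or along y. Cost: O(k) per pixel.
Image pass1D(const Image& src, int w, int h,
             const std::vector<float>& k, bool horizontal) {
    assert(static_cast<int>(src.size()) == w * h && k.size() % 2 == 1);
    const int r = static_cast<int>(k.size()) / 2;
    Image dst(src.size());
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int i = -r; i <= r; ++i) {
                const int sx = horizontal ? clampIndex(x + i, w) : x;
                const int sy = horizontal ? y : clampIndex(y + i, h);
                acc += k[i + r] * src[sy * w + sx];
            }
            dst[y * w + x] = acc;
        }
    }
    return dst;
}

// Separable blur: horizontal pass into a temporary, then vertical pass.
Image blurSeparable(const Image& src, int w, int h, const std::vector<float>& k) {
    return pass1D(pass1D(src, w, h, k, true), w, h, k, false);
}

// Direct 2D convolution with kernel k[dy] * k[dx]. Cost: O(k^2) per pixel.
Image blurDirect2D(const Image& src, int w, int h, const std::vector<float>& k) {
    assert(static_cast<int>(src.size()) == w * h && k.size() % 2 == 1);
    const int r = static_cast<int>(k.size()) / 2;
    Image dst(src.size());
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int dy = -r; dy <= r; ++dy) {
                for (int dx = -r; dx <= r; ++dx) {
                    acc += k[dy + r] * k[dx + r] *
                           src[clampIndex(y + dy, h) * w + clampIndex(x + dx, w)];
                }
            }
            dst[y * w + x] = acc;
        }
    }
    return dst;
}
```

A reference like this can also serve as the oracle in a property test: for random images, the GPU result should match blurDirect2D within a small floating-point tolerance.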

Constant memory (Sobel):

__constant__ float sobelKernel[9];
// The Sobel kernel is small and fixed, which makes it a good fit for constant memory

Testing Strategy

Property-based testing

Each test verifies its property over 100 randomized iterations:

TEST(GaussianBlurTest, MultiChannelSupport) {
    for (int iter = 0; iter < 100; ++iter) {
        int width = rand() % 1920 + 1;
        int height = rand() % 1080 + 1;
        int channels = rand() % 4 + 1;
        // Feature: mini-image-pipe, Property 1
        // Check: output dimensions == input dimensions
    }
}
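
A failing random case is easier to debug when it can be reproduced. The sketch below shows one possible way to structure the 100-iteration loop around a fixed seed and to report the failing shape. It is written without GoogleTest so it stands alone; ImageShape and forAllShapes are hypothetical names, and the value ranges copy the ones in the test above.

```cpp
#include <cassert>
#include <functional>
#include <random>

// Hypothetical description of one random test case.
struct ImageShape {
    int width;
    int height;
    int channels;
};

// Runs `property` on `iterations` shapes drawn from a generator with a
// fixed seed. Returns true if every case passes; on the first failure it
// stores the offending shape in `failing` (if non-null) and returns false.
bool forAllShapes(unsigned seed, int iterations,
                  const std::function<bool(const ImageShape&)>& property,
                  ImageShape* failing) {
    assert(iterations > 0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> widthDist(1, 1920);
    std::uniform_int_distribution<int> heightDist(1, 1080);
    std::uniform_int_distribution<int> channelDist(1, 4);

    for (int iter = 0; iter < iterations; ++iter) {
        ImageShape shape;
        shape.width = widthDist(rng);
        shape.height = heightDist(rng);
        shape.channels = channelDist(rng);
        if (!property(shape)) {
            if (failing != nullptr) *failing = shape;
            return false;
        }
    }
    return true;
}
```

Inside a GoogleTest case, the property lambda would run the operator on an image of the given shape and compare output and input dimensions. Logging the seed and the failing shape lets the same case be replayed.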

Test coverage matrix

Test file	Component covered	Property IDs
`test_memory_manager.cpp`	MemoryManager	P16-P18
`test_task_graph.cpp`	TaskGraph	P11-P13
`test_scheduler.cpp`	DAGScheduler	P14-P15
`test_pipeline.cpp`	Pipeline	P19-P22
`test_gaussian_blur.cpp`	GaussianBlur	P1-P3
`test_sobel.cpp`	Sobel	P4-P5
`test_resize.cpp`	Resize	P6-P7
`test_color_convert.cpp`	ColorConvert	P8-P10

OpenSpec Workflow

This project uses OpenSpec for spec-driven development:

┌─────────────────────────────────────────────────┐
│  /opsx:propose "<idea>"                         │
│  Create change proposal →                       │
│  proposal.md + design.md + tasks.md             │
└─────────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────┐
│  /opsx:apply [name]                             │
│  Apply tasks → run in order → update checkboxes │
└─────────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────┐
│  /opsx:archive <name>                           │
│  Archive → merge delta spec into main spec      │
└─────────────────────────────────────────────────┘

Spec file locations

Main spec: openspec/specs/image-pipeline/spec.md
Change proposals: openspec/changes/<name>/
Archived changes: openspec/changes/archive/<name>/

Build and Debugging

Build commands

cmake --preset default      # Debug build
cmake --preset release      # Release build
cmake --preset minimal      # Local GPU architecture only (faster compile)
cmake --build --preset default
ctest --preset default

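Troubleshooting

Problem	Cause	Solution
nvcc not found	CUDA is not on PATH	`export PATH=/usr/local/cuda/bin:$PATH`
CMake version too old	Version < 3.18	Upgrade CMake
Out of memory during compilation	Compiling for multiple GPU architectures	Use the minimal preset
Tests crash with SIGSEGV	Insufficient GPU memory	Reduce the test image size

Code Modification Guide

Adding a new operator

Create the header in include/operators/
Create the implementation file in src/operators/
Add the source file to LIB_SOURCES in CMakeLists.txt
Create the test file in tests/
Update the spec under openspec/specs/

Modifying the scheduler

Understand the dependency chain TaskGraph → DAGScheduler → Pipeline
Modify scheduler.cu and scheduler.h
Update test_scheduler.cpp
Verify correctness under multi-stream concurrency

Performance optimization

Use Nsight Systems to find pipeline-level bottlenecks
Use Nsight Compute to analyze kernel performance
Optimize memory access patterns first
Confirm that all tests still pass after each optimization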

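Versioning and Releases

Current version: 1.0.0
Version file: VERSION
Release process: create a tag → GitHub Release

This file was created automatically by the AI agent guide generator; keep it in sync with the code during later maintenance.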