Architecture Overview

Understanding Mini-ImagePipe’s design and components.

Table of contents

  1. System Architecture
    1. Component Diagram
  2. Core Components
    1. Pipeline
    2. TaskGraph
    3. DAGScheduler
    4. MemoryManager
  3. Execution Flow
  4. Memory Strategy
  5. Operator Interface
    1. Operator Implementations
      1. Gaussian Blur Operator
      2. Sobel Edge Detection Operator
      3. Resize Operator
      4. Color Conversion Operator
  6. Data Models
    1. Image Buffer
    2. Kernel Configuration
    3. Pipeline Configuration
  7. Thread Safety
  8. Testing Strategy
    1. Unit Tests
    2. Property-Based Tests
    3. Integration Tests
  9. Next Steps

System Architecture

┌───────────────────────────────────────────────────────┐
│                     Pipeline API                      │
├───────────────────────────────────────────────────────┤
│  TaskGraph   │  DAGScheduler   │  MemoryManager       │
├───────────────────────────────────────────────────────┤
│  Operators: Gaussian │ Sobel │ Resize │ ColorConvert  │
├───────────────────────────────────────────────────────┤
│  CUDA Streams   │   CUDA Events   │  Shared Memory    │
└───────────────────────────────────────────────────────┘

Component Diagram

graph TB
    subgraph "User API"
        Pipeline[Pipeline Builder]
    end
    
    subgraph "Core Framework"
        TaskGraph[Task Graph]
        Scheduler[DAG Scheduler]
        StreamMgr[CUDA Stream Manager]
        MemMgr[Memory Manager]
    end
    
    subgraph "Operators"
        Gaussian[Gaussian Blur]
        Sobel[Sobel Edge]
        Resize[Resize]
        ColorConv[Color Convert]
    end
    
    subgraph "GPU Resources"
        Streams[CUDA Streams]
        Events[CUDA Events]
        DevMem[Device Memory]
        SharedMem[Shared Memory]
    end
    
    Pipeline --> TaskGraph
    TaskGraph --> Scheduler
    Scheduler --> StreamMgr
    Scheduler --> MemMgr
    
    StreamMgr --> Streams
    StreamMgr --> Events
    MemMgr --> DevMem
    
    Scheduler --> Gaussian
    Scheduler --> Sobel
    Scheduler --> Resize
    Scheduler --> ColorConv
    
    Gaussian --> SharedMem
    Sobel --> SharedMem

Core Components

Pipeline

The main entry point for building and executing image processing workflows.

Responsibilities:

  • Manages operator registration and connection
  • Handles buffer allocation and lifecycle
  • Provides batch processing support
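  • Coordinates execution across components

Pipeline pipeline(config);                    // config is a PipelineConfig
int node = pipeline.addOperator("name", op);  // returns the new node's ID
pipeline.connect(from, to);                   // adds a dependency edge from -> to
pipeline.execute();                           // validates, allocates, and runs the graph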

TaskGraph

DAG-based task dependency management using Kahn’s algorithm for topological sorting.

Features:

  • Topological sorting: Determines execution order respecting dependencies
  • Cycle detection: Validates graph before execution using DFS
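  • State tracking: Each task is in one of four states (PENDING, RUNNING, COMPLETED, FAILED)

TaskGraph& graph = pipeline.getTaskGraph();
auto order = graph.topologicalSort();   // execution order that respects dependencies
bool hasCycle = graph.detectCycle();    // true if the graph is not a valid DAG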
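As an illustration of the topological sort described above, here is a minimal host-side sketch of Kahn's algorithm. The `SortResult` struct, the `kahnTopologicalSort` name, and the edge-list representation are assumptions made for this example, not the library's actual types. The text above says cycle detection uses DFS; this sketch only shows that Kahn's algorithm can also report a cycle as a by-product, because nodes on a cycle never reach in-degree zero.

```cpp
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical result type: order is a valid topological order only
// when hasCycle is false.
struct SortResult {
    std::vector<int> order;
    bool hasCycle;
};

// Nodes are 0..numNodes-1; an edge (from, to) means "to depends on from".
SortResult kahnTopologicalSort(int numNodes,
                               const std::vector<std::pair<int, int>>& edges) {
    std::vector<std::vector<int>> successors(numNodes);
    std::vector<int> inDegree(numNodes, 0);
    for (const auto& e : edges) {
        assert(e.first >= 0 && e.first < numNodes);
        assert(e.second >= 0 && e.second < numNodes);
        successors[e.first].push_back(e.second);
        ++inDegree[e.second];
    }

    // Start with every node that has no unmet dependency.
    std::queue<int> ready;
    for (int n = 0; n < numNodes; ++n)
        if (inDegree[n] == 0) ready.push(n);

    SortResult result{{}, false};
    while (!ready.empty()) {
        int n = ready.front();
        ready.pop();
        result.order.push_back(n);
        // Emitting n satisfies one dependency of each successor.
        for (int s : successors[n])
            if (--inDegree[s] == 0) ready.push(s);
    }

    // Any node left unvisited lies on (or behind) a cycle.
    result.hasCycle = static_cast<int>(result.order.size()) != numNodes;
    return result;
}
```

For a diamond graph (0 → 1, 0 → 2, 1 → 3, 2 → 3) this yields an order that starts with node 0 and ends with node 3.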

DAGScheduler

CUDA multi-stream execution engine that maximizes GPU utilization.

How it works:

  1. Assigns tasks to available CUDA streams based on dependencies
  2. Uses CUDA events for synchronization between dependent tasks
  3. Independent tasks run concurrently on different streams
  4. Errors propagate downstream automatically

DAGScheduler& scheduler = pipeline.getScheduler();
scheduler.setNumStreams(4);            // same value as the PipelineConfig default
scheduler.setErrorCallback(callback);  // callback: user-supplied error handler
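To make steps 1 to 3 concrete, the following host-only sketch plans a stream for each task and lists which dependencies cross streams. The round-robin policy, the `StreamPlan` struct, and the `planStreams` name are assumptions for illustration; the scheduler's real assignment policy is not described here. In CUDA, work queued on the same stream is already ordered, so only cross-stream dependencies need an event (typically recorded with `cudaEventRecord` and awaited with `cudaStreamWaitEvent`).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical planning result.
struct StreamPlan {
    std::vector<int> streamOf;                    // stream index per task ID
    std::vector<std::pair<int, int>> eventWaits;  // (producer, consumer) on different streams
};

// order: task IDs 0..n-1 in topological order.
// edges: (from, to) pairs meaning "to depends on from".
StreamPlan planStreams(const std::vector<int>& order,
                       const std::vector<std::pair<int, int>>& edges,
                       int numStreams) {
    assert(numStreams > 0);
    StreamPlan plan;
    plan.streamOf.assign(order.size(), 0);

    // Round-robin assignment in topological order (an assumed policy).
    for (std::size_t i = 0; i < order.size(); ++i)
        plan.streamOf[order[i]] = static_cast<int>(i % numStreams);

    // A dependency inside one stream is ordered implicitly; a dependency
    // across streams needs an event recorded by the producer.
    for (const auto& e : edges)
        if (plan.streamOf[e.first] != plan.streamOf[e.second])
            plan.eventWaits.push_back(e);
    return plan;
}
```

With two streams and the diamond graph 0 → {1, 2} → 3, tasks 0 and 2 land on stream 0, tasks 1 and 3 on stream 1, and only the edges 0 → 1 and 2 → 3 need events.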

MemoryManager

Pinned host and device memory pool management with best-fit allocation.

Design:

  • Best-fit strategy: Minimizes fragmentation by finding the smallest suitable block
  • Reuse across runs: Allocated buffers are retained for subsequent pipeline executions
  • Thread-safe: Safe for concurrent access from multiple threads

MemoryManager& mm = MemoryManager::getInstance();  // process-wide singleton
void* ptr = mm.allocateDevice(size);               // size in bytes
mm.freeDevice(ptr);                                // returns the block to the pool for reuse
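The best-fit strategy can be sketched with an ordered container keyed by block size. This is a simplified model under stated assumptions: `BestFitPool` is a hypothetical class, blocks are identified by integer IDs rather than real device or pinned pointers, and blocks are neither split nor merged.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical pool model: block IDs stand in for real pointers.
class BestFitPool {
public:
    // Registers a free block of the given size.
    void addFreeBlock(std::size_t size, int blockId) {
        freeBlocks_.emplace(size, blockId);
    }

    // Returns the ID of the smallest free block that fits, or -1 if none does.
    int allocate(std::size_t size) {
        // lower_bound finds the first block whose size is >= the request,
        // which is the best fit because the map is ordered by size.
        auto it = freeBlocks_.lower_bound(size);
        if (it == freeBlocks_.end()) return -1;
        int id = it->second;
        used_.emplace(id, it->first);
        freeBlocks_.erase(it);
        return id;
    }

    // Returns a block to the pool so a later allocate() can reuse it.
    void release(int blockId) {
        auto it = used_.find(blockId);
        assert(it != used_.end());
        freeBlocks_.emplace(it->second, blockId);
        used_.erase(it);
    }

private:
    std::multimap<std::size_t, int> freeBlocks_;  // size -> block ID
    std::map<int, std::size_t> used_;             // block ID -> size
};
```

Given free blocks of 4096, 1024, and 2048 bytes, a 1500-byte request takes the 2048-byte block and leaves the 4096-byte block available for larger requests. A real pool would also guard these operations with a mutex, as the thread-safety note above implies.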

Execution Flow

  1. Build (addOperator, connect)
    • Register operators and define dependencies
    • Validate connections
  2. Validate (execute begins)
    • Check for cycles in the graph
    • Verify all operators have required inputs
  3. Allocate (internal)
    • Calculate buffer sizes based on topological order
    • Allocate from memory pools
  4. Execute (DAGScheduler::execute)
    • Launch kernels on assigned streams
    • Synchronize with CUDA events
    • Handle errors and propagate status
  5. Sync (cudaDeviceSynchronize)
    • Wait for all streams to complete
    • Results are ready in output buffers
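The error handling in step 4 can be pictured as a graph traversal: once a task fails, every task that depends on it, directly or indirectly, cannot produce valid output. The sketch below assumes downstream tasks are marked FAILED; the library might instead use a separate skipped state, and `propagateFailure` and the adjacency-list layout are hypothetical.

```cpp
#include <cassert>
#include <vector>

enum class TaskState { PENDING, RUNNING, COMPLETED, FAILED };

// Marks failedTask and every task reachable from it as FAILED.
// successors[i] lists the tasks that depend on task i.
void propagateFailure(int failedTask,
                      const std::vector<std::vector<int>>& successors,
                      std::vector<TaskState>& states) {
    assert(states.size() == successors.size());
    assert(failedTask >= 0 && failedTask < static_cast<int>(states.size()));

    std::vector<int> stack;
    stack.push_back(failedTask);
    states[failedTask] = TaskState::FAILED;

    // Iterative depth-first walk over downstream tasks.
    while (!stack.empty()) {
        int t = stack.back();
        stack.pop_back();
        for (int s : successors[t]) {
            if (states[s] != TaskState::FAILED) {
                states[s] = TaskState::FAILED;
                stack.push_back(s);
            }
        }
    }
}
```

In a diamond graph where task 1 fails, task 3 is marked FAILED while the independent branch through task 2 keeps its state.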

Memory Strategy

Host Memory (Pinned)          Device Memory
┌─────────────────┐          ┌─────────────────┐
│  Pinned Pool    │  H2D/D2H │  Device Pool    │
│  (Best-fit)     │ ◄──────► │  (Best-fit)     │
│                 │          │                 │
│  Fast async     │          │  Reuse across   │
│  transfers      │          │  executions     │
└─────────────────┘          └─────────────────┘

Key benefits:

  • Pinned memory: Enables asynchronous transfers between host and device (H2D and D2H)
  • Pool reuse: Eliminates allocation overhead in processing loops
  • Best-fit: Reduces memory fragmentation over time

Operator Interface

All operators implement the IOperator interface:

class IOperator {
public:
    virtual ~IOperator() = default;
    
    // Execute the operator on GPU
    virtual cudaError_t execute(const void* input, void* output,
                                int width, int height, int channels,
                                cudaStream_t stream) = 0;
    
    // Get output dimensions
    virtual void getOutputSize(int inputWidth, int inputHeight,
                               int& outputWidth, int& outputHeight) = 0;
    
    // Get output channels
    virtual int getOutputChannels(int inputChannels) = 0;
};

Operator Implementations

Gaussian Blur Operator

class GaussianBlurOperator : public IOperator {
public:
    enum KernelSize { KERNEL_3x3 = 3, KERNEL_5x5 = 5, KERNEL_7x7 = 7 };
    
    GaussianBlurOperator(KernelSize size, float sigma = 0.0f);
    
    cudaError_t execute(
        const void* input, void* output,
        int width, int height, int channels,
        cudaStream_t stream
    ) override;
    
private:
    KernelSize kernelSize_;
    float sigma_;
    float* d_kernelH_;  // Horizontal 1D kernel on device
    float* d_kernelV_;  // Vertical 1D kernel on device
    void* d_intermediate_;  // Intermediate buffer for separable filter
};

Separable Filter Implementation:

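  • Decompose the 2D Gaussian kernel into the product of two 1D kernels: G(x, y) = G(x) · G(y)
  • First pass: horizontal convolution with the 1D kernel
  • Second pass: vertical convolution with the 1D kernel, applied to the first pass's output
  • Complexity reduction: for an n×n kernel, multiply-adds per pixel drop from n² to 2n, i.e. O(n²) → O(n)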
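The separable decomposition can be checked on the CPU with a small reference sketch. The function names, the single-channel float image layout, and the clamp-to-edge border handling are assumptions for this example; the GPU kernels may treat borders differently. The sketch requires sigma > 0 and does not model how the operator interprets its default sigma of 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Normalized 1D Gaussian taps for an odd kernel size.
std::vector<float> gaussian1D(int size, float sigma) {
    assert(size % 2 == 1 && sigma > 0.0f);
    std::vector<float> k(size);
    int r = size / 2;
    float sum = 0.0f;
    for (int i = 0; i < size; ++i) {
        float x = static_cast<float>(i - r);
        k[i] = std::exp(-(x * x) / (2.0f * sigma * sigma));
        sum += k[i];
    }
    for (float& v : k) v /= sum;  // taps sum to 1, so brightness is preserved
    return k;
}

// One 1D pass over a single-channel image. (dx, dy) is (1, 0) for the
// horizontal pass and (0, 1) for the vertical pass. Borders are clamped.
std::vector<float> convolve1D(const std::vector<float>& img, int w, int h,
                              const std::vector<float>& k, int dx, int dy) {
    int r = static_cast<int>(k.size()) / 2;
    std::vector<float> out(img.size());
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int i = -r; i <= r; ++i) {
                int sx = std::min(std::max(x + i * dx, 0), w - 1);
                int sy = std::min(std::max(y + i * dy, 0), h - 1);
                acc += k[i + r] * img[sy * w + sx];
            }
            out[y * w + x] = acc;
        }
    }
    return out;
}

// Separable blur: horizontal pass, then vertical pass on its output.
std::vector<float> separableBlur(const std::vector<float>& img, int w, int h,
                                 const std::vector<float>& k) {
    return convolve1D(convolve1D(img, w, h, k, 1, 0), w, h, k, 0, 1);
}
```

Blurring a single bright pixel this way produces the outer product of the 1D taps around it, which is exactly the 2D kernel G(x) · G(y). Each output pixel costs 2n multiply-adds instead of n².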

Shared Memory with Halo Regions:

+------------------+
|  Halo (top)      |
+--+------------+--+
|H |   Tile     | H|
|a |   Data     | a|
|l |            | l|
|o |            | o|
+--+------------+--+
|  Halo (bottom)   |
+------------------+
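The diagram implies that each thread block loads its tile plus a border of kernel-radius pixels on every side. The arithmetic below is a sketch assuming a square tile, one float per pixel, and a halo on all four sides as drawn; `haloTileBytes` is a hypothetical helper. A separable pass may only need the halo along the direction it filters.

```cpp
#include <cassert>
#include <cstddef>

// Bytes of shared memory for a square tile plus a halo of kernelSize / 2
// pixels on every side, storing one float per pixel.
std::size_t haloTileBytes(int tileSize, int kernelSize) {
    assert(tileSize > 0 && kernelSize > 0 && kernelSize % 2 == 1);
    int radius = kernelSize / 2;
    std::size_t side = static_cast<std::size_t>(tileSize + 2 * radius);
    return side * side * sizeof(float);
}
```

For a 16×16 tile and a 5×5 kernel the radius is 2, so the padded tile is 20×20 pixels.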

Sobel Edge Detection Operator

class SobelOperator : public IOperator {
public:
    SobelOperator();
    
    cudaError_t execute(
        const void* input, void* output,
        int width, int height, int channels,
        cudaStream_t stream
    ) override;
    
private:
    // Sobel kernels are constant, stored in constant memory
    // Gx = [-1 0 1; -2 0 2; -1 0 1]
    // Gy = [-1 -2 -1; 0 0 0; 1 2 1]
};
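A CPU reference for the two kernels in the comments above might look like the following sketch. It assumes a single-channel float image and combines the two gradients as the Euclidean magnitude sqrt(Gx² + Gy²); the operator's actual output format and magnitude formula are not stated here, and `sobelMagnitude` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Gradient magnitude at an interior pixel (x, y) of a single-channel image.
float sobelMagnitude(const std::vector<float>& img, int w, int h, int x, int y) {
    assert(x > 0 && y > 0 && x < w - 1 && y < h - 1);
    static const int gx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    static const int gy[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};

    float sx = 0.0f;  // horizontal gradient response
    float sy = 0.0f;  // vertical gradient response
    for (int j = -1; j <= 1; ++j) {
        for (int i = -1; i <= 1; ++i) {
            float p = img[(y + j) * w + (x + i)];
            sx += static_cast<float>(gx[j + 1][i + 1]) * p;
            sy += static_cast<float>(gy[j + 1][i + 1]) * p;
        }
    }
    return std::sqrt(sx * sx + sy * sy);
}
```

A vertical edge (left columns 0, right column 1) gives a horizontal response of 1 + 2 + 1 = 4 and no vertical response, so the magnitude at the center is 4.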

Resize Operator

class ResizeOperator : public IOperator {
public:
    enum InterpolationMode { NEAREST, BILINEAR };
    
    ResizeOperator(int targetWidth, int targetHeight, InterpolationMode mode);
    
    cudaError_t execute(
        const void* input, void* output,
        int width, int height, int channels,
        cudaStream_t stream
    ) override;
    
    // Overrides IOperator::getOutputSize; the output size is the target size
    void getOutputSize(
        int inputWidth, int inputHeight,
        int& outputWidth, int& outputHeight
    ) override;
    
private:
    int targetWidth_;
    int targetHeight_;
    InterpolationMode mode_;
};

Coordinate Mapping (the size ratios use floating-point division, not integer division):

src_x = dst_x * (src_width / dst_width)
src_y = dst_y * (src_height / dst_height)
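The mapping and the BILINEAR mode can be illustrated with a short CPU sketch. It assumes a single-channel float image and clamp-to-edge sampling; `mapCoord` and `sampleBilinear` are hypothetical helpers, and the operator itself may use a different pixel-center convention. Note that the ratio must be computed in floating point: with integers, 2 / 4 would truncate to 0 and every destination pixel would map to source pixel 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Maps a destination index to a source coordinate using the formula above.
float mapCoord(int dst, int srcSize, int dstSize) {
    assert(srcSize > 0 && dstSize > 0);
    return static_cast<float>(dst) *
           (static_cast<float>(srcSize) / static_cast<float>(dstSize));
}

// Bilinear sample of a single-channel image at (fx, fy), clamped to the edge.
float sampleBilinear(const std::vector<float>& img, int w, int h,
                     float fx, float fy) {
    assert(w > 0 && h > 0 && fx >= 0.0f && fy >= 0.0f);
    int x0 = std::min(static_cast<int>(std::floor(fx)), w - 1);
    int y0 = std::min(static_cast<int>(std::floor(fy)), h - 1);
    int x1 = std::min(x0 + 1, w - 1);
    int y1 = std::min(y0 + 1, h - 1);
    float tx = fx - static_cast<float>(x0);  // horizontal blend weight
    float ty = fy - static_cast<float>(y0);  // vertical blend weight

    float top = img[y0 * w + x0] * (1.0f - tx) + img[y0 * w + x1] * tx;
    float bottom = img[y1 * w + x0] * (1.0f - tx) + img[y1 * w + x1] * tx;
    return top * (1.0f - ty) + bottom * ty;
}
```

NEAREST mode would instead round or truncate the mapped coordinate and read a single source pixel.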

Color Conversion Operator

class ColorConvertOperator : public IOperator {
public:
    enum ConversionType { RGB_TO_GRAY, BGR_TO_RGB, RGBA_TO_RGB };
    
    ColorConvertOperator(ConversionType type);
    
    cudaError_t execute(
        const void* input, void* output,
        int width, int height, int channels,
        cudaStream_t stream
    ) override;
    
private:
    ConversionType type_;
    // Luminance weights: Y = 0.299*R + 0.587*G + 0.114*B
    static constexpr float kLumR = 0.299f;
    static constexpr float kLumG = 0.587f;
    static constexpr float kLumB = 0.114f;
};
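The RGB_TO_GRAY conversion with the luminance weights above can be sketched on the CPU as follows. The 8-bit interleaved pixel layout and the round-to-nearest step are assumptions, since the interface passes pixels as `void*`; `rgbToGray` and `convertRgbToGray` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Weighted luminance for one 8-bit RGB pixel, rounded to nearest.
std::uint8_t rgbToGray(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    float y = 0.299f * r + 0.587f * g + 0.114f * b;
    return static_cast<std::uint8_t>(y + 0.5f);
}

// Converts an interleaved RGB buffer (3 bytes per pixel) to one gray byte
// per pixel.
void convertRgbToGray(const std::uint8_t* rgb, std::uint8_t* gray,
                      int pixelCount) {
    assert(rgb != nullptr && gray != nullptr && pixelCount >= 0);
    for (int i = 0; i < pixelCount; ++i)
        gray[i] = rgbToGray(rgb[3 * i], rgb[3 * i + 1], rgb[3 * i + 2]);
}
```

Because the three weights sum to 1, white stays at 255 and black at 0, while pure green (weight 0.587) maps to a much brighter gray than pure blue (weight 0.114).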

Data Models

Image Buffer

struct ImageBuffer {
    void* data;           // Pointer to pixel data
    int width;            // Image width in pixels
    int height;           // Image height in pixels
    int channels;         // Number of channels (1, 3, or 4)
    int stride;           // Row stride in bytes
    bool isDeviceMemory;  // True if data is on GPU
    bool isPinned;        // True if host memory is pinned
    
    size_t sizeInBytes() const {
        return static_cast<size_t>(stride) * height;
    }
};
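Because `sizeInBytes()` multiplies the stride by the height, the stride may be larger than `width * channels` when rows are padded. The sketch below assumes rows are rounded up to an alignment boundary; whether Mini-ImagePipe pads rows, and to what alignment, is not stated here, and both helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Row stride in bytes, rounded up to a multiple of `alignment`.
std::size_t alignedStride(int width, int channels,
                          std::size_t bytesPerChannel, std::size_t alignment) {
    assert(width > 0 && channels > 0 && bytesPerChannel > 0 && alignment > 0);
    std::size_t rowBytes =
        static_cast<std::size_t>(width) * channels * bytesPerChannel;
    return (rowBytes + alignment - 1) / alignment * alignment;
}

// Byte offset of pixel (x, y) in a buffer with the given row stride.
std::size_t pixelOffset(int x, int y, std::size_t stride, int channels,
                        std::size_t bytesPerChannel) {
    assert(x >= 0 && y >= 0);
    return static_cast<std::size_t>(y) * stride +
           static_cast<std::size_t>(x) * channels * bytesPerChannel;
}
```

A 100-pixel-wide, 3-channel, 8-bit row holds 300 bytes of pixels; with 256-byte alignment the stride becomes 512, and addressing must use the stride rather than the 300-byte row length.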

Kernel Configuration

struct KernelConfig {
    dim3 blockSize;   // Thread block dimensions
    dim3 gridSize;    // Grid dimensions
    size_t sharedMem = 0; // Shared memory size in bytes; forImage() leaves it at 0
    
    static KernelConfig forImage(int width, int height, int tileSize = 16) {
        KernelConfig cfg;
        cfg.blockSize = dim3(tileSize, tileSize);
        cfg.gridSize = dim3(
            (width + tileSize - 1) / tileSize,
            (height + tileSize - 1) / tileSize
        );
        return cfg;
    }
};
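The grid size in `forImage` is a ceiling division: it rounds up so that partial tiles at the right and bottom edges are still covered. The host-only sketch below reproduces that arithmetic with a stand-in struct (`Dim2` and `gridFor` are hypothetical) so it runs without the CUDA headers.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3 so the arithmetic runs on the host.
struct Dim2 {
    unsigned x;
    unsigned y;
};

// Number of tileSize x tileSize blocks needed to cover a width x height image.
Dim2 gridFor(int width, int height, int tileSize = 16) {
    assert(width > 0 && height > 0 && tileSize > 0);
    Dim2 grid;
    grid.x = static_cast<unsigned>((width + tileSize - 1) / tileSize);
    grid.y = static_cast<unsigned>((height + tileSize - 1) / tileSize);
    return grid;
}
```

For a 1920×1080 image with 16×16 tiles this gives a 120×68 grid. The last row of blocks covers rows up to 1087, so kernels launched with this configuration need a bounds check before reading or writing pixels.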

Pipeline Configuration

struct PipelineConfig {
    int numStreams = 4;           // Number of CUDA streams
    size_t pinnedPoolSize = 64 * 1024 * 1024;  // 64MB pinned memory pool
    bool enableProfiling = false; // Enable CUDA profiling
    int maxBatchSize = 8;         // Maximum frames in batch
};

Thread Safety

Component     | Thread Safety            | Notes
--------------|--------------------------|-----------------------------------------------------
Pipeline      | Not thread-safe          | One pipeline per thread, or external synchronization
MemoryManager | Thread-safe              | Uses mutex for pool operations
TaskGraph     | Not thread-safe          | Modify only during build phase
Operators     | Implementation-dependent | Most are stateless and thread-safe

Testing Strategy

Unit Tests

Unit tests verify specific examples and edge cases:

  1. Operator Unit Tests
    • Test each operator with known input/output pairs
    • Test boundary conditions (1x1 image, maximum size image)
    • Test error conditions (invalid parameters)
  2. Scheduler Unit Tests
    • Test simple DAG topologies (linear, diamond, fork-join)
    • Test cycle detection with specific cycle patterns
    • Test error propagation with injected failures
  3. Memory Manager Unit Tests
    • Test allocation/free sequences
    • Test pool behavior with specific allocation patterns
    • Test fallback behavior when pinned allocation fails

Property-Based Tests

Property-based tests verify universal properties across randomly generated inputs. Each property test runs a minimum of 100 iterations.

Testing Framework: Google Test with custom property-based testing utilities for CUDA

Property Test Configuration:

// Each property test runs 100+ iterations with random inputs
constexpr int kPropertyTestIterations = 100;

// Random image generator
ImageBuffer generateRandomImage(int minSize, int maxSize, int channels);

// Random DAG generator
TaskGraph generateRandomDAG(int minNodes, int maxNodes, float edgeProbability);

Property Test Annotations: Each property test is annotated with:

// Feature: mini-image-pipe, Property N: [Property Title]
// Validates: Requirements X.Y
TEST(OperatorPropertyTest, GaussianBlurMultiChannel) {
    // Property 1: Gaussian Blur Multi-Channel Support
    // ...
}
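As a sketch of what one such property could look like, the code below generates random DAGs and checks that a topological sort places every dependency before its dependent. It uses plain C++ rather than Google Test so that it stays self-contained. All names are hypothetical, and the generator only mirrors the idea behind `generateRandomDAG`, not its real signature or return type.

```cpp
#include <cassert>
#include <queue>
#include <random>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;  // (from, to): "to depends on from"

// Random DAG: an edge i -> j is only generated for i < j, so no cycle can form.
std::vector<Edge> randomDagEdges(int numNodes, float edgeProbability,
                                 std::mt19937& rng) {
    std::bernoulli_distribution addEdge(edgeProbability);
    std::vector<Edge> edges;
    for (int i = 0; i < numNodes; ++i)
        for (int j = i + 1; j < numNodes; ++j)
            if (addEdge(rng)) edges.emplace_back(i, j);
    return edges;
}

// Kahn's algorithm; returns fewer than numNodes entries if a cycle exists.
std::vector<int> topoSort(int numNodes, const std::vector<Edge>& edges) {
    std::vector<std::vector<int>> successors(numNodes);
    std::vector<int> inDegree(numNodes, 0);
    for (const Edge& e : edges) {
        successors[e.first].push_back(e.second);
        ++inDegree[e.second];
    }
    std::queue<int> ready;
    for (int n = 0; n < numNodes; ++n)
        if (inDegree[n] == 0) ready.push(n);
    std::vector<int> order;
    while (!ready.empty()) {
        int n = ready.front();
        ready.pop();
        order.push_back(n);
        for (int s : successors[n])
            if (--inDegree[s] == 0) ready.push(s);
    }
    return order;
}

// Property: every node appears exactly once and each edge's source
// precedes its target.
bool orderRespectsEdges(int numNodes, const std::vector<Edge>& edges,
                        const std::vector<int>& order) {
    if (static_cast<int>(order.size()) != numNodes) return false;
    std::vector<int> position(numNodes, -1);
    for (int i = 0; i < numNodes; ++i) {
        int n = order[i];
        if (n < 0 || n >= numNodes || position[n] != -1) return false;
        position[n] = i;
    }
    for (const Edge& e : edges)
        if (position[e.first] >= position[e.second]) return false;
    return true;
}

// Runs the property over `iterations` random graphs with a fixed seed,
// so that a failure can be reproduced.
bool runTopoSortProperty(int iterations, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> nodeCount(1, 20);
    for (int it = 0; it < iterations; ++it) {
        int n = nodeCount(rng);
        std::vector<Edge> edges = randomDagEdges(n, 0.3f, rng);
        if (!orderRespectsEdges(n, edges, topoSort(n, edges))) return false;
    }
    return true;
}
```

In a Google Test suite, the body of `runTopoSortProperty` would typically sit inside a `TEST` with `kPropertyTestIterations` as the loop bound, logging the seed on failure.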

Integration Tests

  1. End-to-End Pipeline Tests
    • Test complete pipelines with multiple operators
    • Verify output correctness against reference implementation
  2. Concurrency Tests
    • Test multi-stream execution with varying stream counts
    • Verify no race conditions or data corruption
  3. Performance Benchmarks
    • Measure throughput for standard image sizes
    • Compare separable vs direct convolution performance

Next Steps