Mini-ImagePipe Specification

Purpose

Mini-ImagePipe is a GPU-accelerated image processing pipeline framework built on a task graph (DAG) architecture. The framework is designed for high-throughput video stream processing scenarios, supporting full GPU pipeline execution. It is suitable for industrial applications such as autonomous driving perception, medical image processing, and embedded AI.

Glossary

  • Task_Graph: A directed acyclic graph (DAG) structure representing dependencies between image processing tasks
  • Operator: A processing unit that performs one specific image transformation (e.g. blur, edge detection, resize, color conversion)
  • Scheduler: A task scheduler responsible for managing and scheduling task execution within the DAG
  • CUDA_Stream: A CUDA stream for implementing asynchronous concurrent execution on the GPU
  • Pinned_Memory: Page-locked memory for optimizing host-device data transfers
  • Separable_Filter: A separable filter that decomposes 2D convolution into two 1D convolutions for improved performance
  • Halo_Region: The extra border of input pixels loaded around a tile so that convolution at tile boundaries has access to the neighboring data it needs
  • Shared_Memory: GPU shared memory for accelerating data access within thread blocks

Requirements

Requirement: Gaussian Blur Operator

As a developer, I want to apply Gaussian blur to images, so that I can reduce noise and smooth images in the processing pipeline.

Scenarios

Scenario: Configurable kernel size

  • WHEN a Gaussian blur operation is requested
  • THEN the Operator SHALL apply a configurable kernel size (3x3, 5x5, 7x7) to the input image

Scenario: Separable filter optimization

  • WHEN processing large images
  • THEN the Operator SHALL use separable filter optimization to decompose 2D convolution into two 1D passes

Scenario: Shared memory with halo regions

  • WHEN executing on GPU
  • THEN the Operator SHALL utilize shared memory with halo regions for efficient boundary handling

Scenario: Reflection padding for boundaries

  • WHEN the input image has edges
  • THEN the Operator SHALL handle boundary pixels using reflection padding

Scenario: Multi-channel support

  • WHEN processing images with varying channel counts
  • THEN the Operator SHALL support single-channel (grayscale) and multi-channel (RGB/RGBA) images
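
The separable two-pass scheme with reflection padding can be sketched as a host-side reference (the names `reflectIndex` and `gaussianBlurSeparable` are illustrative, not part of the spec; the GPU version would additionally stage tiles plus halo regions in shared memory):

```cpp
#include <cmath>
#include <vector>

// Reflection padding: out-of-range indices are mirrored back into [0, n).
inline int reflectIndex(int i, int n) {
    if (i < 0)  return -i - 1;
    if (i >= n) return 2 * n - i - 1;
    return i;
}

// Separable Gaussian blur on a single-channel row-major float image:
// one horizontal 1D pass, then one vertical 1D pass with the same kernel.
std::vector<float> gaussianBlurSeparable(const std::vector<float>& img,
                                         int width, int height,
                                         const std::vector<float>& kernel1d) {
    const int r = static_cast<int>(kernel1d.size()) / 2;
    std::vector<float> tmp(img.size()), out(img.size());
    for (int y = 0; y < height; ++y)        // horizontal pass
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int k = -r; k <= r; ++k)
                acc += kernel1d[k + r] * img[y * width + reflectIndex(x + k, width)];
            tmp[y * width + x] = acc;
        }
    for (int y = 0; y < height; ++y)        // vertical pass
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int k = -r; k <= r; ++k)
                acc += kernel1d[k + r] * tmp[reflectIndex(y + k, height) * width + x];
            out[y * width + x] = acc;
        }
    return out;
}
```

Because the kernel sums to 1, a constant image must pass through unchanged, which makes a convenient sanity check for both the separable decomposition and the boundary handling.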

Requirement: Sobel Edge Detection Operator

As a developer, I want to detect edges in images, so that I can identify object boundaries for downstream processing.

Scenarios

Scenario: Sobel kernel computation

  • WHEN a Sobel operation is requested
  • THEN the Operator SHALL compute horizontal and vertical gradients using 3x3 Sobel kernels

Scenario: Gradient magnitude calculation

  • WHEN computing edge magnitude
  • THEN the Operator SHALL calculate the gradient magnitude as sqrt(Gx² + Gy²)

Scenario: Shared memory optimization

  • WHEN executing on GPU
  • THEN the Operator SHALL use shared memory to minimize global memory access

Scenario: Single-channel output

  • WHEN processing any input image
  • THEN the Operator SHALL output gradient magnitude as a single-channel image
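
A minimal host-side reference for the Sobel scenarios (function name `sobelMagnitude` is illustrative; borders are clamped here for brevity, and the GPU version would tile through shared memory):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sobel gradient magnitude on a single-channel row-major float image.
// Output is single-channel with the same width and height as the input.
std::vector<float> sobelMagnitude(const std::vector<float>& img,
                                  int width, int height) {
    auto at = [&](int x, int y) {
        x = std::max(0, std::min(x, width - 1));    // clamp at borders
        y = std::max(0, std::min(y, height - 1));
        return img[y * width + x];
    };
    static const int kx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};   // horizontal
    static const int ky[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};   // vertical
    std::vector<float> out(img.size());
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            float gx = 0.0f, gy = 0.0f;
            for (int j = -1; j <= 1; ++j)
                for (int i = -1; i <= 1; ++i) {
                    gx += kx[j + 1][i + 1] * at(x + i, y + j);
                    gy += ky[j + 1][i + 1] * at(x + i, y + j);
                }
            out[y * width + x] = std::sqrt(gx * gx + gy * gy);  // sqrt(Gx² + Gy²)
        }
    return out;
}
```

On a vertical step edge the response peaks along the edge column and is zero in flat regions, matching Property 4.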

Requirement: Resize Operator

As a developer, I want to resize images to different resolutions, so that I can adapt images for various processing stages.

Scenarios

Scenario: Bilinear interpolation

  • WHEN a resize operation is requested with smooth scaling
  • THEN the Operator SHALL support bilinear interpolation for smooth scaling

Scenario: Nearest-neighbor interpolation

  • WHEN downscaling images for fast processing
  • THEN the Operator SHALL support nearest-neighbor interpolation for fast processing

Scenario: Coordinate mapping

  • WHEN the target size is specified
  • THEN the Operator SHALL correctly compute output pixel coordinates from input coordinates

Scenario: Arbitrary scale factors

  • WHEN resizing with any scale factor
  • THEN the Operator SHALL support arbitrary scale factors (both upscaling and downscaling)
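
The coordinate mapping from Property 6 (src = dst * src_size / dst_size) can be sketched as follows; `sampleBilinear` and `resizeImage` are illustrative names for a single-channel host-side reference:

```cpp
#include <algorithm>
#include <vector>

// Bilinear sample at fractional source coordinates (sx, sy).
float sampleBilinear(const std::vector<float>& img, int w, int h,
                     float sx, float sy) {
    int x0 = static_cast<int>(sx), y0 = static_cast<int>(sy);
    int x1 = std::min(x0 + 1, w - 1), y1 = std::min(y0 + 1, h - 1);
    float fx = sx - x0, fy = sy - y0;
    float top = img[y0 * w + x0] * (1 - fx) + img[y0 * w + x1] * fx;
    float bot = img[y1 * w + x0] * (1 - fx) + img[y1 * w + x1] * fx;
    return top * (1 - fy) + bot * fy;
}

// Resize with src_x = dst_x * (src_width / dst_width), and likewise for y.
std::vector<float> resizeImage(const std::vector<float>& img, int sw, int sh,
                               int dw, int dh, bool bilinear) {
    std::vector<float> out(dw * dh);
    float sxScale = static_cast<float>(sw) / dw;
    float syScale = static_cast<float>(sh) / dh;
    for (int y = 0; y < dh; ++y)
        for (int x = 0; x < dw; ++x) {
            float sx = x * sxScale, sy = y * syScale;
            out[y * dw + x] = bilinear
                ? sampleBilinear(img, sw, sh, sx, sy)
                : img[static_cast<int>(sy) * sw + static_cast<int>(sx)];  // nearest
        }
    return out;
}
```

The same code path handles upscaling and downscaling, since the scale factor enters only through `sxScale`/`syScale`.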

Requirement: Color Conversion Operator

As a developer, I want to convert images between color spaces, so that I can prepare images for different processing algorithms.

Scenarios

Scenario: RGB to Grayscale conversion

  • WHEN a color conversion is requested
  • THEN the Operator SHALL support RGB to Grayscale conversion using standard luminance weights

Scenario: Luminance formula

  • WHEN converting RGB to Grayscale
  • THEN the Operator SHALL use the formula: Y = 0.299*R + 0.587*G + 0.114*B

Scenario: BGR to RGB conversion

  • WHEN a BGR to RGB conversion is requested
  • THEN the Operator SHALL correctly swap channel order

Scenario: Alpha channel preservation

  • WHEN converting RGBA images
  • THEN the Operator SHALL preserve the alpha channel unchanged during color space conversion

Requirement: DAG Task Scheduler

As a developer, I want to define processing pipelines as directed acyclic graphs, so that I can express complex task dependencies and enable parallel execution.

Scenarios

Scenario: Cycle detection

  • WHEN tasks are added to the scheduler
  • THEN the Scheduler SHALL validate that no circular dependencies exist

Scenario: Dependency constraints

  • WHEN executing the task graph
  • THEN the Scheduler SHALL respect all dependency constraints between tasks

Scenario: Concurrent execution

  • WHEN multiple tasks have no dependencies on each other
  • THEN the Scheduler SHALL enable concurrent execution

Scenario: Dependent task notification

  • WHEN a task completes
  • THEN the Scheduler SHALL notify dependent tasks and trigger their execution when ready

Scenario: Error propagation

  • WHEN a task fails during execution
  • THEN the Scheduler SHALL propagate the error and halt dependent tasks

Scenario: Topological sorting

  • WHEN determining execution order
  • THEN the Scheduler SHALL support topological sorting to determine valid execution order
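
Topological sorting and cycle detection can share one routine; a common approach (and the one sketched here, under the assumption the Scheduler uses it) is Kahn's algorithm, where failing to order every task implies a cycle:

```cpp
#include <queue>
#include <utility>
#include <vector>

// Kahn's algorithm. deps holds (from, to): task `to` depends on task `from`.
// Returns a valid execution order, or an empty vector if a cycle exists
// (in which case the Scheduler would reject the graph).
std::vector<int> topologicalOrder(int numTasks,
                                  const std::vector<std::pair<int, int>>& deps) {
    std::vector<std::vector<int>> adj(numTasks);
    std::vector<int> indeg(numTasks, 0);
    for (auto& [from, to] : deps) { adj[from].push_back(to); ++indeg[to]; }
    std::queue<int> ready;
    for (int t = 0; t < numTasks; ++t)
        if (indeg[t] == 0) ready.push(t);        // tasks with no dependencies
    std::vector<int> order;
    while (!ready.empty()) {
        int t = ready.front(); ready.pop();
        order.push_back(t);
        for (int d : adj[t])                     // notify dependents
            if (--indeg[d] == 0) ready.push(d);
    }
    if (static_cast<int>(order.size()) != numTasks) order.clear();  // cycle
    return order;
}
```

The `ready` queue is also the natural hook for concurrency: every task in it at the same time is dependency-free and may run in parallel.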

Requirement: CUDA Streams Concurrency

As a developer, I want to process multiple video streams concurrently, so that I can maximize GPU utilization and throughput.

Scenarios

Scenario: Multi-stream assignment

  • WHEN multiple independent tasks are ready
  • THEN the Scheduler SHALL assign them to different CUDA streams for concurrent execution

Scenario: Cross-stream synchronization

  • WHEN a task depends on another task in a different stream
  • THEN the Scheduler SHALL use CUDA events for synchronization

Scenario: Overlapping operations

  • WHEN processing multiple video streams
  • THEN the Scheduler SHALL enable overlapping of upload, compute, and download operations

Scenario: Configurable stream count

  • WHEN configuring the scheduler
  • THEN the Scheduler SHALL support a configurable number of CUDA streams (default: 4)

Scenario: Stream synchronization on completion

  • WHEN all tasks complete
  • THEN the Scheduler SHALL synchronize all streams before returning results
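
One way to model the assignment and synchronization policy without GPU code is a planning pass: distribute tasks round-robin across streams, and record an event wait wherever a dependency crosses streams (standing in for cudaEventRecord / cudaStreamWaitEvent). The names `StreamPlan` and `planStreams` are illustrative, not part of the spec:

```cpp
#include <utility>
#include <vector>

struct StreamPlan {
    std::vector<int> streamOf;                   // task id -> stream index
    std::vector<std::pair<int, int>> eventWaits; // (upstream, downstream) pairs
};

// deps holds (from, to): task `to` depends on task `from`.
StreamPlan planStreams(int numTasks,
                       const std::vector<std::pair<int, int>>& deps,
                       int numStreams = 4) {     // default: 4 streams
    StreamPlan plan;
    plan.streamOf.resize(numTasks);
    for (int t = 0; t < numTasks; ++t)
        plan.streamOf[t] = t % numStreams;       // round-robin assignment
    for (auto& [from, to] : deps)
        if (plan.streamOf[from] != plan.streamOf[to])
            plan.eventWaits.push_back({from, to});  // cross-stream sync needed
    return plan;
}
```

Same-stream dependencies need no events because a CUDA stream already executes its work in issue order; only the cross-stream edges require explicit synchronization.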

Requirement: Pinned Memory Management

As a developer, I want optimized host-device data transfer, so that I can achieve maximum bandwidth for video stream processing.

Scenarios

Scenario: Pinned memory allocation

  • WHEN allocating host memory for data transfer
  • THEN the Memory Manager SHALL use cudaHostAlloc for pinned memory allocation

Scenario: Asynchronous memory copies

  • WHEN transferring data to GPU
  • THEN the Memory Manager SHALL use asynchronous memory copies with CUDA streams

Scenario: Pageable memory fallback

  • WHEN pinned memory allocation fails
  • THEN the Memory Manager SHALL fall back to pageable memory with a warning

Scenario: Memory pool reuse

  • WHEN managing memory allocations
  • THEN the Memory Manager SHALL provide a memory pool to reuse pinned memory allocations and reduce allocation overhead

Scenario: Resource cleanup

  • WHEN the pipeline shuts down
  • THEN the Memory Manager SHALL properly free all pinned memory resources
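
The pooling behavior can be sketched as a per-size free list; `PinnedPool` is an illustrative name, and std::malloc stands in for cudaHostAlloc so the reuse logic can be exercised without a GPU (the real pool would also implement the pageable-memory fallback):

```cpp
#include <cstdlib>
#include <map>
#include <vector>

class PinnedPool {
public:
    void* allocate(size_t size) {
        auto& freeList = free_[size];
        if (!freeList.empty()) {                 // reuse a previously freed block
            void* p = freeList.back();
            freeList.pop_back();
            return p;
        }
        ++allocCalls_;                           // would be a cudaHostAlloc call
        return std::malloc(size);
    }
    void release(void* p, size_t size) { free_[size].push_back(p); }
    ~PinnedPool() {                              // cleanup on shutdown
        for (auto& [size, blocks] : free_)
            for (void* p : blocks) std::free(p);
    }
    int allocCalls() const { return allocCalls_; }
private:
    std::map<size_t, std::vector<void*>> free_;  // size -> freed blocks
    int allocCalls_ = 0;
};
```

Property 17 falls out directly: an allocate-free-allocate sequence of the same size triggers only one underlying allocation.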

Requirement: Pipeline Integration

As a developer, I want to chain multiple operators into a complete processing pipeline, so that I can build end-to-end image processing workflows.

Scenarios

Scenario: Pipeline topology

  • WHEN building a pipeline
  • THEN the Pipeline SHALL allow operators to be connected in sequence or parallel branches

Scenario: Automatic buffer allocation

  • WHEN executing a pipeline
  • THEN the Pipeline SHALL automatically manage intermediate buffer allocation

Scenario: Shared output for multiple dependents

  • WHEN the same intermediate result is used by multiple downstream operators
  • THEN the Pipeline SHALL avoid redundant computation

Scenario: Runtime parameter configuration

  • WHEN configuring operators
  • THEN the Pipeline SHALL support runtime configuration of operator parameters without rebuilding the graph

Scenario: Batch processing

  • WHEN processing video streams
  • THEN the Pipeline SHALL support batch processing of multiple frames
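
The shared-output rule (each node executes once even with multiple consumers) is naturally expressed as memoized evaluation over the graph; `PipelineNode` and `evaluateNode` are illustrative names for a minimal host-side sketch:

```cpp
#include <functional>
#include <map>
#include <vector>

struct PipelineNode {
    std::vector<int> inputs;   // upstream node ids
    std::function<std::vector<float>(const std::vector<std::vector<float>>&)> op;
};

// Evaluate node `id`, caching each node's output so it runs exactly once;
// every downstream consumer reads the cached buffer.
std::vector<float> evaluateNode(int id,
                                const std::map<int, PipelineNode>& graph,
                                std::map<int, std::vector<float>>& cache,
                                int& execCount) {
    if (auto it = cache.find(id); it != cache.end()) return it->second;
    const PipelineNode& node = graph.at(id);
    std::vector<std::vector<float>> args;
    for (int in : node.inputs)
        args.push_back(evaluateNode(in, graph, cache, execCount));
    ++execCount;                           // the operator body runs here, once
    return cache[id] = node.op(args);
}
```

On a diamond topology (one source feeding two branches that merge), the source executes once even though it is consumed twice, matching Property 20.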

Properties

Operator Properties

Property 1: Gaussian Blur Multi-Channel Support For any valid image with 1, 3, or 4 channels and for any kernel size (3x3, 5x5, 7x7), applying Gaussian blur SHALL produce an output image with the same dimensions and channel count as the input. Validates: Requirements Gaussian Blur Operator scenarios 1, 5

Property 2: Separable Filter Equivalence For any valid image and Gaussian kernel, the separable filter implementation (two 1D passes) SHALL produce results equivalent to direct 2D convolution within floating-point tolerance (epsilon < 1e-5). Validates: Requirements Gaussian Blur Operator scenario 2

Property 3: Reflection Padding Boundary Handling For any valid image, applying Gaussian blur SHALL produce valid pixel values at all boundary positions (no NaN, no out-of-range values), and boundary pixels SHALL reflect the expected reflection padding behavior. Validates: Requirements Gaussian Blur Operator scenario 4

Property 4: Sobel Gradient Computation For any image with a known edge pattern, the Sobel operator SHALL compute gradient magnitude as sqrt(Gx² + Gy²) where Gx and Gy are computed using standard 3x3 Sobel kernels. Validates: Requirements Sobel Edge Detection Operator scenarios 1, 2

Property 5: Sobel Single-Channel Output For any input image regardless of channel count, the Sobel operator SHALL produce a single-channel output image with the same width and height as the input. Validates: Requirements Sobel Edge Detection Operator scenario 4

Property 6: Resize Coordinate Mapping For any resize operation with bilinear or nearest-neighbor interpolation, output pixel at (dst_x, dst_y) SHALL be computed from input coordinates (src_x, src_y) where src_x = dst_x * (src_width / dst_width) and src_y = dst_y * (src_height / dst_height). Validates: Requirements Resize Operator scenarios 1, 2, 3

Property 7: Resize Arbitrary Scale Factors For any scale factor s > 0 (both upscaling s > 1 and downscaling s < 1), the resize operator SHALL produce an output image with dimensions (input_width * s, input_height * s) rounded to integers. Validates: Requirements Resize Operator scenario 4

Property 8: RGB to Grayscale Formula For any RGB pixel (R, G, B), the grayscale conversion SHALL produce Y = 0.299*R + 0.587*G + 0.114*B within floating-point tolerance. Validates: Requirements Color Conversion Operator scenario 2

Property 9: BGR to RGB Channel Swap For any BGR image, converting to RGB SHALL swap the first and third channels such that output[0] = input[2] and output[2] = input[0], with the middle channel unchanged. Validates: Requirements Color Conversion Operator scenario 3

Property 10: Alpha Channel Preservation For any RGBA image undergoing color conversion, the alpha channel value SHALL be preserved unchanged in the output. Validates: Requirements Color Conversion Operator scenario 4

Scheduler Properties

Property 11: DAG Cycle Detection For any task graph, adding an edge that would create a cycle SHALL be rejected, and the graph SHALL remain in a valid acyclic state. Validates: Requirements DAG Task Scheduler scenario 1

Property 12: Dependency Ordering For any valid task graph execution, for all tasks T with dependencies D1, D2, …, Dn, task T SHALL only begin execution after ALL of D1, D2, …, Dn have completed. Validates: Requirements DAG Task Scheduler scenarios 2, 4, 6

Property 13: Error Propagation For any task graph where task T fails, all tasks that depend (directly or transitively) on T SHALL NOT execute, and the error SHALL be propagated to the caller. Validates: Requirements DAG Task Scheduler scenario 5

Property 14: Stream Assignment and Synchronization For any pair of independent tasks (no dependency path between them), they SHALL be assignable to different CUDA streams. For any dependent tasks in different streams, proper CUDA event synchronization SHALL be inserted. Validates: Requirements CUDA Streams Concurrency scenarios 1, 2

Property 15: Stream Synchronization on Completion For any pipeline execution, after execute() returns, all output buffers SHALL contain valid, fully computed results (no partial writes, no race conditions). Validates: Requirements CUDA Streams Concurrency scenario 5

Memory Properties

Property 16: Pinned Memory Async Transfer For any data transfer using the Memory Manager, the transfer SHALL complete correctly and the destination buffer SHALL contain an exact copy of the source data. Validates: Requirements Pinned Memory Management scenarios 1, 2

Property 17: Memory Pool Reuse For any sequence of allocate-free-allocate operations of the same size, the memory pool SHALL reuse previously freed blocks, and the total number of cudaHostAlloc calls SHALL be less than the number of allocate requests. Validates: Requirements Pinned Memory Management scenario 4

Property 18: Memory Cleanup For any pipeline lifecycle (create, execute, shutdown), after shutdown() is called, all pinned memory allocations SHALL be freed (no memory leaks). Validates: Requirements Pinned Memory Management scenario 5

Pipeline Properties

Property 19: Pipeline Topology and Buffer Management For any valid pipeline topology (sequential or parallel branches), the pipeline SHALL automatically allocate intermediate buffers of correct size for all connections. Validates: Requirements Pipeline Integration scenarios 1, 2

Property 20: No Redundant Computation For any task node with multiple downstream dependents, the task SHALL execute exactly once, and all dependents SHALL receive the same output buffer reference. Validates: Requirements Pipeline Integration scenario 3

Property 21: Runtime Parameter Configuration For any operator parameter change at runtime, the next pipeline execution SHALL use the updated parameter value without requiring graph reconstruction. Validates: Requirements Pipeline Integration scenario 4

Property 22: Batch Processing For any batch of N frames, the pipeline SHALL process all N frames and produce N corresponding output frames with correct results. Validates: Requirements Pipeline Integration scenario 5

Error Handling

Operator Errors

| Error Condition | Handling Strategy |
|---|---|
| Invalid input dimensions (width/height ≤ 0) | Return cudaErrorInvalidValue, log error |
| Unsupported channel count | Return cudaErrorInvalidValue, log error |
| CUDA kernel launch failure | Return CUDA error code, log kernel name and parameters |
| Device memory allocation failure | Return cudaErrorMemoryAllocation, attempt cleanup |

Scheduler Errors

| Error Condition | Handling Strategy |
|---|---|
| Cycle detected in DAG | Reject edge addition, return false, log cycle path |
| Task execution failure | Mark task as FAILED, propagate to dependents, invoke error callback |
| Stream creation failure | Fall back to fewer streams, log warning |
| Event synchronization failure | Return CUDA error, halt execution |

Memory Manager Errors

| Error Condition | Handling Strategy |
|---|---|
| Pinned memory allocation failure | Fall back to pageable memory, log warning |
| Device memory allocation failure | Return nullptr, log error with requested size |
| Invalid free (double free, invalid pointer) | Log error, ignore operation |
| Async copy failure | Return CUDA error code, do not retry |

Pipeline Errors

| Error Condition | Handling Strategy |
|---|---|
| Invalid operator connection | Reject connection, return error |
| Buffer size mismatch | Reallocate buffer, log warning |
| Batch size exceeds maximum | Process in chunks, log info |