Mini-ImagePipe Specification
Purpose
Mini-ImagePipe is a GPU-accelerated image processing pipeline framework built on a task graph (DAG) architecture. The framework targets high-throughput video stream processing, with the entire pipeline executing on the GPU. It is suitable for industrial applications such as autonomous driving perception, medical image processing, and embedded AI.
Glossary
- Task_Graph: A directed acyclic graph (DAG) structure representing dependencies between image processing tasks
- Operator: A pipeline node that performs one specific image transformation (e.g., blur, resize, color conversion)
- Scheduler: A task scheduler responsible for managing and scheduling task execution within the DAG
- CUDA_Stream: A CUDA stream for implementing asynchronous concurrent execution on the GPU
- Pinned_Memory: Page-locked memory for optimizing host-device data transfers
- Separable_Filter: A separable filter that decomposes 2D convolution into two 1D convolutions for improved performance
- Halo_Region: The extra border of neighboring pixels loaded alongside a tile so that convolution at the tile boundary has all the input data it needs
- Shared_Memory: GPU shared memory for accelerating data access within thread blocks
Requirements
Requirement: Gaussian Blur Operator
As a developer, I want to apply Gaussian blur to images, so that I can reduce noise and smooth images in the processing pipeline.
Scenarios
Scenario: Configurable kernel size
- WHEN a Gaussian blur operation is requested
- THEN the Operator SHALL apply a configurable kernel size (3x3, 5x5, 7x7) to the input image
Scenario: Separable filter optimization
- WHEN processing large images
- THEN the Operator SHALL use separable filter optimization to decompose 2D convolution into two 1D passes
Scenario: Shared memory with halo regions
- WHEN executing on GPU
- THEN the Operator SHALL utilize shared memory with halo regions for efficient boundary handling
Scenario: Reflection padding for boundaries
- WHEN the convolution window extends beyond the image border
- THEN the Operator SHALL handle boundary pixels using reflection padding
Scenario: Multi-channel support
- WHEN processing images with varying channel counts
- THEN the Operator SHALL support single-channel (grayscale) and multi-channel (RGB/RGBA) images
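The separable path and reflection padding described above can be illustrated with a minimal CPU-side reference sketch (single channel; the actual Operator runs the same two 1D passes as CUDA kernels with shared-memory tiling, and the function and parameter names here are illustrative, not the real API):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Reflect an out-of-range index back into [0, n-1] (reflection padding:
// index -1 maps to 1, index n maps to n - 2).
static int reflect(int i, int n) {
    if (i < 0) i = -i;
    if (i >= n) i = 2 * n - 2 - i;
    return i;
}

// Separable Gaussian blur: a horizontal 1D pass followed by a vertical
// 1D pass, both using the same normalized 1D kernel.
std::vector<float> gaussian_blur_separable(const std::vector<float>& src,
                                           int w, int h, int ksize, float sigma) {
    int r = ksize / 2;
    std::vector<float> k(ksize);
    float sum = 0.f;
    for (int i = 0; i < ksize; ++i) {
        float x = float(i - r);
        k[i] = std::exp(-x * x / (2.f * sigma * sigma));
        sum += k[i];
    }
    for (float& v : k) v /= sum;  // normalize so the weights sum to 1

    std::vector<float> tmp(src.size()), dst(src.size());
    for (int y = 0; y < h; ++y)          // horizontal pass
        for (int x = 0; x < w; ++x) {
            float acc = 0.f;
            for (int i = -r; i <= r; ++i)
                acc += k[i + r] * src[y * w + reflect(x + i, w)];
            tmp[y * w + x] = acc;
        }
    for (int y = 0; y < h; ++y)          // vertical pass
        for (int x = 0; x < w; ++x) {
            float acc = 0.f;
            for (int i = -r; i <= r; ++i)
                acc += k[i + r] * tmp[reflect(y + i, h) * w + x];
            dst[y * w + x] = acc;
        }
    return dst;
}
```

Because the normalized weights sum to 1, a constant image must pass through unchanged; that invariant doubles as a cheap sanity check against the GPU implementation.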
Requirement: Sobel Edge Detection Operator
As a developer, I want to detect edges in images, so that I can identify object boundaries for downstream processing.
Scenarios
Scenario: Sobel kernel computation
- WHEN a Sobel operation is requested
- THEN the Operator SHALL compute horizontal and vertical gradients using 3x3 Sobel kernels
Scenario: Gradient magnitude calculation
- WHEN computing edge magnitude
- THEN the Operator SHALL calculate the gradient magnitude as sqrt(Gx² + Gy²)
Scenario: Shared memory optimization
- WHEN executing on GPU
- THEN the Operator SHALL use shared memory to minimize global memory access
Scenario: Single-channel output
- WHEN processing any input image
- THEN the Operator SHALL output gradient magnitude as a single-channel image
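A CPU-side reference for the gradient computation above may look as follows (replicated borders stand in for whatever boundary policy the GPU kernel adopts, and the names are illustrative only):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Sobel reference: 3x3 Gx/Gy kernels, single-channel float input,
// single-channel gradient-magnitude output of the same width and height.
std::vector<float> sobel_magnitude(const std::vector<float>& src, int w, int h) {
    static const int GX[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    static const int GY[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};
    auto at = [&](int x, int y) {
        x = std::clamp(x, 0, w - 1);  // replicate border pixels
        y = std::clamp(y, 0, h - 1);
        return src[y * w + x];
    };
    std::vector<float> mag(src.size());
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float gx = 0.f, gy = 0.f;
            for (int j = -1; j <= 1; ++j)
                for (int i = -1; i <= 1; ++i) {
                    gx += GX[j + 1][i + 1] * at(x + i, y + j);
                    gy += GY[j + 1][i + 1] * at(x + i, y + j);
                }
            mag[y * w + x] = std::sqrt(gx * gx + gy * gy);  // sqrt(Gx^2 + Gy^2)
        }
    return mag;
}
```

A flat image yields zero magnitude everywhere, and a vertical step edge yields a known nonzero response; both make good property-test fixtures.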
Requirement: Resize Operator
As a developer, I want to resize images to different resolutions, so that I can adapt images for various processing stages.
Scenarios
Scenario: Bilinear interpolation
- WHEN a resize operation is requested with smooth scaling
- THEN the Operator SHALL support bilinear interpolation for smooth scaling
Scenario: Nearest-neighbor interpolation
- WHEN downscaling images for fast processing
- THEN the Operator SHALL support nearest-neighbor interpolation for fast processing
Scenario: Coordinate mapping
- WHEN the target size is specified
- THEN the Operator SHALL correctly compute output pixel coordinates from input coordinates
Scenario: Arbitrary scale factors
- WHEN resizing with any scale factor
- THEN the Operator SHALL support arbitrary scale factors (both upscaling and downscaling)
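The coordinate mapping in the scenarios above (src = dst * scale, with the scale computed in floating point) can be sketched on the CPU for both interpolation modes; names and signatures are illustrative, and the GPU kernels would compute the same mapping per thread:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Nearest-neighbor: truncate the mapped coordinate, clamped to the image.
std::vector<float> resize_nearest(const std::vector<float>& src, int sw, int sh,
                                  int dw, int dh) {
    float sx = float(sw) / dw, sy = float(sh) / dh;
    std::vector<float> dst(size_t(dw) * dh);
    for (int y = 0; y < dh; ++y)
        for (int x = 0; x < dw; ++x) {
            int ix = std::min(int(x * sx), sw - 1);
            int iy = std::min(int(y * sy), sh - 1);
            dst[y * dw + x] = src[iy * sw + ix];
        }
    return dst;
}

// Bilinear: same mapping, then a 2x2 weighted average of the neighbors.
std::vector<float> resize_bilinear(const std::vector<float>& src, int sw, int sh,
                                   int dw, int dh) {
    float sx = float(sw) / dw, sy = float(sh) / dh;
    std::vector<float> dst(size_t(dw) * dh);
    for (int y = 0; y < dh; ++y)
        for (int x = 0; x < dw; ++x) {
            float fx = x * sx, fy = y * sy;
            int x0 = std::min(int(fx), sw - 1), y0 = std::min(int(fy), sh - 1);
            int x1 = std::min(x0 + 1, sw - 1), y1 = std::min(y0 + 1, sh - 1);
            float ax = fx - x0, ay = fy - y0;
            float top = (1 - ax) * src[y0 * sw + x0] + ax * src[y0 * sw + x1];
            float bot = (1 - ax) * src[y1 * sw + x0] + ax * src[y1 * sw + x1];
            dst[y * dw + x] = (1 - ay) * top + ay * bot;
        }
    return dst;
}
```

With a scale factor of 1 the bilinear weights collapse to 0, so resizing to the same dimensions is an identity transform — another useful property-test case.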
Requirement: Color Conversion Operator
As a developer, I want to convert images between color spaces, so that I can prepare images for different processing algorithms.
Scenarios
Scenario: RGB to Grayscale conversion
- WHEN a color conversion is requested
- THEN the Operator SHALL support RGB to Grayscale conversion using standard luminance weights
Scenario: Luminance formula
- WHEN converting RGB to Grayscale
- THEN the Operator SHALL use the formula: Y = 0.299*R + 0.587*G + 0.114*B
Scenario: BGR to RGB conversion
- WHEN a BGR to RGB conversion is requested
- THEN the Operator SHALL correctly swap channel order
Scenario: Alpha channel preservation
- WHEN converting RGBA images
- THEN the Operator SHALL preserve the alpha channel unchanged during color space conversion
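Both conversions above reduce to per-pixel arithmetic; a minimal CPU reference (interleaved float channels, illustrative names) looks like this:

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Luminance conversion as specified: Y = 0.299*R + 0.587*G + 0.114*B.
// Interleaved RGB input, single-channel output.
std::vector<float> rgb_to_gray(const std::vector<float>& rgb, int pixels) {
    std::vector<float> gray(pixels);
    for (int i = 0; i < pixels; ++i) {
        float r = rgb[3 * i], g = rgb[3 * i + 1], b = rgb[3 * i + 2];
        gray[i] = 0.299f * r + 0.587f * g + 0.114f * b;
    }
    return gray;
}

// BGR -> RGB: swap channels 0 and 2 in place. With channels == 4 (BGRA)
// the alpha channel at index 3 is never touched, so it is preserved.
void bgr_to_rgb(std::vector<float>& img, int pixels, int channels) {
    for (int i = 0; i < pixels; ++i)
        std::swap(img[channels * i], img[channels * i + 2]);
}
```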
Requirement: DAG Task Scheduler
As a developer, I want to define processing pipelines as directed acyclic graphs, so that I can express complex task dependencies and enable parallel execution.
Scenarios
Scenario: Cycle detection
- WHEN tasks are added to the scheduler
- THEN the Scheduler SHALL validate that no circular dependencies exist
Scenario: Dependency constraints
- WHEN executing the task graph
- THEN the Scheduler SHALL respect all dependency constraints between tasks
Scenario: Concurrent execution
- WHEN multiple tasks have no dependencies on each other
- THEN the Scheduler SHALL enable concurrent execution
Scenario: Dependent task notification
- WHEN a task completes
- THEN the Scheduler SHALL notify dependent tasks and trigger their execution when ready
Scenario: Error propagation
- WHEN a task fails during execution
- THEN the Scheduler SHALL propagate the error and halt dependent tasks
Scenario: Topological sorting
- WHEN determining execution order
- THEN the Scheduler SHALL support topological sorting to determine valid execution order
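Topological sorting with cycle detection can be done in one pass with Kahn's algorithm; the sketch below captures the ready-queue behavior the scenarios describe (the Scheduler's real task type and API are not specified here, so plain integer node IDs stand in):

```cpp
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Kahn's algorithm: returns a valid execution order for the task graph,
// or an empty vector if a cycle exists (the Scheduler would reject it).
std::vector<int> topo_sort(int n, const std::vector<std::pair<int, int>>& edges) {
    std::vector<std::vector<int>> adj(n);
    std::vector<int> indeg(n, 0);
    for (auto& e : edges) { adj[e.first].push_back(e.second); ++indeg[e.second]; }

    std::queue<int> ready;                       // tasks with all deps satisfied
    for (int i = 0; i < n; ++i)
        if (indeg[i] == 0) ready.push(i);

    std::vector<int> order;
    while (!ready.empty()) {
        int t = ready.front(); ready.pop();
        order.push_back(t);
        for (int d : adj[t])                     // "notify" dependents
            if (--indeg[d] == 0) ready.push(d);  // dependent becomes ready
    }
    if ((int)order.size() != n) return {};       // leftover nodes => cycle
    return order;
}
```

The same ready-queue structure generalizes to concurrent execution: any tasks sitting in the queue at the same time have no dependency path between them and may run in parallel.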
Requirement: CUDA Streams Concurrency
As a developer, I want to process multiple video streams concurrently, so that I can maximize GPU utilization and throughput.
Scenarios
Scenario: Multi-stream assignment
- WHEN multiple independent tasks are ready
- THEN the Scheduler SHALL assign them to different CUDA streams for concurrent execution
Scenario: Cross-stream synchronization
- WHEN a task depends on another task in a different stream
- THEN the Scheduler SHALL use CUDA events for synchronization
Scenario: Overlapping operations
- WHEN processing multiple video streams
- THEN the Scheduler SHALL enable overlapping of upload, compute, and download operations
Scenario: Configurable stream count
- WHEN configuring the scheduler
- THEN the Scheduler SHALL support a configurable number of CUDA streams (default: 4)
Scenario: Stream synchronization on completion
- WHEN all tasks complete
- THEN the Scheduler SHALL synchronize all streams before returning results
Requirement: Pinned Memory Management
As a developer, I want optimized host-device data transfer, so that I can achieve maximum bandwidth for video stream processing.
Scenarios
Scenario: Pinned memory allocation
- WHEN allocating host memory for data transfer
- THEN the Memory Manager SHALL use cudaHostAlloc for pinned memory allocation
Scenario: Asynchronous memory copies
- WHEN transferring data to GPU
- THEN the Memory Manager SHALL use asynchronous memory copies with CUDA streams
Scenario: Pageable memory fallback
- WHEN pinned memory allocation fails
- THEN the Memory Manager SHALL fall back to pageable memory with a warning
Scenario: Memory pool reuse
- WHEN managing memory allocations
- THEN the Memory Manager SHALL provide a memory pool to reuse pinned memory allocations and reduce allocation overhead
Scenario: Resource cleanup
- WHEN the pipeline shuts down
- THEN the Memory Manager SHALL properly free all pinned memory resources
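The pool-reuse policy above can be sketched host-side; in the real Memory Manager the raw allocation would call cudaHostAlloc and fall back to pageable memory with a warning on failure, but plain malloc stands in here so the reuse logic runs anywhere (the struct and member names are illustrative):

```cpp
#include <cassert>
#include <cstdlib>
#include <map>

// Size-bucketed free list: freed blocks are kept for reuse so repeated
// allocations of the same size avoid hitting the allocator again.
struct PinnedPool {
    std::multimap<std::size_t, void*> free_list;  // size -> reusable blocks
    std::size_t raw_alloc_count = 0;              // times we hit the allocator

    void* allocate(std::size_t size) {
        auto it = free_list.find(size);
        if (it != free_list.end()) {              // reuse a freed block
            void* p = it->second;
            free_list.erase(it);
            return p;
        }
        ++raw_alloc_count;
        return std::malloc(size);                 // stand-in for cudaHostAlloc
    }

    void release(std::size_t size, void* p) { free_list.emplace(size, p); }

    ~PinnedPool() {                               // shutdown: free everything
        for (auto& kv : free_list) std::free(kv.second);
    }
};
```

Property 17 falls out directly: an allocate-free-allocate sequence of the same size touches the underlying allocator only once.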
Requirement: Pipeline Integration
As a developer, I want to chain multiple operators into a complete processing pipeline, so that I can build end-to-end image processing workflows.
Scenarios
Scenario: Pipeline topology
- WHEN building a pipeline
- THEN the Pipeline SHALL allow operators to be connected in sequence or parallel branches
Scenario: Automatic buffer allocation
- WHEN executing a pipeline
- THEN the Pipeline SHALL automatically manage intermediate buffer allocation
Scenario: Shared output for multiple dependents
- WHEN the same intermediate result is used by multiple downstream operators
- THEN the Pipeline SHALL avoid redundant computation
Scenario: Runtime parameter configuration
- WHEN configuring operators
- THEN the Pipeline SHALL support runtime configuration of operator parameters without rebuilding the graph
Scenario: Batch processing
- WHEN processing video streams
- THEN the Pipeline SHALL support batch processing of multiple frames
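The "shared output for multiple dependents" scenario reduces to memoizing each node's result; the sketch below shows the rule in miniature, with string node names and a std::function-based operator type that are illustrative only, not the real Pipeline API:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Each node's output is cached on first execution, so a node with several
// downstream dependents runs exactly once and all dependents read the same
// buffer.
struct MiniPipeline {
    using Op = std::function<std::vector<float>(const std::vector<float>&)>;
    std::map<std::string, Op> ops;
    std::map<std::string, std::string> input_of;      // node -> upstream node
    std::map<std::string, std::vector<float>> cache;  // memoized outputs

    const std::vector<float>& run(const std::string& node,
                                  const std::vector<float>& frame) {
        auto hit = cache.find(node);
        if (hit != cache.end()) return hit->second;   // reuse shared output
        const std::vector<float>& in =
            input_of.count(node) ? run(input_of[node], frame) : frame;
        return cache[node] = ops[node](in);
    }
};
```

With a blur node feeding both a Sobel branch and a resize branch, the blur executes once per frame regardless of how many branches pull from it.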
Properties
Operator Properties
Property 1: Gaussian Blur Multi-Channel Support For any valid image with 1, 3, or 4 channels and for any kernel size (3x3, 5x5, 7x7), applying Gaussian blur SHALL produce an output image with the same dimensions and channel count as the input. Validates: Requirements Gaussian Blur Operator scenarios 1, 5
Property 2: Separable Filter Equivalence For any valid image and Gaussian kernel, the separable filter implementation (two 1D passes) SHALL produce results equivalent to direct 2D convolution within floating-point tolerance (epsilon < 1e-5). Validates: Requirements Gaussian Blur Operator scenario 2
Property 3: Reflection Padding Boundary Handling For any valid image, applying Gaussian blur SHALL produce valid pixel values at all boundary positions (no NaN, no out-of-range values), and boundary pixels SHALL reflect the expected reflection padding behavior. Validates: Requirements Gaussian Blur Operator scenario 4
Property 4: Sobel Gradient Computation For any image with a known edge pattern, the Sobel operator SHALL compute gradient magnitude as sqrt(Gx² + Gy²) where Gx and Gy are computed using standard 3x3 Sobel kernels. Validates: Requirements Sobel Edge Detection Operator scenarios 1, 2
Property 5: Sobel Single-Channel Output For any input image regardless of channel count, the Sobel operator SHALL produce a single-channel output image with the same width and height as the input. Validates: Requirements Sobel Edge Detection Operator scenario 4
Property 6: Resize Coordinate Mapping For any resize operation with bilinear or nearest-neighbor interpolation, the output pixel at (dst_x, dst_y) SHALL be computed from input coordinates (src_x, src_y) where src_x = dst_x * (src_width / dst_width) and src_y = dst_y * (src_height / dst_height), with both ratios computed in floating point. Validates: Requirements Resize Operator scenarios 1, 2, 3
Property 7: Resize Arbitrary Scale Factors For any scale factor s > 0 (both upscaling s > 1 and downscaling s < 1), the resize operator SHALL produce an output image with dimensions (input_width * s, input_height * s) rounded to integers. Validates: Requirements Resize Operator scenario 4
Property 8: RGB to Grayscale Formula For any RGB pixel (R, G, B), the grayscale conversion SHALL produce Y = 0.299*R + 0.587*G + 0.114*B within floating-point tolerance. Validates: Requirements Color Conversion Operator scenario 2
Property 9: BGR to RGB Channel Swap For any BGR image, converting to RGB SHALL swap the first and third channels such that output[0] = input[2] and output[2] = input[0], with the middle channel unchanged. Validates: Requirements Color Conversion Operator scenario 3
Property 10: Alpha Channel Preservation For any RGBA image undergoing color conversion, the alpha channel value SHALL be preserved unchanged in the output. Validates: Requirements Color Conversion Operator scenario 4
Scheduler Properties
Property 11: DAG Cycle Detection For any task graph, adding an edge that would create a cycle SHALL be rejected, and the graph SHALL remain in a valid acyclic state. Validates: Requirements DAG Task Scheduler scenario 1
Property 12: Dependency Ordering For any valid task graph execution, for all tasks T with dependencies D1, D2, …, Dn, task T SHALL only begin execution after ALL of D1, D2, …, Dn have completed. Validates: Requirements DAG Task Scheduler scenarios 2, 4, 6
Property 13: Error Propagation For any task graph where task T fails, all tasks that depend (directly or transitively) on T SHALL NOT execute, and the error SHALL be propagated to the caller. Validates: Requirements DAG Task Scheduler scenario 5
Property 14: Stream Assignment and Synchronization For any pair of independent tasks (no dependency path between them), they SHALL be assignable to different CUDA streams. For any dependent tasks in different streams, proper CUDA event synchronization SHALL be inserted. Validates: Requirements CUDA Streams Concurrency scenarios 1, 2
Property 15: Stream Synchronization on Completion For any pipeline execution, after execute() returns, all output buffers SHALL contain valid, fully computed results (no partial writes, no race conditions). Validates: Requirements CUDA Streams Concurrency scenario 5
Memory Properties
Property 16: Pinned Memory Async Transfer For any data transfer using the Memory Manager, the transfer SHALL complete correctly and the destination buffer SHALL contain an exact copy of the source data. Validates: Requirements Pinned Memory Management scenarios 1, 2
Property 17: Memory Pool Reuse For any sequence of allocate-free-allocate operations of the same size, the memory pool SHALL reuse previously freed blocks, and the total number of cudaHostAlloc calls SHALL be less than the number of allocate requests. Validates: Requirements Pinned Memory Management scenario 4
Property 18: Memory Cleanup For any pipeline lifecycle (create, execute, shutdown), after shutdown() is called, all pinned memory allocations SHALL be freed (no memory leaks). Validates: Requirements Pinned Memory Management scenario 5
Pipeline Properties
Property 19: Pipeline Topology and Buffer Management For any valid pipeline topology (sequential or parallel branches), the pipeline SHALL automatically allocate intermediate buffers of correct size for all connections. Validates: Requirements Pipeline Integration scenarios 1, 2
Property 20: No Redundant Computation For any task node with multiple downstream dependents, the task SHALL execute exactly once, and all dependents SHALL receive the same output buffer reference. Validates: Requirements Pipeline Integration scenario 3
Property 21: Runtime Parameter Configuration For any operator parameter change at runtime, the next pipeline execution SHALL use the updated parameter value without requiring graph reconstruction. Validates: Requirements Pipeline Integration scenario 4
Property 22: Batch Processing For any batch of N frames, the pipeline SHALL process all N frames and produce N corresponding output frames with correct results. Validates: Requirements Pipeline Integration scenario 5
Error Handling
Operator Errors
| Error Condition | Handling Strategy |
|---|---|
| Invalid input dimensions (width/height ≤ 0) | Return cudaErrorInvalidValue, log error |
| Unsupported channel count | Return cudaErrorInvalidValue, log error |
| CUDA kernel launch failure | Return CUDA error code, log kernel name and parameters |
| Device memory allocation failure | Return cudaErrorMemoryAllocation, attempt cleanup |
Scheduler Errors
| Error Condition | Handling Strategy |
|---|---|
| Cycle detected in DAG | Reject edge addition, return false, log cycle path |
| Task execution failure | Mark task as FAILED, propagate to dependents, invoke error callback |
| Stream creation failure | Fall back to fewer streams, log warning |
| Event synchronization failure | Return CUDA error, halt execution |
Memory Manager Errors
| Error Condition | Handling Strategy |
|---|---|
| Pinned memory allocation failure | Fall back to pageable memory, log warning |
| Device memory allocation failure | Return nullptr, log error with requested size |
| Invalid free (double free, invalid pointer) | Log error, ignore operation |
| Async copy failure | Return CUDA error code, do not retry |
Pipeline Errors
| Error Condition | Handling Strategy |
|---|---|
| Invalid operator connection | Reject connection, return error |
| Buffer size mismatch | Reallocate buffer, log warning |
| Batch size exceeds maximum | Process in chunks, log info |