Mini-ImagePipe Specification
Purpose
Mini-ImagePipe is a GPU-accelerated image processing pipeline framework built on a task graph (DAG) architecture. The framework targets high-throughput video stream processing, with the entire pipeline executing on the GPU. It is suitable for industrial applications such as autonomous driving perception, medical image processing, and embedded AI.
Glossary
- Task_Graph: A directed acyclic graph (DAG) structure representing dependencies between image processing tasks
- Operator: A pipeline node that performs one specific image transformation (e.g., blur, resize, color conversion)
- Scheduler: A task scheduler responsible for managing and scheduling task execution within the DAG
- CUDA_Stream: A CUDA stream for implementing asynchronous concurrent execution on the GPU
- Pinned_Memory: Page-locked memory for optimizing host-device data transfers
- Separable_Filter: A separable filter that decomposes 2D convolution into two 1D convolutions for improved performance
- Halo_Region: The extra border of neighboring pixels loaded alongside a tile so that convolution at the tile boundary has all the input data it needs
- Shared_Memory: GPU shared memory for accelerating data access within thread blocks
Requirements
Requirement: Gaussian Blur Operator
As a developer, I want to apply Gaussian blur to images, so that I can reduce noise and smooth images in the processing pipeline.
Scenarios
Scenario: Configurable kernel size
- WHEN a Gaussian blur operation is requested
- THEN the Operator SHALL apply a configurable kernel size (3x3, 5x5, 7x7) to the input image
Scenario: Separable filter optimization
- WHEN processing large images
- THEN the Operator SHALL use separable filter optimization to decompose 2D convolution into two 1D passes
Scenario: Shared memory with halo regions
- WHEN executing on GPU
- THEN the Operator SHALL utilize shared memory with halo regions for efficient boundary handling
Scenario: Reflection padding for boundaries
- WHEN the convolution window extends beyond the image border
- THEN the Operator SHALL handle boundary pixels using reflection padding
Scenario: Multi-channel support
- WHEN processing images with varying channel counts
- THEN the Operator SHALL support single-channel (grayscale) and multi-channel (RGB/RGBA) images
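The separable path and reflection padding described above can be illustrated with a minimal CPU-side reference sketch (single channel; the actual Operator runs the same two 1D passes as CUDA kernels with shared-memory tiling, and the function and parameter names here are illustrative, not the real API):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Reflect an out-of-range index back into [0, n-1] (reflection padding:
// index -1 maps to 1, index n maps to n - 2).
static int reflect(int i, int n) {
    if (i < 0) i = -i;
    if (i >= n) i = 2 * n - 2 - i;
    return i;
}

// Separable Gaussian blur: a horizontal 1D pass followed by a vertical
// 1D pass, both using the same normalized 1D kernel.
std::vector<float> gaussian_blur_separable(const std::vector<float>& src,
                                           int w, int h, int ksize, float sigma) {
    int r = ksize / 2;
    std::vector<float> k(ksize);
    float sum = 0.f;
    for (int i = 0; i < ksize; ++i) {
        float x = float(i - r);
        k[i] = std::exp(-x * x / (2.f * sigma * sigma));
        sum += k[i];
    }
    for (float& v : k) v /= sum;  // normalize so the weights sum to 1

    std::vector<float> tmp(src.size()), dst(src.size());
    for (int y = 0; y < h; ++y)          // horizontal pass
        for (int x = 0; x < w; ++x) {
            float acc = 0.f;
            for (int i = -r; i <= r; ++i)
                acc += k[i + r] * src[y * w + reflect(x + i, w)];
            tmp[y * w + x] = acc;
        }
    for (int y = 0; y < h; ++y)          // vertical pass
        for (int x = 0; x < w; ++x) {
            float acc = 0.f;
            for (int i = -r; i <= r; ++i)
                acc += k[i + r] * tmp[reflect(y + i, h) * w + x];
            dst[y * w + x] = acc;
        }
    return dst;
}
```

Because the normalized weights sum to 1, a constant image must pass through unchanged; that invariant doubles as a cheap sanity check against the GPU implementation.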
Requirement: Sobel Edge Detection Operator
As a developer, I want to detect edges in images, so that I can identify object boundaries for downstream processing.
Scenarios
Scenario: Sobel kernel computation
- WHEN a Sobel operation is requested
- THEN the Operator SHALL compute horizontal and vertical gradients using 3x3 Sobel kernels
Scenario: Gradient magnitude calculation
- WHEN computing edge magnitude
- THEN the Operator SHALL calculate the gradient magnitude as sqrt(Gx² + Gy²)
Scenario: Shared memory optimization
- WHEN executing on GPU
- THEN the Operator SHALL use shared memory to minimize global memory access
Scenario: Single-channel output
- WHEN processing any input image
- THEN the Operator SHALL output gradient magnitude as a single-channel image
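A CPU-side reference for the gradient computation above may look as follows (replicated borders stand in for whatever boundary policy the GPU kernel adopts, and the names are illustrative only):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Sobel reference: 3x3 Gx/Gy kernels, single-channel float input,
// single-channel gradient-magnitude output of the same width and height.
std::vector<float> sobel_magnitude(const std::vector<float>& src, int w, int h) {
    static const int GX[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    static const int GY[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};
    auto at = [&](int x, int y) {
        x = std::clamp(x, 0, w - 1);  // replicate border pixels
        y = std::clamp(y, 0, h - 1);
        return src[y * w + x];
    };
    std::vector<float> mag(src.size());
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float gx = 0.f, gy = 0.f;
            for (int j = -1; j <= 1; ++j)
                for (int i = -1; i <= 1; ++i) {
                    gx += GX[j + 1][i + 1] * at(x + i, y + j);
                    gy += GY[j + 1][i + 1] * at(x + i, y + j);
                }
            mag[y * w + x] = std::sqrt(gx * gx + gy * gy);  // sqrt(Gx^2 + Gy^2)
        }
    return mag;
}
```

A flat image yields zero magnitude everywhere, and a vertical step edge yields a known nonzero response; both make good property-test fixtures.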
Requirement: Resize Operator
As a developer, I want to resize images to different resolutions, so that I can adapt images for various processing stages.
Scenarios
Scenario: Bilinear interpolation
- WHEN a resize operation is requested with smooth scaling
- THEN the Operator SHALL support bilinear interpolation for smooth scaling
Scenario: Nearest-neighbor interpolation
- WHEN downscaling images for fast processing
- THEN the Operator SHALL support nearest-neighbor interpolation for fast processing
Scenario: Coordinate mapping
- WHEN the target size is specified
- THEN the Operator SHALL correctly compute output pixel coordinates from input coordinates
Scenario: Arbitrary scale factors
- WHEN resizing with any scale factor
- THEN the Operator SHALL support arbitrary scale factors (both upscaling and downscaling)
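The coordinate mapping in the scenarios above (src = dst * scale, with the scale computed in floating point) can be sketched on the CPU for both interpolation modes; names and signatures are illustrative, and the GPU kernels would compute the same mapping per thread:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Nearest-neighbor: truncate the mapped coordinate, clamped to the image.
std::vector<float> resize_nearest(const std::vector<float>& src, int sw, int sh,
                                  int dw, int dh) {
    float sx = float(sw) / dw, sy = float(sh) / dh;
    std::vector<float> dst(size_t(dw) * dh);
    for (int y = 0; y < dh; ++y)
        for (int x = 0; x < dw; ++x) {
            int ix = std::min(int(x * sx), sw - 1);
            int iy = std::min(int(y * sy), sh - 1);
            dst[y * dw + x] = src[iy * sw + ix];
        }
    return dst;
}

// Bilinear: same mapping, then a 2x2 weighted average of the neighbors.
std::vector<float> resize_bilinear(const std::vector<float>& src, int sw, int sh,
                                   int dw, int dh) {
    float sx = float(sw) / dw, sy = float(sh) / dh;
    std::vector<float> dst(size_t(dw) * dh);
    for (int y = 0; y < dh; ++y)
        for (int x = 0; x < dw; ++x) {
            float fx = x * sx, fy = y * sy;
            int x0 = std::min(int(fx), sw - 1), y0 = std::min(int(fy), sh - 1);
            int x1 = std::min(x0 + 1, sw - 1), y1 = std::min(y0 + 1, sh - 1);
            float ax = fx - x0, ay = fy - y0;
            float top = (1 - ax) * src[y0 * sw + x0] + ax * src[y0 * sw + x1];
            float bot = (1 - ax) * src[y1 * sw + x0] + ax * src[y1 * sw + x1];
            dst[y * dw + x] = (1 - ay) * top + ay * bot;
        }
    return dst;
}
```

With a scale factor of 1 the bilinear weights collapse to 0, so resizing to the same dimensions is an identity transform — another useful property-test case.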
Requirement: Color Conversion Operator
As a developer, I want to convert images between color spaces, so that I can prepare images for different processing algorithms.
Scenarios
Scenario: RGB to Grayscale conversion
- WHEN a color conversion is requested
- THEN the Operator SHALL support RGB to Grayscale conversion using standard luminance weights
Scenario: Luminance formula
- WHEN converting RGB to Grayscale
- THEN the Operator SHALL use the formula: Y = 0.299*R + 0.587*G + 0.114*B
Scenario: BGR to RGB conversion
- WHEN a BGR to RGB conversion is requested
- THEN the Operator SHALL correctly swap channel order
Scenario: Alpha channel preservation
- WHEN converting RGBA images
- THEN the Operator SHALL preserve the alpha channel unchanged during color space conversion
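Both conversions above reduce to per-pixel arithmetic; a minimal CPU reference (interleaved float channels, illustrative names) looks like this:

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Luminance conversion as specified: Y = 0.299*R + 0.587*G + 0.114*B.
// Interleaved RGB input, single-channel output.
std::vector<float> rgb_to_gray(const std::vector<float>& rgb, int pixels) {
    std::vector<float> gray(pixels);
    for (int i = 0; i < pixels; ++i) {
        float r = rgb[3 * i], g = rgb[3 * i + 1], b = rgb[3 * i + 2];
        gray[i] = 0.299f * r + 0.587f * g + 0.114f * b;
    }
    return gray;
}

// BGR -> RGB: swap channels 0 and 2 in place. With channels == 4 (BGRA)
// the alpha channel at index 3 is never touched, so it is preserved.
void bgr_to_rgb(std::vector<float>& img, int pixels, int channels) {
    for (int i = 0; i < pixels; ++i)
        std::swap(img[channels * i], img[channels * i + 2]);
}
```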
Requirement: DAG Task Scheduler
As a developer, I want to define processing pipelines as directed acyclic graphs, so that I can express complex task dependencies and enable parallel execution.
Scenarios
Scenario: Cycle detection
- WHEN tasks are added to the scheduler
- THEN the Scheduler SHALL validate that no circular dependencies exist
Scenario: Dependency constraints
- WHEN executing the task graph
- THEN the Scheduler SHALL respect all dependency constraints between tasks
Scenario: Concurrent execution
- WHEN multiple tasks have no dependencies on each other
- THEN the Scheduler SHALL enable concurrent execution
Scenario: Dependent task notification
- WHEN a task completes
- THEN the Scheduler SHALL notify dependent tasks and trigger their execution when ready
Scenario: Error propagation
- WHEN a task fails during execution
- THEN the Scheduler SHALL propagate the error and halt dependent tasks
Scenario: Topological sorting
- WHEN determining execution order
- THEN the Scheduler SHALL support topological sorting to determine valid execution order
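Topological sorting with cycle detection can be done in one pass with Kahn's algorithm; the sketch below captures the ready-queue behavior the scenarios describe (the Scheduler's real task type and API are not specified here, so plain integer node IDs stand in):

```cpp
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Kahn's algorithm: returns a valid execution order for the task graph,
// or an empty vector if a cycle exists (the Scheduler would reject it).
std::vector<int> topo_sort(int n, const std::vector<std::pair<int, int>>& edges) {
    std::vector<std::vector<int>> adj(n);
    std::vector<int> indeg(n, 0);
    for (auto& e : edges) { adj[e.first].push_back(e.second); ++indeg[e.second]; }

    std::queue<int> ready;                       // tasks with all deps satisfied
    for (int i = 0; i < n; ++i)
        if (indeg[i] == 0) ready.push(i);

    std::vector<int> order;
    while (!ready.empty()) {
        int t = ready.front(); ready.pop();
        order.push_back(t);
        for (int d : adj[t])                     // "notify" dependents
            if (--indeg[d] == 0) ready.push(d);  // dependent becomes ready
    }
    if ((int)order.size() != n) return {};       // leftover nodes => cycle
    return order;
}
```

The same ready-queue structure generalizes to concurrent execution: any tasks sitting in the queue at the same time have no dependency path between them and may run in parallel.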
Requirement: CUDA Streams Concurrency
As a developer, I want to process multiple video streams concurrently, so that I can maximize GPU utilization and throughput.
Scenarios
Scenario: Multi-stream assignment
- WHEN multiple independent tasks are ready
- THEN the Scheduler SHALL assign them to different CUDA streams for concurrent execution
Scenario: Cross-stream synchronization
- WHEN a task depends on another task in a different stream
- THEN the Scheduler SHALL use CUDA events for synchronization
Scenario: Overlapping operations
- WHEN processing multiple video streams
- THEN the Scheduler SHALL enable overlapping of upload, compute, and download operations
Scenario: Configurable stream count
- WHEN configuring the scheduler
- THEN the Scheduler SHALL support a configurable number of CUDA streams (default: 4)
Scenario: Stream synchronization on completion
- WHEN all tasks complete
- THEN the Scheduler SHALL synchronize all streams before returning results
Requirement: Pinned Memory Management
As a developer, I want optimized host-device data transfer, so that I can achieve maximum bandwidth for video stream processing.
Scenarios
Scenario: Pinned memory allocation
- WHEN allocating host memory for data transfer
- THEN the Memory Manager SHALL use cudaHostAlloc for pinned memory allocation
Scenario: Asynchronous memory copies
- WHEN transferring data to GPU
- THEN the Memory Manager SHALL use asynchronous memory copies with CUDA streams
Scenario: Pageable memory fallback
- WHEN pinned memory allocation fails
- THEN the Memory Manager SHALL fall back to pageable memory with a warning
Scenario: Memory pool reuse
- WHEN managing memory allocations
- THEN the Memory Manager SHALL provide a memory pool to reuse pinned memory allocations and reduce allocation overhead
Scenario: Resource cleanup
- WHEN the pipeline shuts down
- THEN the Memory Manager SHALL properly free all pinned memory resources
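The pool-reuse policy above can be sketched host-side; in the real Memory Manager the raw allocation would call cudaHostAlloc and fall back to pageable memory with a warning on failure, but plain malloc stands in here so the reuse logic runs anywhere (the struct and member names are illustrative):

```cpp
#include <cassert>
#include <cstdlib>
#include <map>

// Size-bucketed free list: freed blocks are kept for reuse so repeated
// allocations of the same size avoid hitting the allocator again.
struct PinnedPool {
    std::multimap<std::size_t, void*> free_list;  // size -> reusable blocks
    std::size_t raw_alloc_count = 0;              // times we hit the allocator

    void* allocate(std::size_t size) {
        auto it = free_list.find(size);
        if (it != free_list.end()) {              // reuse a freed block
            void* p = it->second;
            free_list.erase(it);
            return p;
        }
        ++raw_alloc_count;
        return std::malloc(size);                 // stand-in for cudaHostAlloc
    }

    void release(std::size_t size, void* p) { free_list.emplace(size, p); }

    ~PinnedPool() {                               // shutdown: free everything
        for (auto& kv : free_list) std::free(kv.second);
    }
};
```

Property 17 falls out directly: an allocate-free-allocate sequence of the same size touches the underlying allocator only once.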
Requirement: Pipeline Integration
As a developer, I want to chain multiple operators into a complete processing pipeline, so that I can build end-to-end image processing workflows.
Scenarios
Scenario: Pipeline topology
- WHEN building a pipeline
- THEN the Pipeline SHALL allow operators to be connected in sequence or parallel branches
Scenario: Automatic buffer allocation
- WHEN executing a pipeline
- THEN the Pipeline SHALL automatically manage intermediate buffer allocation
Scenario: Shared output for multiple dependents
- WHEN the same intermediate result is used by multiple downstream operators
- THEN the Pipeline SHALL avoid redundant computation
Scenario: Runtime parameter configuration
- WHEN configuring operators
- THEN the Pipeline SHALL support runtime configuration of operator parameters without rebuilding the graph
Scenario: Batch processing
- WHEN processing video streams
- THEN the Pipeline SHALL support batch processing of multiple frames
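The "shared output for multiple dependents" scenario reduces to memoizing each node's result; the sketch below shows the rule in miniature, with string node names and a std::function-based operator type that are illustrative only, not the real Pipeline API:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Each node's output is cached on first execution, so a node with several
// downstream dependents runs exactly once and all dependents read the same
// buffer.
struct MiniPipeline {
    using Op = std::function<std::vector<float>(const std::vector<float>&)>;
    std::map<std::string, Op> ops;
    std::map<std::string, std::string> input_of;      // node -> upstream node
    std::map<std::string, std::vector<float>> cache;  // memoized outputs

    const std::vector<float>& run(const std::string& node,
                                  const std::vector<float>& frame) {
        auto hit = cache.find(node);
        if (hit != cache.end()) return hit->second;   // reuse shared output
        const std::vector<float>& in =
            input_of.count(node) ? run(input_of[node], frame) : frame;
        return cache[node] = ops[node](in);
    }
};
```

With a blur node feeding both a Sobel branch and a resize branch, the blur executes once per frame regardless of how many branches pull from it.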
Properties
Operator Properties
Property 1: Gaussian Blur Multi-Channel Support For any valid image with 1, 3, or 4 channels and for any kernel size (3x3, 5x5, 7x7), applying Gaussian blur SHALL produce an output image with the same dimensions and channel count as the input. Validates: Requirements Gaussian Blur Operator scenarios 1, 5
Property 2: Separable Filter Equivalence For any valid image and Gaussian kernel, the separable filter implementation (two 1D passes) SHALL produce results equivalent to direct 2D convolution within floating-point tolerance (epsilon < 1e-5). Validates: Requirements Gaussian Blur Operator scenario 2
Property 3: Reflection Padding Boundary Handling For any valid image, applying Gaussian blur SHALL produce valid pixel values at all boundary positions (no NaN, no out-of-range values), and boundary pixels SHALL reflect the expected reflection padding behavior. Validates: Requirements Gaussian Blur Operator scenario 4
Property 4: Sobel Gradient Computation For any image with a known edge pattern, the Sobel operator SHALL compute gradient magnitude as sqrt(Gx² + Gy²) where Gx and Gy are computed using standard 3x3 Sobel kernels. Validates: Requirements Sobel Edge Detection Operator scenarios 1, 2
Property 5: Sobel Single-Channel Output For any input image regardless of channel count, the Sobel operator SHALL produce a single-channel output image with the same width and height as the input. Validates: Requirements Sobel Edge Detection Operator scenario 4
Property 6: Resize Coordinate Mapping For any resize operation with bilinear or nearest-neighbor interpolation, the output pixel at (dst_x, dst_y) SHALL be computed from input coordinates (src_x, src_y) where src_x = dst_x * (src_width / dst_width) and src_y = dst_y * (src_height / dst_height), with both ratios computed in floating point. Validates: Requirements Resize Operator scenarios 1, 2, 3
Property 7: Resize Arbitrary Scale Factors For any scale factor s > 0 (both upscaling s > 1 and downscaling s < 1), the resize operator SHALL produce an output image with dimensions (input_width * s, input_height * s) rounded to integers. Validates: Requirements Resize Operator scenario 4
Property 8: RGB to Grayscale Formula For any RGB pixel (R, G, B), the grayscale conversion SHALL produce Y = 0.299*R + 0.587*G + 0.114*B within floating-point tolerance. Validates: Requirements Color Conversion Operator scenario 2
Property 9: BGR to RGB Channel Swap For any BGR image, converting to RGB SHALL swap the first and third channels such that output[0] = input[2] and output[2] = input[0], with the middle channel unchanged. Validates: Requirements Color Conversion Operator scenario 3
Property 10: Alpha Channel Preservation For any RGBA image undergoing color conversion, the alpha channel value SHALL be preserved unchanged in the output. Validates: Requirements Color Conversion Operator scenario 4
Scheduler Properties
Property 11: DAG Cycle Detection For any task graph, adding an edge that would create a cycle SHALL be rejected, and the graph SHALL remain in a valid acyclic state. Validates: Requirements DAG Task Scheduler scenario 1
Property 12: Dependency Ordering For any valid task graph execution, for all tasks T with dependencies D1, D2, …, Dn, task T SHALL only begin execution after ALL of D1, D2, …, Dn have completed. Validates: Requirements DAG Task Scheduler scenarios 2, 4, 6
Property 13: Error Propagation For any task graph where task T fails, all tasks that depend (directly or transitively) on T SHALL NOT execute, and the error SHALL be propagated to the caller. Validates: Requirements DAG Task Scheduler scenario 5
Property 14: Stream Assignment and Synchronization For any pair of independent tasks (no dependency path between them), they SHALL be assignable to different CUDA streams. For any dependent tasks in different streams, proper CUDA event synchronization SHALL be inserted. Validates: Requirements CUDA Streams Concurrency scenarios 1, 2
Property 15: Stream Synchronization on Completion For any pipeline execution, after execute() returns, all output buffers SHALL contain valid, fully computed results (no partial writes, no race conditions). Validates: Requirements CUDA Streams Concurrency scenario 5
Memory Properties
Property 16: Pinned Memory Async Transfer For any data transfer using the Memory Manager, the transfer SHALL complete correctly and the destination buffer SHALL contain an exact copy of the source data. Validates: Requirements Pinned Memory Management scenarios 1, 2
Property 17: Memory Pool Reuse For any sequence of allocate-free-allocate operations of the same size, the memory pool SHALL reuse previously freed blocks, and the total number of cudaHostAlloc calls SHALL be less than the number of allocate requests. Validates: Requirements Pinned Memory Management scenario 4
Property 18: Memory Cleanup For any pipeline lifecycle (create, execute, shutdown), after shutdown() is called, all pinned memory allocations SHALL be freed (no memory leaks). Validates: Requirements Pinned Memory Management scenario 5
Pipeline Properties
Property 19: Pipeline Topology and Buffer Management For any valid pipeline topology (sequential or parallel branches), the pipeline SHALL automatically allocate intermediate buffers of correct size for all connections. Validates: Requirements Pipeline Integration scenarios 1, 2
Property 20: No Redundant Computation For any task node with multiple downstream dependents, the task SHALL execute exactly once, and all dependents SHALL receive the same output buffer reference. Validates: Requirements Pipeline Integration scenario 3
Property 21: Runtime Parameter Configuration For any operator parameter change at runtime, the next pipeline execution SHALL use the updated parameter value without requiring graph reconstruction. Validates: Requirements Pipeline Integration scenario 4
Property 22: Batch Processing For any batch of N frames, the pipeline SHALL process all N frames and produce N corresponding output frames with correct results. Validates: Requirements Pipeline Integration scenario 5
Error Handling
Operator Errors
| Error Condition | Handling Strategy |
|---|---|
| Invalid input dimensions (width/height ≤ 0) | Return cudaErrorInvalidValue, log error |
| Unsupported channel count | Return cudaErrorInvalidValue, log error |
| CUDA kernel launch failure | Return CUDA error code, log kernel name and parameters |
| Device memory allocation failure | Return cudaErrorMemoryAllocation, attempt cleanup |
Scheduler Errors
| Error Condition | Handling Strategy |
|---|---|
| Cycle detected in DAG | Reject edge addition, return false, log cycle path |
| Task execution failure | Mark task as FAILED, propagate to dependents, invoke error callback |
| Stream creation failure | Fall back to fewer streams, log warning |
| Event synchronization failure | Return CUDA error, halt execution |
Memory Manager Errors
| Error Condition | Handling Strategy |
|---|---|
| Pinned memory allocation failure | Fall back to pageable memory, log warning |
| Device memory allocation failure | Return nullptr, log error with requested size |
| Invalid free (double free, invalid pointer) | Log error, ignore operation |
| Async copy failure | Return CUDA error code, do not retry |
Pipeline Errors
| Error Condition | Handling Strategy |
|---|---|
| Invalid operator connection | Reject connection, return error |
| Buffer size mismatch | Reallocate buffer, log warning |
| Batch size exceeds maximum | Process in chunks, log info |