RFC 0004: Stream Manager and Concurrency System
Status
Status: Accepted
Created: 2024
Last Updated: 2024
Overview
Design a stream management system for concurrent CUDA stream operations, enabling overlapping data transfers with computation and supporting multiple concurrent inference requests.
Motivation
Overlap transfer and compute: while one batch computes, transfer the next batch
Multi-request serving: handle multiple inference requests concurrently
Resource isolation: use separate streams for different operations
Synchronization control: fine-grained control over execution order
Design
Stream Hierarchy
StreamManager
├── compute_stream: Default GEMM execution
├── copy_stream_0: H2D transfers (stream 1)
├── copy_stream_1: D2H transfers (stream 2)
└── extra_streams: User-created streams (up to 16)
API Design
#include <cuda_runtime.h>  // cudaStream_t, cudaEvent_t

#include <cstddef>
#include <vector>

enum class StreamPriority {
    LOW,
    NORMAL,
    HIGH
};

class StreamManager {
public:
    explicit StreamManager(int max_streams = 8);
    ~StreamManager();

    // Stream creation
    cudaStream_t create_stream(StreamPriority priority = StreamPriority::NORMAL);

    // Stream access
    cudaStream_t get_compute_stream() const;
    cudaStream_t get_copy_stream(int index = 0) const;

    // Synchronization
    void sync_stream(cudaStream_t stream);
    void sync_all_streams();

    // Stream ordering
    void insert_event(cudaStream_t stream, cudaEvent_t event);
    void wait_event(cudaStream_t stream, cudaEvent_t event);

    // Statistics
    int active_streams() const;
    size_t total_streams_created() const;

    // Cleanup
    void destroy_stream(cudaStream_t stream);

private:
    std::vector<cudaStream_t> streams_;
    int max_streams_;
    int next_stream_id_;
};
Concurrent Execution Pattern
Stream 0 (compute): [GEMM batch 1] ---- [GEMM batch 2] ----
Stream 1 (H2D):     [Copy batch 2] ---- [Copy batch 3] ----
Stream 2 (D2H):     ---------------- [Results batch 1] ----
Timeline →
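The gain from this pattern can be estimated with a small timing model. The sketch below is only an illustration, not part of the proposed API: the names `StageTimes`, `serial_makespan`, and `pipelined_makespan` are hypothetical, stage durations are assumed constant across batches, and enough device buffers are assumed to exist that a copy never has to wait for an earlier compute to release memory.

```cpp
#include <algorithm>

// Durations of the three stages for one batch, in arbitrary time units.
// Hypothetical type used only for this estimate.
struct StageTimes {
    double h2d;   // host-to-device copy
    double gemm;  // compute kernel
    double d2h;   // device-to-host copy
};

// Total time when every stage of every batch runs back to back on one stream.
double serial_makespan(const StageTimes& t, int batches) {
    return batches * (t.h2d + t.gemm + t.d2h);
}

// Total time when each stage has its own in-order stream. A stage for batch i
// waits on (a) the previous batch in the same stream and (b) the previous
// stage of the same batch, which is what an event dependency would enforce.
double pipelined_makespan(const StageTimes& t, int batches) {
    double h2d_end = 0.0, gemm_end = 0.0, d2h_end = 0.0;
    for (int i = 0; i < batches; ++i) {
        h2d_end += t.h2d;                                 // copies queue one after another
        gemm_end = std::max(h2d_end, gemm_end) + t.gemm;  // needs its input and the prior GEMM
        d2h_end = std::max(gemm_end, d2h_end) + t.d2h;    // needs its result and the prior D2H
    }
    return d2h_end;
}
```

With stage times of 2, 4, and 1 units and three batches, the model gives 21 units serially and 15 units pipelined, a ratio of 1.4x. Real speedups depend on how the copy and compute durations compare and on how many copy engines the device has.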
Stream Priority Mapping
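| Priority | CUDA Priority | Use Case |
|---|---|---|
| HIGH | Greatest priority reported by cudaDeviceGetStreamPriorityRange (numerically lowest value) | Compute kernels |
| NORMAL | 0 (the default stream priority) | Data transfers |
| LOW | Least priority reported by cudaDeviceGetStreamPriorityRange (numerically highest value) | Logging, profiling |

CUDA expresses stream priorities as integers passed to cudaStreamCreateWithPriority, not as named constants; lower numbers mean higher priority.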
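One way to turn the three levels into a value for `cudaStreamCreateWithPriority` is sketched below. CUDA stream priorities are integers, lower numbers mean higher priority, and the supported range comes from `cudaDeviceGetStreamPriorityRange`. The function name `to_cuda_priority` is hypothetical, and the range is passed in as plain integers so the mapping can be read and tested without a device.

```cpp
#include <algorithm>

enum class StreamPriority { LOW, NORMAL, HIGH };

// `least` and `greatest` are the two values reported by
// cudaDeviceGetStreamPriorityRange; numerically, greatest <= least.
// Hypothetical helper, not part of the RFC's public API.
int to_cuda_priority(StreamPriority p, int least, int greatest) {
    switch (p) {
        case StreamPriority::HIGH: return greatest;  // numerically lowest value
        case StreamPriority::LOW:  return least;     // numerically highest value
        case StreamPriority::NORMAL: break;
    }
    // NORMAL uses the default priority 0, kept inside the supported range.
    return std::max(greatest, std::min(0, least));
}
```

If a device reports a least priority of 0, NORMAL and LOW map to the same value, so the three levels are not guaranteed to be distinct on every device.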
Synchronization Guarantees
| Operation | Blocking | Notes |
|---|---|---|
| sync_stream() | Yes | Waits for all work on the stream |
| sync_all_streams() | Yes | Waits for all managed streams |
| insert_event() | No | Non-blocking event record |
| wait_event() | Conditional | Blocks the stream until the event is reached |
Error Handling
| Condition | Behavior |
|---|---|
| Max streams exceeded | Throws std::runtime_error |
| Invalid stream handle | Throws std::invalid_argument |
| CUDA error in stream creation | Throws CudaException |
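The first two rows can be illustrated with a small bookkeeping sketch. It is not the proposed implementation: `StreamRegistry` is a hypothetical helper, the `Handle` template parameter stands in for `cudaStream_t`, and the actual create and destroy calls are injected as callbacks so the limit and handle checks can be exercised without a GPU. A CUDA failure inside the create callback would be where `CudaException` is raised.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical helper: tracks live stream handles and enforces the limit.
template <typename Handle>
class StreamRegistry {
public:
    StreamRegistry(std::size_t max_streams,
                   std::function<Handle()> create,
                   std::function<void(Handle)> destroy)
        : max_streams_(max_streams),
          create_(std::move(create)),
          destroy_(std::move(destroy)) {}

    Handle create_stream() {
        if (streams_.size() >= max_streams_) {
            throw std::runtime_error("max streams exceeded");
        }
        Handle h = create_();  // a real backend would call into CUDA here
        streams_.push_back(h);
        ++total_created_;
        return h;
    }

    void destroy_stream(Handle h) {
        auto it = std::find(streams_.begin(), streams_.end(), h);
        if (it == streams_.end()) {
            throw std::invalid_argument("invalid stream handle");
        }
        destroy_(h);
        streams_.erase(it);
    }

    std::size_t active_streams() const { return streams_.size(); }
    std::size_t total_streams_created() const { return total_created_; }

private:
    std::size_t max_streams_;
    std::function<Handle()> create_;
    std::function<void(Handle)> destroy_;
    std::vector<Handle> streams_;       // handles that are currently alive
    std::size_t total_created_ = 0;     // never decremented on destroy
};
```

Checking the limit before calling the backend means a rejected request never creates a stream that would then need to be cleaned up.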
Performance Targets
| Scenario | Target | Metric |
|---|---|---|
| Single stream | 1.0x | Baseline |
| 2 streams (overlap) | 1.3-1.5x | Compute + transfer overlap |
| 4 streams (batching) | 1.8-2.0x | Throughput vs single stream |
Testing Strategy
Concurrent execution: verify overlap actually occurs
Synchronization correctness: no race conditions
Resource cleanup: no leaked streams or events
Performance measurement: actual speedup from concurrency
Stress test: max streams, rapid create/destroy cycles
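For the concurrent-execution check, one possible approach is to timestamp the start and end of a copy and a kernel relative to a shared reference event (for example with `cudaEventElapsedTime`, which reports milliseconds as a `float`) and assert that the two intervals intersect. The helper below is a sketch of that comparison only; `Interval` and `overlap_ms` are hypothetical names, and collecting the timestamps is left to the test harness.

```cpp
#include <algorithm>

// Start and end of one operation in milliseconds, measured from a shared
// reference point. Hypothetical type for test code.
struct Interval {
    float start_ms;
    float end_ms;
};

// Length of time during which both operations were in flight;
// 0 if they never ran at the same time.
float overlap_ms(const Interval& a, const Interval& b) {
    float begin = std::max(a.start_ms, b.start_ms);
    float end = std::min(a.end_ms, b.end_ms);
    return std::max(0.0f, end - begin);
}
```

A test could then require `overlap_ms(copy, gemm) > 0` for the two-stream configuration, with a tolerance chosen to absorb timer resolution.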
Implementation Files
include/stream_manager.h - StreamManager class
src/stream_manager.cu - Implementation
tests/test_stream_manager.cpp - Unit tests