RFC 0004: Stream Manager and Concurrency System

Status

Status: Accepted
Created: 2024
Last Updated: 2024

Overview

Design a stream management system for concurrent CUDA stream operations, enabling overlapping data transfers with computation and supporting multiple concurrent inference requests.

Motivation

  1. Overlap transfer and compute: While one batch computes, transfer the next batch
  2. Multi-request serving: Handle multiple inference requests concurrently
  3. Resource isolation: Separate streams for different operations
  4. Synchronization control: Fine-grained control over execution order

Design

Stream Hierarchy

StreamManager
├── compute_stream: Default GEMM execution
├── copy_stream_0: H2D transfers (stream 1)
├── copy_stream_1: D2H transfers (stream 2)
└── extra_streams: User-created streams (up to 16)
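A constructor sketch for this hierarchy could look as follows. The member names `compute_stream_` and `copy_streams_`, and the `CUDA_CHECK` macro (which throws on any non-`cudaSuccess` result), are assumptions not fixed by this RFC:

```cpp
// Sketch: creating the fixed streams from the hierarchy above.
// Non-blocking streams avoid implicit synchronization with the
// legacy default stream.
StreamManager::StreamManager(int max_streams)
    : max_streams_(max_streams), next_stream_id_(0) {
    CUDA_CHECK(cudaStreamCreateWithFlags(&compute_stream_, cudaStreamNonBlocking));
    CUDA_CHECK(cudaStreamCreateWithFlags(&copy_streams_[0], cudaStreamNonBlocking));
    CUDA_CHECK(cudaStreamCreateWithFlags(&copy_streams_[1], cudaStreamNonBlocking));
}
```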

API Design

enum class StreamPriority {
    LOW,
    NORMAL,
    HIGH
};

class StreamManager {
public:
    explicit StreamManager(int max_streams = 8);
    ~StreamManager();

    // Stream creation
    cudaStream_t create_stream(StreamPriority priority = StreamPriority::NORMAL);

    // Stream access
    cudaStream_t get_compute_stream() const;
    cudaStream_t get_copy_stream(int index = 0) const;

    // Synchronization
    void sync_stream(cudaStream_t stream);
    void sync_all_streams();

    // Stream ordering
    void insert_event(cudaStream_t stream, cudaEvent_t event);
    void wait_event(cudaStream_t stream, cudaEvent_t event);

    // Statistics
    int active_streams() const;
    size_t total_streams_created() const;

    // Cleanup
    void destroy_stream(cudaStream_t stream);

private:
    std::vector<cudaStream_t> streams_;
    int max_streams_;
    int next_stream_id_;
};
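A possible implementation of `create_stream` is sketched below. CUDA exposes stream priorities as a device-dependent numeric range rather than named constants, so the sketch queries the range at runtime; `CUDA_CHECK` is an assumed error-checking macro that throws on failure:

```cpp
// Sketch: mapping StreamPriority onto CUDA's numeric priority range.
// Numerically lower values run at higher priority.
cudaStream_t StreamManager::create_stream(StreamPriority priority) {
    if (static_cast<int>(streams_.size()) >= max_streams_) {
        throw std::runtime_error("StreamManager: max streams exceeded");
    }
    int least = 0, greatest = 0;  // e.g. 0 and -5 on many devices
    CUDA_CHECK(cudaDeviceGetStreamPriorityRange(&least, &greatest));

    int cuda_priority = 0;  // 0 is the default priority
    switch (priority) {
        case StreamPriority::HIGH:   cuda_priority = greatest; break;
        case StreamPriority::LOW:    cuda_priority = least;    break;
        case StreamPriority::NORMAL: cuda_priority = 0;        break;
    }

    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreateWithPriority(&stream, cudaStreamNonBlocking,
                                            cuda_priority));
    streams_.push_back(stream);
    ++next_stream_id_;
    return stream;
}
```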

Concurrent Execution Pattern

Stream 0 (compute):  [GEMM batch 1] ---- [GEMM batch 2] ----
Stream 1 (H2D):      [Copy batch 2] ---- [Copy batch 3] ----
Stream 2 (D2H):      ---------------- [Results batch 1] ----

Timeline →
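The timeline above can be driven by a double-buffered loop like the sketch below. Buffer, event, and kernel names are illustrative; the host buffers must be pinned (`cudaMallocHost`) or the async copies will not overlap with compute:

```cpp
// Sketch: double-buffered pipeline matching the timeline above.
for (int batch = 0; batch < num_batches; ++batch) {
    int buf = batch % 2;  // ping-pong between two device buffers

    // Before reusing a buffer, wait for the compute that last read it
    // (a wait on a never-recorded event completes immediately).
    cudaStreamWaitEvent(copy_h2d, computed[buf], 0);
    cudaMemcpyAsync(d_in[buf], h_in + batch * batch_bytes, batch_bytes,
                    cudaMemcpyHostToDevice, copy_h2d);

    // Compute may not start until this batch's upload has finished.
    cudaEventRecord(copied[buf], copy_h2d);
    cudaStreamWaitEvent(compute, copied[buf], 0);
    gemm_kernel<<<grid, block, 0, compute>>>(d_in[buf], d_out[buf]);

    // Drain results on the D2H stream once compute is done.
    cudaEventRecord(computed[buf], compute);
    cudaStreamWaitEvent(copy_d2h, computed[buf], 0);
    cudaMemcpyAsync(h_out + batch * batch_bytes, d_out[buf], batch_bytes,
                    cudaMemcpyDeviceToHost, copy_d2h);
}
cudaStreamSynchronize(copy_d2h);  // all results are on the host after this
```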

Stream Priority Mapping

| Priority | CUDA mapping | Use Case |
|----------|--------------|----------|
| HIGH | greatest value from `cudaDeviceGetStreamPriorityRange` | Compute kernels |
| NORMAL | 0 (default priority) | Data transfers |
| LOW | least value from `cudaDeviceGetStreamPriorityRange` | Logging, profiling |

CUDA does not define named priority constants; the valid range is queried at runtime via `cudaDeviceGetStreamPriorityRange`, and numerically lower values run at higher priority.

Synchronization Guarantees

| Operation | Blocks host? | Notes |
|-----------|--------------|-------|
| `sync_stream()` | Yes | Waits for all work on the stream |
| `sync_all_streams()` | Yes | Waits for all managed streams |
| `insert_event()` | No | Non-blocking event record |
| `wait_event()` | No | The stream stalls on the GPU until the event completes; the host returns immediately |
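The two event operations could reduce directly to the CUDA event API, as in this sketch (`CUDA_CHECK` is an assumed error-checking macro):

```cpp
// Sketch: insert_event / wait_event expressed with the CUDA event API.
void StreamManager::insert_event(cudaStream_t stream, cudaEvent_t event) {
    // Records the event after all work currently queued on `stream`;
    // returns to the host immediately.
    CUDA_CHECK(cudaEventRecord(event, stream));
}

void StreamManager::wait_event(cudaStream_t stream, cudaEvent_t event) {
    // The host returns immediately; work queued on `stream` after this
    // call will not run on the GPU until `event` has completed.
    CUDA_CHECK(cudaStreamWaitEvent(stream, event, 0));
}
```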

Error Handling

| Condition | Behavior |
|-----------|----------|
| Max streams exceeded | Throws `std::runtime_error` |
| Invalid stream handle | Throws `std::invalid_argument` |
| CUDA error in stream creation | Throws `CudaException` |
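One way to define the `CudaException` type and a check macro to go with it is sketched below; the RFC only specifies that `CudaException` is thrown, so the exact shape here is an assumption:

```cpp
// Sketch: an exception type wrapping a CUDA error code, plus a macro
// that converts any failing runtime call into a throw.
#include <stdexcept>
#include <string>

class CudaException : public std::runtime_error {
public:
    explicit CudaException(cudaError_t err)
        : std::runtime_error(std::string("CUDA error: ") +
                             cudaGetErrorString(err)),
          code_(err) {}
    cudaError_t code() const { return code_; }
private:
    cudaError_t code_;
};

#define CUDA_CHECK(call)               \
    do {                               \
        cudaError_t err_ = (call);     \
        if (err_ != cudaSuccess) {     \
            throw CudaException(err_); \
        }                              \
    } while (0)
```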

Performance Targets

| Scenario | Target | Metric |
|----------|--------|--------|
| Single stream | 1.0x | Baseline |
| 2 streams (overlap) | 1.3-1.5x | Compute + transfer overlap |
| 4 streams (batching) | 1.8-2.0x | Throughput vs. single stream |

Testing Strategy

  1. Concurrent execution: Verify overlap actually occurs
  2. Synchronization correctness: No race conditions
  3. Resource cleanup: No leaked streams or events
  4. Performance measurement: Actual speedup from concurrency
  5. Stress test: Max streams, rapid create/destroy cycles
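Test ideas 1 and 4 can share one measurement harness: time the same work serially and concurrently with CUDA events, and assert that the concurrent wall time is meaningfully lower. A sketch (kernel and buffer names are placeholders):

```cpp
// Sketch: measuring whether two streams actually overlap.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
busy_kernel<<<grid, block, 0, stream_a>>>(d_work);          // compute work
cudaMemcpyAsync(d_buf, h_buf, bytes,                        // transfer work
                cudaMemcpyHostToDevice, stream_b);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float concurrent_ms = 0.f;
cudaEventElapsedTime(&concurrent_ms, start, stop);
// Compare against serial_ms measured the same way on a single stream,
// e.g. EXPECT_LT(concurrent_ms, serial_ms * 0.9f).

cudaEventDestroy(start);
cudaEventDestroy(stop);
```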

Implementation Files

  • include/stream_manager.h - StreamManager class
  • src/stream_manager.cu - Implementation
  • tests/test_stream_manager.cpp - Unit tests
