RFC 0002: Memory Pool Subsystem

Status

Status: Accepted
Created: 2024
Last Updated: 2024
Supersedes: None

Overview

Design a memory pool subsystem for the Mini-Inference Engine to reduce CUDA memory allocation overhead, prevent memory fragmentation, and provide deterministic memory management for inference workloads.

Motivation

Frequent cudaMalloc/cudaFree calls cause:

  1. High overhead (~10-50 µs per allocation)
  2. Memory fragmentation during variable tensor lifetimes
  3. Inability to track memory usage statistics
  4. No protection against out-of-memory conditions

Design

Architecture

MemoryPool
├── Capacity: Fixed size (default 256MB)
├── Free List: Sorted by address for coalescing
├── Allocation Tracking: Used/Free statistics
└── Block Header: size, is_free flag (free-list links tracked host-side)

Memory Block Layout

+---------------------------------------------------------------+
| Block Header (16 bytes)                                       |
+-------------------+-------------------+-----------------------+
| size (8 bytes)    | is_free (1 byte)  | reserved (7 bytes)    |
+-------------------+-------------------+-----------------------+
| User Data (size bytes, 256-byte aligned)                      |
+---------------------------------------------------------------+

API Design

#include <cstddef>  // size_t
#include <vector>

class MemoryPool {
public:
    explicit MemoryPool(size_t capacity_bytes);
    ~MemoryPool();

    // Non-copyable, movable
    MemoryPool(const MemoryPool&) = delete;
    MemoryPool& operator=(const MemoryPool&) = delete;
    MemoryPool(MemoryPool&&) noexcept;
    MemoryPool& operator=(MemoryPool&&) noexcept;

    // Allocation
    float* allocate(size_t bytes);

    // Deallocation
    void deallocate(float* ptr);

    // Statistics
    size_t capacity() const;
    size_t used() const;
    size_t free_space() const;
    float utilization() const;
    float fragmentation_ratio() const;

    // Diagnostics
    void print_memory_map() const;
    void validate_integrity() const;

private:
    struct Block;
    Block* find_best_fit(size_t bytes);
    void coalesce_adjacent_blocks();
    void split_block(Block* block, size_t bytes);

    void* pool_ptr_;       // CUDA memory pool base
    size_t capacity_;
    size_t used_;
    std::vector<Block> blocks_;  // Free list (sorted by address)
};

Allocation Algorithm

  1. Best-fit: Find smallest free block that satisfies request
  2. Splitting: If block is significantly larger, split into allocated + remaining free
  3. Coalescing: Merge adjacent free blocks on deallocation
  4. Alignment: All allocations aligned to 256 bytes for CUDA coalescing

Error Handling

Condition                         Behavior
--------------------------------  ----------------------------
Request > capacity                Returns nullptr
No contiguous block large enough  Returns nullptr
Double free                       Throws std::runtime_error
Invalid pointer                   Throws std::invalid_argument

Memory Budget

Scenario                     Pool Size  Notes
---------------------------  ---------  -------------------
MNIST inference (batch=1)    4 MB       Minimal workload
MNIST inference (batch=256)  128 MB     Production workload
Large models                 512 MB+    Configurable

Testing Strategy

  1. Allocation correctness: Sequential, interleaved, random sizes
  2. Coalescing: Verify adjacent free blocks merge
  3. Fragmentation: Measure fragmentation ratio under various patterns
  4. Edge cases: Zero-size, max-size, boundary alignment
  5. Performance: Allocation/deallocation throughput vs raw cudaMalloc

Implementation Files

  • include/memory_pool.h - Header with MemoryPool class
  • src/memory_pool.cu - Implementation (template members remain inline in the header)
  • tests/test_memory_pool.cpp - Unit tests

Alternatives Considered

Alternative      Pros                  Cons                           Decision
---------------  --------------------  -----------------------------  ------------------------------
cudaMallocAsync  Built-in, fast        CUDA 11.2+ only, less control  Rejected (compatibility)
Bump allocator   Simplest, fastest     No deallocation support        Rejected (need reuse)
Slab allocator   Good for fixed sizes  Complex for variable tensors   Partial (used for small allocs)


MIT License | A learning project for the CUDA community