RFC 0002: Memory Pool Subsystem

Status

Status: Accepted
Created: 2024
Last Updated: 2024
Supersedes: None

Overview

Design a memory pool subsystem for the Mini-Inference Engine to reduce CUDA memory allocation overhead, prevent memory fragmentation, and provide deterministic memory management for inference workloads.

Motivation

Frequent cudaMalloc/cudaFree calls cause:

  1. High overhead (~10-50 µs per allocation)
  2. Memory fragmentation during variable tensor lifetimes
  3. Inability to track memory usage statistics
  4. No protection against out-of-memory conditions

Design

Architecture

MemoryPool
├── Capacity: Fixed size (default 256MB)
├── Free List: Sorted by address for coalescing
├── Allocation Tracking: Used/Free statistics
└── Block Header: size, is_free flag (free-list links tracked host-side)

Memory Block Layout

+---------------------------------------------------------------+
| Block Header (16 bytes)                                       |
+-------------------+-------------------+-----------------------+
| size (8 bytes)    | is_free (1 byte)  | reserved (7 bytes)    |
+-------------------+-------------------+-----------------------+
| User Data (size bytes, 256-byte aligned)                      |
+---------------------------------------------------------------+

API Design

#include <cstddef>  // size_t
#include <vector>

class MemoryPool {
public:
    explicit MemoryPool(size_t capacity_bytes);
    ~MemoryPool();

    // Non-copyable, movable
    MemoryPool(const MemoryPool&) = delete;
    MemoryPool& operator=(const MemoryPool&) = delete;
    MemoryPool(MemoryPool&&) noexcept;
    MemoryPool& operator=(MemoryPool&&) noexcept;

    // Allocation
    float* allocate(size_t bytes);

    // Deallocation
    void deallocate(float* ptr);

    // Statistics
    size_t capacity() const;
    size_t used() const;
    size_t free_space() const;
    float utilization() const;
    float fragmentation_ratio() const;

    // Diagnostics
    void print_memory_map() const;
    void validate_integrity() const;

private:
    struct Block;
    Block* find_best_fit(size_t bytes);
    void coalesce_adjacent_blocks();
    void split_block(Block* block, size_t bytes);

    void* pool_ptr_;       // CUDA memory pool base
    size_t capacity_;
    size_t used_;
    std::vector<Block> blocks_;  // Free list (sorted by address)
};

Allocation Algorithm

  1. Best-fit: Find smallest free block that satisfies request
  2. Splitting: If block is significantly larger, split into allocated + remaining free
  3. Coalescing: Merge adjacent free blocks on deallocation
  4. Alignment: All allocations aligned to 256 bytes for CUDA coalescing

Error Handling

Condition                         Behavior
--------------------------------  ----------------------------
Request > capacity                Returns nullptr
No contiguous block large enough  Returns nullptr
Double free                       Throws std::runtime_error
Invalid pointer                   Throws std::invalid_argument

Memory Budget

Scenario                     Pool Size  Notes
---------------------------  ---------  -------------------
MNIST inference (batch=1)    4 MB       Minimal workload
MNIST inference (batch=256)  128 MB     Production workload
Large models                 512 MB+    Configurable

Testing Strategy

  1. Allocation correctness: Sequential, interleaved, random sizes
  2. Coalescing: Verify adjacent free blocks merge
  3. Fragmentation: Measure fragmentation ratio under various patterns
  4. Edge cases: Zero-size, max-size, boundary alignment
  5. Performance: Allocation/deallocation throughput vs raw cudaMalloc

Implementation Files

  • include/memory_pool.h - Header with MemoryPool class
  • src/memory_pool.cu - Implementation (template members remain inline in the header)
  • tests/test_memory_pool.cpp - Unit tests

Alternatives Considered

Alternative      Pros                  Cons                           Decision
---------------  --------------------  -----------------------------  ------------------------------
cudaMallocAsync  Built-in, fast        CUDA 11.2+ only, less control  Rejected (compatibility)
Bump allocator   Simplest, fastest     No deallocation support        Rejected (need reuse)
Slab allocator   Good for fixed sizes  Complex for variable tensors   Partial (used for small allocs)


MIT License | A learning project for the CUDA community