RFC 0005: Auto-Tuner System

Status

Status: Accepted
Created: 2024
Last Updated: 2024

Overview

Design an auto-tuning system that automatically selects optimal GEMM kernel configurations for given matrix dimensions and GPU architectures, eliminating manual parameter tuning.

Motivation

  1. Architecture diversity: Different GPUs have different optimal block sizes
  2. Dimension sensitivity: Best config varies with M, N, K dimensions
  3. Eliminate guesswork: Automatic search replaces manual tuning
  4. Adaptability: Re-tune when workload characteristics change

Design

Tuning Space

The auto-tuner searches over these parameters:

| Parameter | Range | Description |
| --- | --- | --- |
| BLOCK_M | {32, 64, 128, 256} | Tile row size |
| BLOCK_N | {32, 64, 128, 256} | Tile column size |
| BLOCK_K | {8, 16, 32} | K-dimension tile size |
| THREADS_PER_BLOCK | {128, 256, 512} | Thread block size |
| KERNEL_VARIANT | {naive, tiled, optimized, vectorized} | Which kernel implementation to launch |
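Taken together, these ranges define 4 × 4 × 3 × 3 × 4 = 576 raw combinations before any pruning (the "48 combos" in the candidate table below counts only the 4 × 4 × 3 block shapes). A minimal sketch of enumerating the raw space; `Config` and `generate_all_configs` are illustrative names, not part of the RFC's API:

```cpp
#include <vector>

// Hypothetical sketch: enumerate the raw tuning space described above.
struct Config {
    int block_m, block_n, block_k, threads_per_block, kernel_variant;
};

std::vector<Config> generate_all_configs() {
    const int ms[]   = {32, 64, 128, 256};
    const int ns[]   = {32, 64, 128, 256};
    const int ks[]   = {8, 16, 32};
    const int tpbs[] = {128, 256, 512};
    std::vector<Config> out;
    for (int m : ms)
        for (int n : ns)
            for (int k : ks)
                for (int tpb : tpbs)
                    for (int v = 0; v < 4; ++v)  // naive .. vectorized
                        out.push_back({m, n, k, tpb, v});
    return out;  // 576 raw combinations
}
```

In practice many of these combinations would be pruned before benchmarking (e.g. shapes that exceed shared-memory or register limits), which is presumably how the candidate counts in the table below stay small.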

API Design

struct TuneResult {
    int block_m;
    int block_n;
    int block_k;
    int threads_per_block;
    int kernel_variant;
    float execution_time_ms;
    float gflops;
};

class AutoTuner {
public:
    explicit AutoTuner(StreamManager* streams = nullptr);

    // Tune for specific dimensions
    TuneResult tune(int M, int N, int K,
                    const std::vector<int>& candidates = {});

    // Get cached result (if exists)
    bool get_cached(int M, int N, int K, TuneResult& result);

    // Cache a result manually
    void cache(int M, int N, int K, const TuneResult& result);

    // Statistics
    size_t cache_size() const;
    float cache_hit_rate() const;
    size_t total_tunes() const;

    // Clear cache
    void clear_cache();

private:
    struct CacheKey {
        int M, N, K;
        bool operator<(const CacheKey& other) const;
        bool operator==(const CacheKey& other) const;
    };

    std::map<CacheKey, TuneResult> cache_;
    StreamManager* streams_;
    size_t total_tunes_;
    size_t cache_hits_;
};
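The `std::map` cache requires a strict weak ordering on `CacheKey`. One way to implement the two comparison operators declared above is lexicographic comparison over (M, N, K) via `std::tie`; this is a sketch, not something the RFC prescribes:

```cpp
#include <tuple>

// Sketch of the CacheKey ordering needed by std::map:
// lexicographic comparison over (M, N, K).
struct CacheKey {
    int M, N, K;
    bool operator<(const CacheKey& o) const {
        return std::tie(M, N, K) < std::tie(o.M, o.N, o.K);
    }
    bool operator==(const CacheKey& o) const {
        return M == o.M && N == o.N && K == o.K;
    }
};
```

`std::tie` keeps the ordering consistent with equality by construction, which avoids the classic bug of comparing only one field.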

Tuning Algorithm

1. Check cache for (M, N, K) → HIT: return immediately
2. Generate candidate configurations
3. For each candidate:
   a. Allocate test buffers
   b. Run kernel 10 times (warmup + measurement)
   c. Record median execution time
4. Select configuration with minimum time
5. Store in cache and return
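Step 3c can be sketched as a median-of-samples helper; the median, unlike the mean, rejects outliers such as a first-run cache-miss spike. `median_ms` is an illustrative name, not the RFC's API:

```cpp
#include <algorithm>
#include <vector>

// Sketch of step 3c: reduce repeated timing samples to their median,
// so a single outlier run does not skew the measurement.
float median_ms(std::vector<float> samples) {
    size_t mid = samples.size() / 2;
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
    return samples[mid];
}
```

On the GPU side the individual samples would typically come from CUDA event timing around each kernel launch.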

Cache Key Normalization

To maximize cache hits, each dimension is rounded up to the next power of 2 before forming the cache key:

normalize(M) = 2^ceil(log2(M))
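A minimal sketch of this rounding, assuming dimensions are at least 1 (`normalize` as a free function here is for illustration only):

```cpp
// Smallest power of two >= d, matching normalize(M) = 2^ceil(log2(M)).
// Assumes d >= 1; a bit-twiddling version could replace the loop.
int normalize(int d) {
    int p = 1;
    while (p < d) p <<= 1;
    return p;
}
```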

Candidate Generation Strategy

| M, N, K Range | Strategy | Candidates |
| --- | --- | --- |
| Small (<128) | Exhaustive | All 48 block-shape combos |
| Medium (128-512) | Greedy | 12 most promising |
| Large (>512) | Heuristic | 6 best-known configs |
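One plausible reading of this dispatch, assuming the strategy is keyed on the largest of the three dimensions (the RFC does not specify how mixed sizes are classified, so `pick_strategy` is purely illustrative):

```cpp
#include <algorithm>
#include <string>

// Hypothetical size-based dispatch mirroring the table above,
// keyed on the largest dimension; thresholds are <128, 128-512, >512.
std::string pick_strategy(int M, int N, int K) {
    int d = std::max(M, std::max(N, K));
    if (d < 128)  return "exhaustive";
    if (d <= 512) return "greedy";
    return "heuristic";
}
```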

Performance Budget

| Operation | Budget | Notes |
| --- | --- | --- |
| Tuning overhead | <500 ms | For small matrices |
| Cache lookup | <1 μs | O(log N) map lookup |
| Cache hit rate | >80% | After initial warmup |

Logging and Diagnostics

// AutoTuner logs tuning decisions
struct TuneLog {
    int M, N, K;
    int candidates_evaluated;
    TuneResult best;
    float time_spent_ms;
};

// Query tuning history
std::vector<TuneLog> get_tuning_history() const;

Testing Strategy

  1. Correctness: All candidates produce mathematically correct results
  2. Cache behavior: Hit/miss behavior under various access patterns
  3. Performance: Selected config within 5% of true optimum
  4. Edge cases: M=1, N=1, K=1; non-power-of-2; very large dimensions
  5. Stability: Repeated tuning of the same problem size produces consistent results
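Item 1 is typically checked by comparing each candidate's output against a naive reference GEMM within a floating-point tolerance; a sketch (all names illustrative):

```cpp
#include <cmath>
#include <vector>

// Sketch of the correctness check: elementwise comparison against a
// reference result, with a relative tolerance to absorb FP reordering.
bool allclose(const std::vector<float>& a, const std::vector<float>& b,
              float rtol = 1e-4f) {
    if (a.size() != b.size()) return false;
    for (size_t i = 0; i < a.size(); ++i)
        if (std::fabs(a[i] - b[i]) > rtol * (1.0f + std::fabs(b[i])))
            return false;
    return true;
}
```

An exact bitwise comparison would be too strict here: tiled and vectorized kernels reorder floating-point additions, so small differences from the reference are expected.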

Implementation Files

  • include/autotuner.h - AutoTuner class
  • src/autotuner.cu - Implementation
  • tests/test_autotuner.cpp - Unit tests (if added later)

Future Work

  • Persistent cache (save/load tuning database)
  • Online tuning (adjust parameters during execution)
  • Machine learning-based configuration selection
  • Transfer learning between GPUs


MIT License | A learning project for the CUDA community