RFC 0005: Auto-Tuner System

Status

Status: Accepted
Created: 2024
Last Updated: 2024

Overview

Design an auto-tuning system that automatically selects optimal GEMM kernel configurations for given matrix dimensions and GPU architectures, eliminating manual parameter tuning.

Motivation

  1. Architecture diversity: Different GPUs have different optimal block sizes
  2. Dimension sensitivity: Best config varies with M, N, K dimensions
  3. Eliminate guesswork: Automatic search replaces manual tuning
  4. Adaptability: Re-tune when workload characteristics change

Design

Tuning Space

The auto-tuner searches over these parameters:

| Parameter | Range | Description |
| --- | --- | --- |
| BLOCK_M | {32, 64, 128, 256} | Tile row size |
| BLOCK_N | {32, 64, 128, 256} | Tile column size |
| BLOCK_K | {8, 16, 32} | K-dimension tile size |
| THREADS_PER_BLOCK | {128, 256, 512} | Thread block size |
| KERNEL_VARIANT | {naive, tiled, optimized, vectorized} | Which kernel implementation to launch |
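Taken together, these ranges define 4 × 4 × 3 × 3 × 4 = 576 raw combinations before any pruning (the "48 combos" in the candidate table below counts only the 4 × 4 × 3 block shapes). A minimal sketch of enumerating the raw space; `Config` and `generate_all_configs` are illustrative names, not part of the RFC's API:

```cpp
#include <vector>

// Hypothetical sketch: enumerate the raw tuning space described above.
struct Config {
    int block_m, block_n, block_k, threads_per_block, kernel_variant;
};

std::vector<Config> generate_all_configs() {
    const int ms[]   = {32, 64, 128, 256};
    const int ns[]   = {32, 64, 128, 256};
    const int ks[]   = {8, 16, 32};
    const int tpbs[] = {128, 256, 512};
    std::vector<Config> out;
    for (int m : ms)
        for (int n : ns)
            for (int k : ks)
                for (int tpb : tpbs)
                    for (int v = 0; v < 4; ++v)  // naive .. vectorized
                        out.push_back({m, n, k, tpb, v});
    return out;  // 576 raw combinations
}
```

In practice many of these combinations would be pruned before benchmarking (e.g. shapes that exceed shared-memory or register limits), which is presumably how the candidate counts in the table below stay small.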

API Design

struct TuneResult {
    int block_m;
    int block_n;
    int block_k;
    int threads_per_block;
    int kernel_variant;
    float execution_time_ms;
    float gflops;
};

class AutoTuner {
public:
    explicit AutoTuner(StreamManager* streams = nullptr);

    // Tune for specific dimensions
    TuneResult tune(int M, int N, int K,
                    const std::vector<int>& candidates = {});

    // Get cached result (if exists)
    bool get_cached(int M, int N, int K, TuneResult& result);

    // Cache a result manually
    void cache(int M, int N, int K, const TuneResult& result);

    // Statistics
    size_t cache_size() const;
    float cache_hit_rate() const;
    size_t total_tunes() const;

    // Clear cache
    void clear_cache();

private:
    struct CacheKey {
        int M, N, K;
        bool operator<(const CacheKey& other) const;
        bool operator==(const CacheKey& other) const;
    };

    std::map<CacheKey, TuneResult> cache_;
    StreamManager* streams_;
    size_t total_tunes_;
    size_t cache_hits_;
};
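The `std::map` cache requires a strict weak ordering on `CacheKey`. One way to implement the two comparison operators declared above is lexicographic comparison over (M, N, K) via `std::tie`; this is a sketch, not something the RFC prescribes:

```cpp
#include <tuple>

// Sketch of the CacheKey ordering needed by std::map:
// lexicographic comparison over (M, N, K).
struct CacheKey {
    int M, N, K;
    bool operator<(const CacheKey& o) const {
        return std::tie(M, N, K) < std::tie(o.M, o.N, o.K);
    }
    bool operator==(const CacheKey& o) const {
        return M == o.M && N == o.N && K == o.K;
    }
};
```

`std::tie` keeps the ordering consistent with equality by construction, which avoids the classic bug of comparing only one field.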

Tuning Algorithm

1. Check cache for (M, N, K) → HIT: return immediately
2. Generate candidate configurations
3. For each candidate:
   a. Allocate test buffers
   b. Run kernel 10 times (warmup + measurement)
   c. Record median execution time
4. Select configuration with minimum time
5. Store in cache and return
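Step 3c can be sketched as a median-of-samples helper; the median, unlike the mean, rejects outliers such as a first-run cache-miss spike. `median_ms` is an illustrative name, not the RFC's API:

```cpp
#include <algorithm>
#include <vector>

// Sketch of step 3c: reduce repeated timing samples to their median,
// so a single outlier run does not skew the measurement.
float median_ms(std::vector<float> samples) {
    size_t mid = samples.size() / 2;
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
    return samples[mid];
}
```

On the GPU side the individual samples would typically come from CUDA event timing around each kernel launch.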

Cache Key Normalization

To maximize cache hits, each dimension is rounded up to the next power of 2 before forming the cache key:

normalize(M) = 2^ceil(log2(M))
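A minimal sketch of this rounding, assuming dimensions are at least 1 (`normalize` as a free function here is for illustration only):

```cpp
// Smallest power of two >= d, matching normalize(M) = 2^ceil(log2(M)).
// Assumes d >= 1; a bit-twiddling version could replace the loop.
int normalize(int d) {
    int p = 1;
    while (p < d) p <<= 1;
    return p;
}
```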

Candidate Generation Strategy

| M, N, K Range | Strategy | Candidates |
| --- | --- | --- |
| Small (<128) | Exhaustive | All 48 block-shape combos |
| Medium (128-512) | Greedy | 12 most promising |
| Large (>512) | Heuristic | 6 best-known configs |
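One plausible reading of this dispatch, assuming the strategy is keyed on the largest of the three dimensions (the RFC does not specify how mixed sizes are classified, so `pick_strategy` is purely illustrative):

```cpp
#include <algorithm>
#include <string>

// Hypothetical size-based dispatch mirroring the table above,
// keyed on the largest dimension; thresholds are <128, 128-512, >512.
std::string pick_strategy(int M, int N, int K) {
    int d = std::max(M, std::max(N, K));
    if (d < 128)  return "exhaustive";
    if (d <= 512) return "greedy";
    return "heuristic";
}
```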

Performance Budget

| Operation | Budget | Notes |
| --- | --- | --- |
| Tuning overhead | <500 ms | For small matrices |
| Cache lookup | <1 μs | O(log N) map lookup |
| Cache hit rate | >80% | After initial warmup |

Logging and Diagnostics

// AutoTuner logs tuning decisions
struct TuneLog {
    int M, N, K;
    int candidates_evaluated;
    TuneResult best;
    float time_spent_ms;
};

// Query tuning history
std::vector<TuneLog> get_tuning_history() const;

Testing Strategy

  1. Correctness: All candidates produce mathematically correct results
  2. Cache behavior: Hit/miss behavior under various access patterns
  3. Performance: Selected config within 5% of true optimum
  4. Edge cases: M=1, N=1, K=1; non-power-of-2; very large dimensions
  5. Stability: Repeated tuning of the same problem size produces consistent results
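Item 1 is typically checked by comparing each candidate's output against a naive reference GEMM within a floating-point tolerance; a sketch (all names illustrative):

```cpp
#include <cmath>
#include <vector>

// Sketch of the correctness check: elementwise comparison against a
// reference result, with a relative tolerance to absorb FP reordering.
bool allclose(const std::vector<float>& a, const std::vector<float>& b,
              float rtol = 1e-4f) {
    if (a.size() != b.size()) return false;
    for (size_t i = 0; i < a.size(); ++i)
        if (std::fabs(a[i] - b[i]) > rtol * (1.0f + std::fabs(b[i])))
            return false;
    return true;
}
```

An exact bitwise comparison would be too strict here: tiled and vectorized kernels reorder floating-point additions, so small differences from the reference are expected.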

Implementation Files

  • include/autotuner.h - AutoTuner class
  • src/autotuner.cu - Implementation
  • tests/test_autotuner.cpp - Unit tests (if added later)

Future Work

  • Persistent cache (save/load tuning database)
  • Online tuning (adjust parameters during execution)
  • Machine learning-based configuration selection
  • Transfer learning between GPUs


MIT License | A learning project for the CUDA community