Design an auto-tuning system that selects the fastest measured GEMM kernel configuration for the given matrix dimensions and GPU architecture, eliminating manual parameter tuning.
Motivation
Architecture diversity: Optimal block sizes differ from one GPU architecture to another
Dimension sensitivity: The best configuration varies with the M, N, and K dimensions
```cpp
struct TuneResult {
    int block_m;
    int block_n;
    int block_k;
    int threads_per_block;
    int kernel_variant;
    float execution_time_ms;
    float gflops;
};

class AutoTuner {
public:
    explicit AutoTuner(StreamManager* streams = nullptr);

    // Tune for specific dimensions
    TuneResult tune(int M, int N, int K, const std::vector<int>& candidates = {});

    // Get cached result (if exists)
    bool get_cached(int M, int N, int K, TuneResult& result);

    // Cache a result manually
    void cache(int M, int N, int K, const TuneResult& result);

    // Statistics
    size_t cache_size() const;
    float cache_hit_rate() const;
    size_t total_tunes() const;

    // Clear cache
    void clear_cache();

private:
    struct CacheKey {
        int M, N, K;
        bool operator<(const CacheKey& other) const;
        bool operator==(const CacheKey& other) const;
    };

    std::map<CacheKey, TuneResult> cache_;
    StreamManager* streams_;
    size_t total_tunes_;
    size_t cache_hits_;
};
```
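The declaration above does not specify how the cache orders its keys or how the hit rate is computed. As a minimal sketch, assuming lexicographic ordering on (M, N, K) and a hit rate defined as hits divided by lookups (neither is confirmed by the design), the cache portion might look like this. `TuneCache`, `lookups_`, `hits_`, and the trimmed `TuneResult` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <tuple>

// Trimmed stand-in for the TuneResult struct above (illustration only).
struct TuneResult {
    int block_m = 0, block_n = 0, block_k = 0;
    float execution_time_ms = 0.0f;
};

// Hypothetical standalone version of AutoTuner's private cache.
class TuneCache {
public:
    // Returns true and fills `result` on a hit; every call counts as a lookup.
    bool get_cached(int M, int N, int K, TuneResult& result) {
        assert(M > 0 && N > 0 && K > 0);
        ++lookups_;
        auto it = cache_.find(CacheKey{M, N, K});
        if (it == cache_.end()) return false;
        ++hits_;
        result = it->second;
        return true;
    }

    // Inserts or overwrites the entry for (M, N, K).
    void cache(int M, int N, int K, const TuneResult& result) {
        cache_[CacheKey{M, N, K}] = result;
    }

    std::size_t cache_size() const { return cache_.size(); }

    // Assumption: hit rate = hits / lookups, and 0 before any lookup.
    float cache_hit_rate() const {
        return lookups_ == 0
                   ? 0.0f
                   : static_cast<float>(hits_) / static_cast<float>(lookups_);
    }

private:
    struct CacheKey {
        int M, N, K;
        // Lexicographic comparison gives std::map a strict weak ordering.
        bool operator<(const CacheKey& o) const {
            return std::tie(M, N, K) < std::tie(o.M, o.N, o.K);
        }
    };

    std::map<CacheKey, TuneResult> cache_;
    std::size_t lookups_ = 0;
    std::size_t hits_ = 0;
};
```

Comparing through `std::tie` keeps `operator<` consistent across all three fields, which is easy to get wrong when the comparison is written by hand.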
Tuning Algorithm
1. Check cache for (M, N, K) → HIT: return immediately
2. Generate candidate configurations
3. For each candidate:
   a. Allocate test buffers
   b. Run kernel 10 times (warmup + measurement)
   c. Record median execution time
4. Select configuration with minimum time
5. Store in cache and return
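Steps 3 and 4 can be sketched on the host side as follows. This is an illustration under stated assumptions, not the design's actual implementation: `Candidate`, `TimeOnce`, `median_ms`, `pick_fastest`, and `gemm_gflops` are hypothetical names, the 10 runs are assumed to split into 3 warmup and 7 measured runs (the steps give only the total), and buffer allocation plus the kernel launch are hidden behind a caller-supplied timing callback. The GFLOPS helper uses the standard 2·M·N·K operation count for GEMM.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical candidate description; field names mirror TuneResult.
struct Candidate {
    int block_m, block_n, block_k;
};

// Caller-supplied timer: runs the kernel once with the given candidate and
// returns the elapsed time in milliseconds (for example, from CUDA events).
using TimeOnce = std::function<float(const Candidate&)>;

// Median of the samples (mean of the two middle values for an even count).
inline float median_ms(std::vector<float> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return samples.size() % 2 == 1 ? samples[mid]
                                   : 0.5f * (samples[mid - 1] + samples[mid]);
}

// GEMM performs 2*M*N*K floating-point operations; convert ms to GFLOP/s.
inline float gemm_gflops(int M, int N, int K, float ms) {
    assert(ms > 0.0f);
    return static_cast<float>(2.0 * M * N * K / (static_cast<double>(ms) * 1e6));
}

// Steps 3-4: time every candidate and return the index of the fastest one.
// Assumption: 3 warmup runs are discarded, then 7 runs are measured.
inline std::size_t pick_fastest(const std::vector<Candidate>& candidates,
                                const TimeOnce& time_once,
                                int warmup_runs = 3, int measured_runs = 7) {
    assert(!candidates.empty() && measured_runs > 0);
    std::size_t best = 0;
    float best_ms = std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        for (int r = 0; r < warmup_runs; ++r) {
            time_once(candidates[i]);  // warmup result is discarded
        }
        std::vector<float> samples;
        samples.reserve(static_cast<std::size_t>(measured_runs));
        for (int r = 0; r < measured_runs; ++r) {
            samples.push_back(time_once(candidates[i]));
        }
        const float ms = median_ms(samples);
        if (ms < best_ms) {
            best_ms = ms;
            best = i;
        }
    }
    return best;
}
```

Using the median rather than the mean keeps a single slow outlier run (for example, one delayed by a clock ramp or a context switch) from skewing the comparison between candidates.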
Cache Key Normalization
To maximize cache hits, each dimension is rounded to the nearest power of 2: