Divide the matrices into small tiles and load them into shared memory so each element is reused many times from fast on-chip storage.
Global memory access reduction:
═════════════════════════════════════════════════════════════
Original: each element of A is read N times, each of B M times
Tiled:    each element is read only N/TILE_SIZE (or M/TILE_SIZE) times
═════════════════════════════════════════════════════════════
Reduction factor: TILE_SIZE times
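For reference, a minimal tiled kernel sketch is shown below. TILE_SIZE = 32 and the assumption that M, N, and K are multiples of TILE_SIZE are illustrative simplifications, not the exact kernel behind the measurements later in this post.

#define TILE_SIZE 32

// C = A * B for row-major A (M x K) and B (K x N).
// Assumes M, N, K are multiples of TILE_SIZE for brevity.
__global__ void tiled_gemm(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE_SIZE) {
        // Each thread loads one element of the A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * K + t + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();

        // Each loaded element is reused TILE_SIZE times from shared memory.
        for (int k = 0; k < TILE_SIZE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

Launch with dim3 grid(N / TILE_SIZE, M / TILE_SIZE) and dim3 block(TILE_SIZE, TILE_SIZE).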
Use padding to avoid shared memory bank conflicts:
// With bank conflict
__shared__ float smem[32][32];
float val = smem[threadIdx.x][0];   // all accesses land in bank 0

// No bank conflict: padding distributes accesses
__shared__ float smem[32][33];      // +1 column of padding
float val = smem[threadIdx.x][0];   // distributed across banks 0-31
Performance: ~30-40% of cuBLAS
Level 4: Double Buffering
Core Idea
Use two sets of shared memory buffers, prefetch next tile while computing current tile.
Timeline Comparison
Without Double Buffering:
═════════════════════════════════════════════════════════════
├─Load 0─┼─Comp 0─┼─Load 1─┼─Comp 1─┼─Load 2─┼─Comp 2─┤
Compute waits for load
With Double Buffering:
═════════════════════════════════════════════════════════════
├─Load 0─┼─────────────────────────────────────────────┤
├─Comp 0─┼─Comp 1─┼─Comp 2─┤
├─Load 1─┤
├─Load 2─┤
Load and compute overlap
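A sketch of how the two buffers are managed is shown below. Indexing is kept simple (square 32×32 tiles, one output element per thread), so this illustrates the buffering pattern rather than the exact kernel that was timed; how much the loads actually overlap with compute depends on the compiler hoisting the global loads ahead of the inner loop (or on cp.async on Ampere and newer).

#define TILE 32

// C = A * B, row-major; M, N, K assumed multiples of TILE for brevity.
__global__ void double_buffered_gemm(const float* A, const float* B, float* C,
                                     int M, int N, int K) {
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    int numTiles = K / TILE;
    int buf = 0;

    // Prefetch the first tile into buffer 0.
    As[buf][threadIdx.y][threadIdx.x] = A[row * K + threadIdx.x];
    Bs[buf][threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
    __syncthreads();

    for (int t = 0; t < numTiles; ++t) {
        int next = buf ^ 1;
        // Issue global loads for the next tile into the other buffer
        // while this iteration's math runs on the current buffer.
        if (t + 1 < numTiles) {
            int k0 = (t + 1) * TILE;
            As[next][threadIdx.y][threadIdx.x] = A[row * K + k0 + threadIdx.x];
            Bs[next][threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        }

        for (int k = 0; k < TILE; ++k)
            acc += As[buf][threadIdx.y][k] * Bs[buf][k][threadIdx.x];

        // One barrier per tile: the next buffer is fully written and the
        // current buffer is no longer needed, so the buffers can swap.
        __syncthreads();
        buf = next;
    }
    C[row * N + col] = acc;
}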
Performance: ~40-50% of cuBLAS
Level 5: Register Blocking
Core Idea
Each thread computes multiple output elements, increasing compute density and keeping data in registers.
Parameter Configuration
template<
    int BM,   // Block M dimension (128)
    int BN,   // Block N dimension (128)
    int BK,   // K dimension per iteration (8)
    int TM,   // Thread M dimension (8)
    int TN    // Thread N dimension (8)
>
__global__ void optimized_gemm(...);
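Below is a sketch of a kernel body built around these parameters, instantiated with BM = BN = 128, BK = 8, TM = TN = 8 and a 16×16 thread block. Loads are kept scalar and boundary handling is omitted, so this shows the register-blocking structure rather than reproducing the exact optimized_gemm above.

// Assumes M, N, K are multiples of BM, BN, BK respectively.
template <int BM, int BN, int BK, int TM, int TN>
__global__ void register_blocked_gemm(const float* A, const float* B, float* C,
                                      int M, int N, int K) {
    __shared__ float As[BM][BK];
    __shared__ float Bs[BK][BN];

    int tx = threadIdx.x;                     // thread column within the block
    int ty = threadIdx.y;                     // thread row within the block
    int tid = ty * blockDim.x + tx;
    int nthreads = blockDim.x * blockDim.y;   // 256 for a 16x16 block

    float acc[TM][TN] = {{0.0f}};             // TM x TN outputs kept in registers
    float a_reg[TM];
    float b_reg[TN];

    for (int t = 0; t < K; t += BK) {
        // Cooperative loads: the whole block fills the BM x BK and BK x BN tiles.
        for (int i = tid; i < BM * BK; i += nthreads) {
            int r = i / BK, c = i % BK;
            As[r][c] = A[(blockIdx.y * BM + r) * K + t + c];
        }
        for (int i = tid; i < BK * BN; i += nthreads) {
            int r = i / BN, c = i % BN;
            Bs[r][c] = B[(t + r) * N + blockIdx.x * BN + c];
        }
        __syncthreads();

        for (int k = 0; k < BK; ++k) {
            // Stage one column of As and one row of Bs in registers, then do
            // TM * TN fused multiply-adds for only TM + TN shared-memory reads.
            for (int i = 0; i < TM; ++i) a_reg[i] = As[ty * TM + i][k];
            for (int j = 0; j < TN; ++j) b_reg[j] = Bs[k][tx * TN + j];
            for (int i = 0; i < TM; ++i)
                for (int j = 0; j < TN; ++j)
                    acc[i][j] += a_reg[i] * b_reg[j];
        }
        __syncthreads();
    }

    // Each thread writes its TM x TN tile of C.
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            C[(blockIdx.y * BM + ty * TM + i) * N + blockIdx.x * BN + tx * TN + j] = acc[i][j];
}
// Launch: dim3 grid(N / BN, M / BM), dim3 block(BN / TN, BM / TM)  -> (16, 16)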
Performance: ~70-80% of cuBLAS
Level 6: Kernel Fusion
Core Idea
Merge multiple operations into one kernel, eliminating intermediate result memory reads/writes.
Fusion Effect Comparison
Separate Execution:
═════════════════════════════════════════════════════════════
GEMM: C = A × B → Read A, B; Write C
Bias: C = C + bias → Read C, bias; Write C
ReLU: C = max(0, C) → Read C; Write C
═════════════════════════════════════════════════════════════
Total: 2 reads of C + 3 writes of C = 5 global accesses to C
Fused Execution:
═════════════════════════════════════════════════════════════
Fused: C = ReLU(A × B + bias)
═════════════════════════════════════════════════════════════
Total: 1 write of C, 0 intermediate accesses
Saved: 2 round trips of C (write + read) = 4 × M × N × sizeof(float) bytes
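As a sketch, here is the simple tiled kernel from the start of this post extended with a bias vector of length N and a ReLU applied in registers before the single write of C; a production version would fuse the same epilogue into the register-blocked kernel instead. TILE_SIZE is the macro from the earlier sketch.

// Fused C = ReLU(A * B + bias); bias has length N and is broadcast across rows.
// Same tiling assumptions as the plain tiled kernel.
__global__ void fused_gemm_bias_relu(const float* A, const float* B,
                                     const float* bias, float* C,
                                     int M, int N, int K) {
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE_SIZE) {
        As[threadIdx.y][threadIdx.x] = A[row * K + t + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE_SIZE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    // Epilogue fused in registers: bias add and ReLU happen before the single
    // global write, so the intermediate C is never written out and read back.
    C[row * N + col] = fmaxf(acc + bias[col], 0.0f);
}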
Performance: ~80-85% of cuBLAS
Level 7: Vectorized Loads
Core Idea
Use 128-bit vector loads (float4) to reduce memory transaction count.
Vector Load Principle
// Scalar load: 4 transactions of 32 bits
float a = A[idx];
float b = A[idx + 1];
float c = A[idx + 2];
float d = A[idx + 3];

// Vector load: 1 transaction of 128 bits
float4 vec = *reinterpret_cast<const float4*>(&A[idx]);
// vec.x, vec.y, vec.z, vec.w
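Applied to the global-to-shared copy in a GEMM kernel, the idea looks roughly like the fragment below. This is only an illustration: tile_row, tile_col and smem_tile are made-up names, and it assumes the row pitch K is a multiple of 4 and the source pointer is 16-byte aligned (a hard requirement for float4 loads).

// Each thread copies four consecutive floats of one row of A into the
// shared tile with a single 128-bit transaction.
const float4* src = reinterpret_cast<const float4*>(&A[tile_row * K + tile_col]);
float4 v = src[threadIdx.x];
smem_tile[threadIdx.y][4 * threadIdx.x + 0] = v.x;
smem_tile[threadIdx.y][4 * threadIdx.x + 1] = v.y;
smem_tile[threadIdx.y][4 * threadIdx.x + 2] = v.z;
smem_tile[threadIdx.y][4 * threadIdx.x + 3] = v.w;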
Performance: ~85-90% of cuBLAS
Performance Summary
Measured Performance (RTX 3080, 1024×1024×1024)
Kernel          Time (ms)   GFLOPS   vs cuBLAS
═════════════════════════════════════════════
cuBLAS          0.31        6920     100%
Naive           3.10         694      10%
Tiled           1.55        1388      20%
Coalesced       1.03        2088      30%
Double Buffer   0.78        2768      40%
Optimized       0.44        4870      70%
Fused           0.38        5630      81%
Vectorized      0.35        6130      89%
Matrix Size Strategy
═════════════════════════════════════════════════════════════
Small matrices (< 512):
- Use small block size (64 × 64)
- Avoid double buffer
- Consider batched GEMM
Medium matrices (512 - 2048):
- Standard config (128 × 128)
- Enable all optimizations
- Use AutoTuner for best config
Large matrices (> 2048):
- Large block size (128 × 256)
- Vectorized loads
- Use async copy (Ampere+)
═════════════════════════════════════════════════════════════
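One way to encode these guidelines in host code is a size-based dispatcher. The thresholds follow the box above; the launch_* helpers are hypothetical names standing in for the kernels sketched earlier, not an existing API.

#include <algorithm>

// Hypothetical kernel launchers, one per tile configuration.
void launch_gemm_64x64(const float*, const float*, float*, int, int, int);
void launch_gemm_128x128(const float*, const float*, float*, int, int, int);
void launch_gemm_128x256_async(const float*, const float*, float*, int, int, int);

// Pick a tile configuration from the problem size.
void launch_gemm(const float* A, const float* B, float* C, int M, int N, int K) {
    int min_dim = std::min({M, N, K});
    if (min_dim < 512) {
        // Small: 64x64 blocks, no double buffering; batch many small GEMMs
        // into one launch at the caller if possible.
        launch_gemm_64x64(A, B, C, M, N, K);
    } else if (min_dim <= 2048) {
        // Medium: standard 128x128 configuration with all optimizations on,
        // optionally refined by the auto-tuner.
        launch_gemm_128x128(A, B, C, M, N, K);
    } else {
        // Large: 128x256 blocks, vectorized loads, cp.async on Ampere and newer.
        launch_gemm_128x256_async(A, B, C, M, N, K);
    }
}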