GEMM Fundamentals

Mathematical Definition

GEMM (General Matrix Multiply) formula:

C = α × A × B + β × C

Simplified version (α = 1, β = 0):

          K
C(m,n) =  Σ  A(m,k) × B(k,n)
         k=1

Where:

  • A: M × K matrix
  • B: K × N matrix
  • C: M × N matrix

Computational Complexity Analysis

| Metric | Formula | Description |
|--------|---------|-------------|
| FLOPs | 2 × M × N × K | one multiply + one add per term |
| Memory traffic | (M×K + K×N + M×N) floats | read A and B, write C |
| Arithmetic intensity | 2×M×N×K / (4 × (M×K + K×N + M×N)) | FLOPs per byte (FP32) |
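
For example, at M = N = K = 1024: FLOPs = 2 × 1024³ ≈ 2.15 GFLOP, minimum traffic = 3 × 1024² × 4 B ≈ 12.6 MB, giving an arithmetic intensity of about 170 FLOPs/byte. That is well into the compute-bound regime on modern GPUs, which is why a well-optimized GEMM can approach peak FLOPS instead of being limited by memory bandwidth.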

Why GEMM Matters

Core computation in neural networks:

| Neural Network Layer | GEMM Form |
|----------------------|-----------|
| Fully connected | Y = X × W + b |
| Convolution (im2col) | converted to GEMM |
| Attention | Q × K^T, Score × V |
| LSTM/GRU | multiple GEMM operations |

Level 1: Naive Implementation

Algorithm

Each thread computes one element of the output matrix.

__global__ void naive_matmul_kernel(const float* A, const float* B, float* C,
                                     int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; k++) {
            sum += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

void launch_naive_matmul(const float* A, const float* B, float* C,
                         int M, int N, int K, cudaStream_t stream) {
    dim3 block(16, 16);
    dim3 grid((N + 15) / 16, (M + 15) / 16);
    naive_matmul_kernel<<<grid, block, 0, stream>>>(A, B, C, M, N, K);
}
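
A minimal host-side driver for the launcher above (a sketch: error checking and matrix initialization are omitted for brevity):

#include <cuda_runtime.h>

int main() {
    const int M = 1024, N = 1024, K = 1024;
    float *A, *B, *C;

    // Device allocations; cudaMalloc pointers are suitably aligned
    cudaMalloc(&A, M * K * sizeof(float));
    cudaMalloc(&B, K * N * sizeof(float));
    cudaMalloc(&C, M * N * sizeof(float));

    // ... fill A and B (e.g. cudaMemcpy from host) ...

    launch_naive_matmul(A, B, C, M, N, K, /*stream=*/0);
    cudaDeviceSynchronize();  // wait for completion before reading C

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}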

Memory Access Pattern

Thread (row, col) access pattern:
═════════════════════════════════════════════════════════════
A: A[row, 0], A[row, 1], ..., A[row, K-1]   ← K reads (sequential)
B: B[0, col], B[1, col], ..., B[K-1, col]   ← K reads (strided)
C: C[row, col]                               ← 1 write
═════════════════════════════════════════════════════════════
Total: 2K reads + 1 write per output element

Performance Bottlenecks

| Issue | Impact | Solution |
|-------|--------|----------|
| Excessive global memory traffic | bandwidth bottleneck | use shared memory |
| Per-thread strided access to B | poor locality, low efficiency | transpose read or adjust layout |
| No data reuse | every input element re-read from global memory | tile loading into shared memory |

Performance: ~5-10% of cuBLAS


Level 2: Tiled GEMM

Core Idea

Divide matrices into small tiles, load into shared memory for reuse.

Global memory access reduction:
═════════════════════════════════════════════════════════════
Naive: each element of A is read N times, each element of B M times
Tiled: each element of A is read N/TILE_SIZE times, each of B M/TILE_SIZE times
═════════════════════════════════════════════════════════════
Reduction factor: TILE_SIZE

Algorithm

constexpr int TILE_SIZE = 32;

__global__ void tiled_gemm_kernel(const float* A, const float* B, float* C,
                                   int M, int N, int K) {
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];
    
    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    
    float sum = 0.0f;
    int num_tiles = (K + TILE_SIZE - 1) / TILE_SIZE;
    
    for (int t = 0; t < num_tiles; t++) {
        // Cooperative load A tile
        int a_col = t * TILE_SIZE + threadIdx.x;
        if (row < M && a_col < K) {
            As[threadIdx.y][threadIdx.x] = A[row * K + a_col];
        } else {
            As[threadIdx.y][threadIdx.x] = 0.0f;
        }
        
        // Cooperative load B tile
        int b_row = t * TILE_SIZE + threadIdx.y;
        if (b_row < K && col < N) {
            Bs[threadIdx.y][threadIdx.x] = B[b_row * N + col];
        } else {
            Bs[threadIdx.y][threadIdx.x] = 0.0f;
        }
        
        __syncthreads();
        
        // Compute partial sum
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        
        __syncthreads();
    }
    
    if (row < M && col < N) {
        C[row * N + col] = sum;
    }
}
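
A matching launcher, mirroring launch_naive_matmul from Level 1 (a sketch; the name launch_tiled_gemm is illustrative):

void launch_tiled_gemm(const float* A, const float* B, float* C,
                       int M, int N, int K, cudaStream_t stream) {
    dim3 block(TILE_SIZE, TILE_SIZE);
    dim3 grid((N + TILE_SIZE - 1) / TILE_SIZE,
              (M + TILE_SIZE - 1) / TILE_SIZE);
    tiled_gemm_kernel<<<grid, block, 0, stream>>>(A, B, C, M, N, K);
}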

Performance: ~20-30% of cuBLAS


Level 3: Memory Coalescing

Core Idea

Ensure threads in the same warp access consecutive memory addresses to maximize memory bandwidth.

Coalesced vs Non-coalesced Access

Coalesced Access - Efficient:
═════════════════════════════════════════════════════════════
Thread  0  1  2  3  ...  31
        │  │  │  │       │
        ▼  ▼  ▼  ▼       ▼
Addr  0x1000 0x1004 0x1008 0x100C ... 0x107C
═════════════════════════════════════════════════════════════
→ 1 memory transaction of 128 bytes

Non-coalesced Access - Inefficient:
═════════════════════════════════════════════════════════════
Thread  0     1     2     3     ...  31
        │     │     │     │          │
        ▼     ▼     ▼     ▼          ▼
Addr  0x1000 0x2000 0x3000 0x4000 ... 0x20000
═════════════════════════════════════════════════════════════
→ 32 separate memory transactions
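
The two patterns above, written as minimal illustrative kernels (sketches, not part of this project's source):

// Coalesced: lane i reads element i, so a warp covers 128 consecutive bytes
__global__ void coalesced_read(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Non-coalesced: lane i reads element i × stride; for large strides every
// lane touches a different cache line, multiplying the transaction count
__global__ void strided_read(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}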

Bank Conflict Avoidance

Use padding to avoid shared memory bank conflicts:

// With bank conflicts: column 0 of every row maps to bank 0
__shared__ float smem[32][32];
float val = smem[threadIdx.x][0];  // 32-way conflict: all lanes hit bank 0

// No bank conflict: +1 padding shifts each row by one bank
__shared__ float smem_padded[32][33];
float val2 = smem_padded[threadIdx.x][0];  // lanes spread across banks 0-31

Performance: ~30-40% of cuBLAS


Level 4: Double Buffering

Core Idea

Use two sets of shared memory buffers, prefetch next tile while computing current tile.

Timeline Comparison

Without Double Buffering:
═════════════════════════════════════════════════════════════
├─Load 0─┼─Comp 0─┼─Load 1─┼─Comp 1─┼─Load 2─┼─Comp 2─┤
          Compute waits for load

With Double Buffering:
═════════════════════════════════════════════════════════════
├─Load 0─┼─────────────────────────────────────────────┤
          ├─Comp 0─┼─Comp 1─┼─Comp 2─┤
                    ├─Load 1─┤
                              ├─Load 2─┤
          Load and compute overlap
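
A simplified double-buffered version of the Level 2 kernel (a sketch: M, N, and K are assumed to be multiples of TILE_SIZE, and production kernels additionally stage the prefetch through registers, or use cp.async on Ampere, for stronger overlap):

__global__ void double_buffered_gemm_kernel(const float* A, const float* B,
                                            float* C, int M, int N, int K) {
    // Two buffers per operand; the kernel ping-pongs between them
    __shared__ float As[2][TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[2][TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    int num_tiles = K / TILE_SIZE;

    // Preload tile 0 into buffer 0
    As[0][threadIdx.y][threadIdx.x] = A[row * K + threadIdx.x];
    Bs[0][threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
    __syncthreads();

    float sum = 0.0f;
    for (int t = 0; t < num_tiles; t++) {
        int cur = t % 2;
        int nxt = 1 - cur;

        // Issue the next tile's loads before computing on the current one,
        // so the loads can overlap with the inner-product loop below
        if (t + 1 < num_tiles) {
            As[nxt][threadIdx.y][threadIdx.x] =
                A[row * K + (t + 1) * TILE_SIZE + threadIdx.x];
            Bs[nxt][threadIdx.y][threadIdx.x] =
                B[((t + 1) * TILE_SIZE + threadIdx.y) * N + col];
        }

        for (int k = 0; k < TILE_SIZE; k++) {
            sum += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];
        }
        // One barrier per tile: ensures the prefetched buffer is complete
        // and the just-read buffer is safe to overwrite next iteration
        __syncthreads();
    }

    C[row * N + col] = sum;
}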

Performance: ~40-50% of cuBLAS


Level 5: Register Blocking

Core Idea

Each thread computes multiple output elements, increasing compute density and keeping data in registers.

Parameter Configuration

template<
    int BM,    // Block M dimension (128)
    int BN,    // Block N dimension (128)
    int BK,    // K dimension per iteration (8)
    int TM,    // Thread M dimension (8)
    int TN     // Thread N dimension (8)
>
__global__ void optimized_gemm(...);
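
A reduced register-blocking sketch with BM = BN = 64, BK = 8, TM = TN = 4 (assumptions: a 16×16 thread block, a grid of (N/BN, M/BM), and M, N, K multiples of the tile sizes; the template above uses the larger BM = BN = 128, TM = TN = 8 configuration):

constexpr int BM = 64, BN = 64, BK = 8, TM = 4, TN = 4;

__global__ void register_blocked_gemm_kernel(const float* A, const float* B,
                                             float* C, int M, int N, int K) {
    __shared__ float As[BM][BK];
    __shared__ float Bs[BK][BN];

    int tid = threadIdx.y * 16 + threadIdx.x;   // 0..255
    int thread_row = threadIdx.y * TM;          // micro-tile origin in block
    int thread_col = threadIdx.x * TN;

    float acc[TM][TN] = {};                     // accumulators in registers
    float a_reg[TM], b_reg[TN];

    for (int t = 0; t < K / BK; t++) {
        // Cooperative loads: 256 threads fill 512 elements per tile
        for (int i = tid; i < BM * BK; i += 256) {
            As[i / BK][i % BK] =
                A[(blockIdx.y * BM + i / BK) * K + t * BK + i % BK];
        }
        for (int i = tid; i < BK * BN; i += 256) {
            Bs[i / BN][i % BN] =
                B[(t * BK + i / BN) * N + blockIdx.x * BN + i % BN];
        }
        __syncthreads();

        // Outer product over BK: stage strips of As and Bs in registers,
        // then do TM × TN fused multiply-adds entirely on registers
        for (int k = 0; k < BK; k++) {
            for (int i = 0; i < TM; i++) a_reg[i] = As[thread_row + i][k];
            for (int j = 0; j < TN; j++) b_reg[j] = Bs[k][thread_col + j];
            for (int i = 0; i < TM; i++)
                for (int j = 0; j < TN; j++)
                    acc[i][j] += a_reg[i] * b_reg[j];
        }
        __syncthreads();
    }

    // Each thread writes its TM × TN micro-tile of C
    for (int i = 0; i < TM; i++)
        for (int j = 0; j < TN; j++)
            C[(blockIdx.y * BM + thread_row + i) * N
              + blockIdx.x * BN + thread_col + j] = acc[i][j];
}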

Performance: ~70-80% of cuBLAS


Level 6: Kernel Fusion

Core Idea

Merge multiple operations into one kernel, eliminating intermediate result memory reads/writes.

Fusion Effect Comparison

Separate Execution:
═════════════════════════════════════════════════════════════
GEMM:  C = A × B           → Read A, B; Write C
Bias:  C = C + bias        → Read C, bias; Write C
ReLU:  C = max(0, C)       → Read C; Write C
═════════════════════════════════════════════════════════════
Total: 2 reads of C + 3 writes of C = 5 full passes over C

Fused Execution:
═════════════════════════════════════════════════════════════
Fused: C = ReLU(A × B + bias)  → Read A, B, bias; Write C once
═════════════════════════════════════════════════════════════
Total: 0 intermediate memory accesses
Saved: 4 × M × N × sizeof(float) bytes
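
A sketch of this fusion applied to the Level 2 tiled kernel: the bias add and ReLU happen in registers right before the single store of C (the kernel name and the per-column bias layout are illustrative assumptions):

__global__ void fused_gemm_bias_relu_kernel(const float* A, const float* B,
                                            const float* bias, float* C,
                                            int M, int N, int K) {
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
        int a_col = t * TILE_SIZE + threadIdx.x;
        int b_row = t * TILE_SIZE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] =
            (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE_SIZE; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();
    }

    if (row < M && col < N) {
        // Epilogue fused in registers: no intermediate reads/writes of C
        C[row * N + col] = fmaxf(sum + bias[col], 0.0f);
    }
}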

Performance: ~80-85% of cuBLAS


Level 7: Vectorized Loads

Core Idea

Use 128-bit vector loads (float4) to reduce memory transaction count.

Vector Load Principle

// Scalar loads: 4 separate 32-bit load instructions
float a = A[idx];
float b = A[idx + 1];
float c = A[idx + 2];
float d = A[idx + 3];

// Vector load: a single 128-bit load instruction
// (requires &A[idx] to be 16-byte aligned)
float4 vec = *reinterpret_cast<const float4*>(&A[idx]);
// vec.x, vec.y, vec.z, vec.w
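
A minimal kernel using the vectorized pattern (a sketch: assumes n is a multiple of 4; cudaMalloc allocations satisfy the 16-byte alignment requirement):

__global__ void vec4_copy(const float* __restrict__ in,
                          float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one float4 per thread
    if (i * 4 < n) {
        float4 v = reinterpret_cast<const float4*>(in)[i];
        reinterpret_cast<float4*>(out)[i] = v;
    }
}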

Performance: ~85-90% of cuBLAS


Performance Summary

Measured Performance (RTX 3080, 1024×1024×1024)

| Kernel | Time (ms) | GFLOPS | vs cuBLAS |
|--------|-----------|--------|-----------|
| cuBLAS | 0.31 | 6920 | 100% |
| Naive | 3.10 | 694 | 10% |
| Tiled | 1.55 | 1388 | 20% |
| Coalesced | 1.03 | 2088 | 30% |
| Double Buffer | 0.78 | 2768 | 40% |
| Optimized | 0.44 | 4870 | 70% |
| Fused | 0.38 | 5630 | 81% |
| Vectorized | 0.35 | 6130 | 89% |

Matrix Size Strategy

═════════════════════════════════════════════════════════════
Small matrices (< 512):
  - Use small block size (64 × 64)
  - Avoid double buffer
  - Consider batched GEMM

Medium matrices (512 - 2048):
  - Standard config (128 × 128)
  - Enable all optimizations
  - Use AutoTuner for best config

Large matrices (> 2048):
  - Large block size (128 × 256)
  - Vectorized loads
  - Use async copy (Ampere+)
═════════════════════════════════════════════════════════════

Future Optimizations

1. Tensor Core (WMMA)

  • Use FP16/BF16 precision
  • Leverage dedicated matrix compute units
  • Can achieve 2-4x FP32 performance (a minimal WMMA sketch follows)
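
A minimal single-warp WMMA sketch (sm_70+): one warp computes one 16×16 tile of C with FP16 inputs and FP32 accumulation. The kernel name, row-major layouts, and single-tile scope are illustrative assumptions:

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_16x16_tile(const half* A, const half* B, float* C,
                                int K, int N) {
    // Fragments are opaque, warp-wide register tiles
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + k, K);        // A rows 0..15
        wmma::load_matrix_sync(b_frag, B + k * N, N);    // B cols 0..15
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // Tensor Core MMA
    }
    wmma::store_matrix_sync(C, c_frag, N, wmma::mem_row_major);
}

Launched with a single warp (e.g. wmma_16x16_tile<<<1, 32>>>(A, B, C, K, N)), this computes the top-left 16×16 tile of C; a full GEMM tiles this pattern across a grid of warps.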

2. Async Copy (Ampere+)

#include <cuda_pipeline.h>  // Ampere (SM 8.0+) pipeline primitives

// Ampere async copy: global → shared directly, hiding memory latency
__pipeline_memcpy_async(&smem[idx], &gmem[idx], sizeof(float4));
__pipeline_commit();
__pipeline_wait_prior(0);

3. Warp-level Optimization

  • Use warp shuffle instructions
  • Warp-level matrix multiplication (a shuffle sketch follows)
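
A minimal warp shuffle sketch: sum-reduce a value across the 32 lanes of a warp with no shared memory, a building block for dot-product-style kernels (the helper name is illustrative):

__device__ float warp_reduce_sum(float v) {
    // Each step folds the upper half of the active lanes onto the lower half
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;  // lane 0 ends up holding the full warp sum
}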


*Last Updated: 2025-04-16 Document Version: v1.1.0*


MIT License | A learning project for the CUDA community