Understanding GPU Performance

Key Performance Metrics

| Metric | Definition | Target |
|--------|------------|--------|
| GFLOPS | Giga floating-point operations per second | Near theoretical peak |
| Memory Bandwidth | Actual throughput / peak throughput | > 80% |
| Arithmetic Intensity | FLOPs / bytes | Varies by matrix size |
| Occupancy | Active warps / maximum warps | > 50% |

Roofline Performance Model

Performance (GFLOPS)
    ^
    |    ╭────────── Peak Compute
    |   /
    |  /  Memory Bound | Compute Bound
    | /
    |╱________________> Arithmetic Intensity (FLOPs/Byte)
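
The model says attainable throughput is min(peak compute, bandwidth × arithmetic intensity). A minimal host-side sketch (`roofline_gflops` is a hypothetical helper, not part of any library):

```cpp
#include <algorithm>

// Roofline model: attainable throughput is capped either by raw
// compute or by memory bandwidth times arithmetic intensity.
double roofline_gflops(double ai, double peak_gflops, double peak_bw_gbs) {
    return std::min(peak_gflops, peak_bw_gbs * ai);
}
```

With, e.g., ~19,500 GFLOPS peak and ~1,555 GB/s bandwidth (roughly A100-class FP32 numbers, used here only for illustration), the ridge point sits near 12.5 FLOPs/byte, so GEMM at AI ≈ 170 is firmly compute bound.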

Arithmetic Intensity Formula

GEMM Arithmetic Intensity:

AI = (2 × M × N × K) / ((M×K + K×N + M×N) × 4 bytes)

Example (1024×1024×1024):
AI = 2 × 1024³ / (3 × 1024² × 4) ≈ 170.7 FLOPs/Byte
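
The formula translates directly into code; a minimal sketch (`gemm_ai` is a hypothetical helper):

```cpp
// FP32 GEMM arithmetic intensity: 2*M*N*K FLOPs over the bytes of the
// three matrices (A: M*K, B: K*N, C: M*N), at 4 bytes per float.
double gemm_ai(long long M, long long N, long long K) {
    double flops = 2.0 * M * N * K;
    double bytes = 4.0 * static_cast<double>(M * K + K * N + M * N);
    return flops / bytes;
}
```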

Block Size Selection

Basic Parameters

| Parameter | Description | Typical Range |
|-----------|-------------|---------------|
| BM | Block tile size in M | 64 - 256 |
| BN | Block tile size in N | 64 - 256 |
| BK | K step per iteration | 8 - 32 |
| TM | Per-thread tile size in M | 4 - 8 |
| TN | Per-thread tile size in N | 4 - 8 |

Configurations by Matrix Size

// Small matrices (< 512)
GemmConfig small = {
    .BLOCK_M = 64, .BLOCK_N = 64, .BLOCK_K = 8,
    .TM = 4, .TN = 4,
    .use_double_buffer = false
};

// Medium matrices (512 - 2048)
GemmConfig medium = {
    .BLOCK_M = 128, .BLOCK_N = 128, .BLOCK_K = 8,
    .TM = 8, .TN = 8,
    .use_double_buffer = true
};

// Large matrices (> 2048)
GemmConfig large = {
    .BLOCK_M = 128, .BLOCK_N = 256, .BLOCK_K = 16,
    .TM = 8, .TN = 8,
    .use_double_buffer = true
};
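
These three configurations could be selected at runtime with a small helper. A sketch whose struct fields and thresholds mirror the snippets above; `select_config` is a hypothetical name:

```cpp
// Mirrors the GemmConfig fields used above.
struct GemmConfig {
    int BLOCK_M, BLOCK_N, BLOCK_K;
    int TM, TN;
    bool use_double_buffer;
};

// Pick a config from the largest GEMM dimension, using the same
// size thresholds as the comments above (512 and 2048).
GemmConfig select_config(int max_dim) {
    if (max_dim < 512)   return {64, 64, 8, 4, 4, false};    // small
    if (max_dim <= 2048) return {128, 128, 8, 8, 8, true};   // medium
    return {128, 256, 16, 8, 8, true};                       // large
}
```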

Constraint Validation

Constraints:
═════════════════════════════════════════════════════════════
1. Threads per block ≤ 1024
2. Shared memory per block ≤ 48KB static (more via opt-in dynamic allocation on newer GPUs, e.g. up to 163KB on A100)
3. Registers per thread ≤ 255
═════════════════════════════════════════════════════════════
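
The first two constraints can be checked before launch. A sketch, assuming an FP32 kernel that stages one BM×BK tile of A and one BK×BN tile of B in shared memory (doubled when double buffering); register usage is only known after compilation (e.g. via `nvcc -Xptxas -v`), so it is not checked here:

```cpp
// Validate a tiling against the launch constraints above.
// Each thread computes a TM x TN tile, so the block has
// (BM/TM) * (BN/TN) threads.
bool config_is_valid(int BM, int BN, int BK, int TM, int TN,
                     bool double_buffer,
                     int smem_limit_bytes = 48 * 1024) {
    int threads = (BM / TM) * (BN / TN);
    int buffers = double_buffer ? 2 : 1;
    int smem = (BM * BK + BK * BN) * 4 * buffers;  // FP32 tiles of A and B
    return threads <= 1024 && smem <= smem_limit_bytes;
}
```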

Memory Optimization

Coalesced Memory Access

// Good: adjacent threads read adjacent float4s, so each warp's
// accesses coalesce into full cache lines
for (int i = 0; i < N; i += blockDim.x * 4) {
    float4 val = load_float4(&data[i + threadIdx.x * 4]);
    // ... use val
}

// Bad: stride-N access pattern; each warp touches 32 scattered lines
for (int i = 0; i < N; i++) {
    float val = data[threadIdx.x * N + i];
}

Bank Conflict Avoidance

// Bank conflict
__shared__ float smem[32][32];
float val = smem[threadIdx.x][0];  // All in bank 0

// No bank conflict with padding
__shared__ float smem[32][33];
float val = smem[threadIdx.x][0];  // Distributed across banks
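
The banking rule can be checked on the host: shared memory is split into 32 banks of 4-byte words, and a word's bank is its word index modulo 32. A sketch (`bank_of` is a hypothetical helper) showing why the padding column spreads a column read across banks:

```cpp
// Bank of element [row][col] in a row-major float array of the given
// width: word index modulo the 32 shared-memory banks.
int bank_of(int row, int col, int width) {
    return (row * width + col) % 32;
}
```

With `width == 32`, every thread reading column 0 hits bank 0 (a 32-way conflict); with `width == 33`, thread t hits bank t % 32.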

Vectorized Loads

__device__ __forceinline__ float4 load_float4(const float* ptr) {
    return *reinterpret_cast<const float4*>(ptr);
}
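
One caveat: a float4 load moves 16 bytes at once, so the pointer must be 16-byte aligned, or the `reinterpret_cast` is undefined behavior. A host-side sketch (`aligned_for_float4` and `example_buf` are illustrative names):

```cpp
#include <cstdint>

// A float4 load requires a 16-byte-aligned source address.
bool aligned_for_float4(const float* ptr) {
    return reinterpret_cast<std::uintptr_t>(ptr) % 16 == 0;
}

// 16-byte-aligned demo buffer: elements 0 and 4 qualify as float4
// load targets, elements 1-3 do not.
alignas(16) float example_buf[8];
```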

GPU Architecture Specific

| Architecture | Compute Capability | Key Features |
|--------------|--------------------|--------------|
| Volta | 7.0 | Independent thread scheduling |
| Turing | 7.5 | Tensor Cores |
| Ampere | 8.0 | Async copy (cp.async) |
| Ada | 8.9 | Improved Tensor Cores |
| Hopper | 9.0 | Transformer Engine |

Ampere Async Copy (SM 8.0+)

#include <cuda_pipeline.h>

#if __CUDA_ARCH__ >= 800
// Asynchronously stage data from global into shared memory.
// Note: each __pipeline_memcpy_async call may copy only 4, 8, or 16 bytes.
__device__ void async_copy(float* smem, const float* gmem, int bytes) {
    __pipeline_memcpy_async(smem, gmem, bytes);
    __pipeline_commit();
    __pipeline_wait_prior(0);
}
#endif

Performance Analysis Tools

NVIDIA Nsight Compute

# Detailed kernel analysis
ncu --set full -o report ./benchmark

# Key metrics
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
memory__throughput.avg.pct_of_peak_sustained_elapsed,\
launch__occupancy ./benchmark

Key Metrics

| Metric | Description | Healthy Value |
|--------|-------------|---------------|
| Occupancy | SM utilization | > 50% |
| Memory Throughput | Bandwidth utilization | > 80% |
| Compute Throughput | Compute utilization | > 70% |
| L1/TEX Hit Rate | L1 cache hit rate | > 90% |
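
Theoretical occupancy can be estimated from a kernel's resource usage before profiling. A rough sketch; the per-SM limits are illustrative Ampere-class defaults, and real hardware adds further limits (e.g. max resident blocks per SM) that this ignores:

```cpp
#include <algorithm>
#include <climits>

// Theoretical occupancy of one SM: resident blocks are limited by
// registers and shared memory; occupancy = active warps / max warps.
double occupancy(int threads_per_block, int regs_per_thread,
                 int smem_per_block,
                 int max_warps = 48, int regs_per_sm = 65536,
                 int smem_per_sm = 100 * 1024) {
    int warps_per_block = (threads_per_block + 31) / 32;
    // How many blocks fit on one SM under each resource limit
    int by_regs = regs_per_sm / (regs_per_thread * threads_per_block);
    int by_smem = smem_per_block > 0 ? smem_per_sm / smem_per_block : INT_MAX;
    int resident_blocks = std::min(by_regs, by_smem);
    int active_warps = std::min(resident_blocks * warps_per_block, max_warps);
    return static_cast<double>(active_warps) / max_warps;
}
```

For example, a 256-thread block using 32 registers per thread and 16KB of shared memory reaches full theoretical occupancy under these limits, while raising register usage to 128 per thread drops it sharply.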

Best Practices

Performance Checklist

  • Choose appropriate block size
  • Ensure coalesced memory access
  • Avoid bank conflicts
  • Use vectorized loads
  • Consider double buffering
  • Check occupancy (> 50%)
  • Analyze stall reasons
  • Consider kernel fusion
  • Use AutoTuner

FAQ

Q: Why is small matrix performance poor?

Reason: small matrices cannot saturate the GPU's parallelism, and the fixed kernel launch overhead becomes a large fraction of the runtime.

Solution: Use batched GEMM, reduce block size, or use CPU for matrices < 128.

Q: Why is double buffer sometimes slower?

Reason: double buffering doubles shared memory usage, which can reduce occupancy.

Solution: check shared memory usage in the profiler, and skip double buffering for small matrices.

Q: FP16 vs FP32?

Rule of thumb:

  • Inference: Use FP16 (weights) + FP32 (accumulation)
  • Training: Use FP32
  • Precision-critical: Use FP32


*Last Updated: 2025-04-16 Document Version: v1.1.0*


MIT License | A learning project for the CUDA community