Understanding GPU Performance

Key Performance Metrics

| Metric | Definition | Target |
|--------|------------|--------|
| GFLOPS | Giga floating-point operations per second | Near theoretical peak |
| Memory Bandwidth | Actual throughput / peak throughput | > 80% |
| Arithmetic Intensity | FLOPs / bytes | Varies by matrix size |
| Occupancy | Active warps / maximum warps | > 50% |

Roofline Performance Model

Performance (GFLOPS)
    ^
    |    ╭────────── Peak Compute
    |   /
    |  /  Memory Bound | Compute Bound
    | /
    |╱________________> Arithmetic Intensity (FLOPs/Byte)
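
The model says attainable throughput is min(peak compute, bandwidth × arithmetic intensity). A minimal host-side sketch (`roofline_gflops` is a hypothetical helper, not part of any library):

```cpp
#include <algorithm>

// Roofline model: attainable throughput is capped either by raw
// compute or by memory bandwidth times arithmetic intensity.
double roofline_gflops(double ai, double peak_gflops, double peak_bw_gbs) {
    return std::min(peak_gflops, peak_bw_gbs * ai);
}
```

With, e.g., ~19,500 GFLOPS peak and ~1,555 GB/s bandwidth (roughly A100-class FP32 numbers, used here only for illustration), the ridge point sits near 12.5 FLOPs/byte, so GEMM at AI ≈ 170 is firmly compute bound.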

Arithmetic Intensity Formula

GEMM Arithmetic Intensity:

AI = (2 × M × N × K) / ((M×K + K×N + M×N) × 4 bytes)

Example (1024×1024×1024):
AI = 2 × 1024³ / (3 × 1024² × 4) ≈ 170.7 FLOPs/Byte
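
The formula translates directly into code; a minimal sketch (`gemm_ai` is a hypothetical helper):

```cpp
// FP32 GEMM arithmetic intensity: 2*M*N*K FLOPs over the bytes of the
// three matrices (A: M*K, B: K*N, C: M*N), at 4 bytes per float.
double gemm_ai(long long M, long long N, long long K) {
    double flops = 2.0 * M * N * K;
    double bytes = 4.0 * static_cast<double>(M * K + K * N + M * N);
    return flops / bytes;
}
```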

Block Size Selection

Basic Parameters

| Parameter | Description | Typical Range |
|-----------|-------------|---------------|
| BM | Block tile size in M | 64 - 256 |
| BN | Block tile size in N | 64 - 256 |
| BK | K step per iteration | 8 - 32 |
| TM | Per-thread tile size in M | 4 - 8 |
| TN | Per-thread tile size in N | 4 - 8 |

Configurations by Matrix Size

// Small matrices (< 512)
GemmConfig small = {
    .BLOCK_M = 64, .BLOCK_N = 64, .BLOCK_K = 8,
    .TM = 4, .TN = 4,
    .use_double_buffer = false
};

// Medium matrices (512 - 2048)
GemmConfig medium = {
    .BLOCK_M = 128, .BLOCK_N = 128, .BLOCK_K = 8,
    .TM = 8, .TN = 8,
    .use_double_buffer = true
};

// Large matrices (> 2048)
GemmConfig large = {
    .BLOCK_M = 128, .BLOCK_N = 256, .BLOCK_K = 16,
    .TM = 8, .TN = 8,
    .use_double_buffer = true
};
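
These three configurations could be selected at runtime with a small helper. A sketch whose struct fields and thresholds mirror the snippets above; `select_config` is a hypothetical name:

```cpp
// Mirrors the GemmConfig fields used above.
struct GemmConfig {
    int BLOCK_M, BLOCK_N, BLOCK_K;
    int TM, TN;
    bool use_double_buffer;
};

// Pick a config from the largest GEMM dimension, using the same
// size thresholds as the comments above (512 and 2048).
GemmConfig select_config(int max_dim) {
    if (max_dim < 512)   return {64, 64, 8, 4, 4, false};    // small
    if (max_dim <= 2048) return {128, 128, 8, 8, 8, true};   // medium
    return {128, 256, 16, 8, 8, true};                       // large
}
```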

Constraint Validation

Constraints:
═════════════════════════════════════════════════════════════
1. Threads per block ≤ 1024
2. Shared memory per block ≤ 48KB static (more via opt-in dynamic allocation on newer GPUs, e.g. up to 163KB on A100)
3. Registers per thread ≤ 255
═════════════════════════════════════════════════════════════
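
The first two constraints can be checked before launch. A sketch, assuming an FP32 kernel that stages one BM×BK tile of A and one BK×BN tile of B in shared memory (doubled when double buffering); register usage is only known after compilation (e.g. via `nvcc -Xptxas -v`), so it is not checked here:

```cpp
// Validate a tiling against the launch constraints above.
// Each thread computes a TM x TN tile, so the block has
// (BM/TM) * (BN/TN) threads.
bool config_is_valid(int BM, int BN, int BK, int TM, int TN,
                     bool double_buffer,
                     int smem_limit_bytes = 48 * 1024) {
    int threads = (BM / TM) * (BN / TN);
    int buffers = double_buffer ? 2 : 1;
    int smem = (BM * BK + BK * BN) * 4 * buffers;  // FP32 tiles of A and B
    return threads <= 1024 && smem <= smem_limit_bytes;
}
```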

Memory Optimization

Coalesced Memory Access

// Good: adjacent threads read adjacent float4s, so each warp's
// accesses coalesce into full cache lines
for (int i = 0; i < N; i += blockDim.x * 4) {
    float4 val = load_float4(&data[i + threadIdx.x * 4]);
    // ... use val
}

// Bad: stride-N access pattern; each warp touches 32 scattered lines
for (int i = 0; i < N; i++) {
    float val = data[threadIdx.x * N + i];
}

Bank Conflict Avoidance

// Bank conflict
__shared__ float smem[32][32];
float val = smem[threadIdx.x][0];  // All in bank 0

// No bank conflict with padding
__shared__ float smem[32][33];
float val = smem[threadIdx.x][0];  // Distributed across banks
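
The banking rule can be checked on the host: shared memory is split into 32 banks of 4-byte words, and a word's bank is its word index modulo 32. A sketch (`bank_of` is a hypothetical helper) showing why the padding column spreads a column read across banks:

```cpp
// Bank of element [row][col] in a row-major float array of the given
// width: word index modulo the 32 shared-memory banks.
int bank_of(int row, int col, int width) {
    return (row * width + col) % 32;
}
```

With `width == 32`, every thread reading column 0 hits bank 0 (a 32-way conflict); with `width == 33`, thread t hits bank t % 32.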

Vectorized Loads

__device__ __forceinline__ float4 load_float4(const float* ptr) {
    return *reinterpret_cast<const float4*>(ptr);
}
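
One caveat: a float4 load moves 16 bytes at once, so the pointer must be 16-byte aligned, or the `reinterpret_cast` is undefined behavior. A host-side sketch (`aligned_for_float4` and `example_buf` are illustrative names):

```cpp
#include <cstdint>

// A float4 load requires a 16-byte-aligned source address.
bool aligned_for_float4(const float* ptr) {
    return reinterpret_cast<std::uintptr_t>(ptr) % 16 == 0;
}

// 16-byte-aligned demo buffer: elements 0 and 4 qualify as float4
// load targets, elements 1-3 do not.
alignas(16) float example_buf[8];
```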

GPU Architecture Specific

| Architecture | Compute Capability | Key Features |
|--------------|--------------------|--------------|
| Volta | 7.0 | Independent thread scheduling |
| Turing | 7.5 | Tensor Cores |
| Ampere | 8.0 | Async copy (cp.async) |
| Ada | 8.9 | Improved Tensor Cores |
| Hopper | 9.0 | Transformer Engine |

Ampere Async Copy (SM 8.0+)

#include <cuda_pipeline.h>

#if __CUDA_ARCH__ >= 800
// Asynchronously stage data from global into shared memory.
// Note: each __pipeline_memcpy_async call may copy only 4, 8, or 16 bytes.
__device__ void async_copy(float* smem, const float* gmem, int bytes) {
    __pipeline_memcpy_async(smem, gmem, bytes);
    __pipeline_commit();
    __pipeline_wait_prior(0);
}
#endif

Performance Analysis Tools

NVIDIA Nsight Compute

# Detailed kernel analysis
ncu --set full -o report ./benchmark

# Key metrics
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
memory__throughput.avg.pct_of_peak_sustained_elapsed,\
launch__occupancy ./benchmark

Key Metrics

| Metric | Description | Healthy Value |
|--------|-------------|---------------|
| Occupancy | SM utilization | > 50% |
| Memory Throughput | Bandwidth utilization | > 80% |
| Compute Throughput | Compute utilization | > 70% |
| L1/TEX Hit Rate | L1 cache hit rate | > 90% |
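
Theoretical occupancy can be estimated from a kernel's resource usage before profiling. A rough sketch; the per-SM limits are illustrative Ampere-class defaults, and real hardware adds further limits (e.g. max resident blocks per SM) that this ignores:

```cpp
#include <algorithm>
#include <climits>

// Theoretical occupancy of one SM: resident blocks are limited by
// registers and shared memory; occupancy = active warps / max warps.
double occupancy(int threads_per_block, int regs_per_thread,
                 int smem_per_block,
                 int max_warps = 48, int regs_per_sm = 65536,
                 int smem_per_sm = 100 * 1024) {
    int warps_per_block = (threads_per_block + 31) / 32;
    // How many blocks fit on one SM under each resource limit
    int by_regs = regs_per_sm / (regs_per_thread * threads_per_block);
    int by_smem = smem_per_block > 0 ? smem_per_sm / smem_per_block : INT_MAX;
    int resident_blocks = std::min(by_regs, by_smem);
    int active_warps = std::min(resident_blocks * warps_per_block, max_warps);
    return static_cast<double>(active_warps) / max_warps;
}
```

For example, a 256-thread block using 32 registers per thread and 16KB of shared memory reaches full theoretical occupancy under these limits, while raising register usage to 128 per thread drops it sharply.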

Best Practices

Performance Checklist

  • Choose appropriate block size
  • Ensure coalesced memory access
  • Avoid bank conflicts
  • Use vectorized loads
  • Consider double buffering
  • Check occupancy (> 50%)
  • Analyze stall reasons
  • Consider kernel fusion
  • Use AutoTuner

FAQ

Q: Why is small matrix performance poor?

Reason: small matrices cannot saturate the GPU's parallelism, and the fixed kernel launch overhead becomes a large fraction of the runtime.

Solution: Use batched GEMM, reduce block size, or use CPU for matrices < 128.

Q: Why is double buffer sometimes slower?

Reason: double buffering doubles shared memory usage, which can reduce occupancy.

Solution: check shared memory usage in the profiler, and skip double buffering for small matrices.

Q: FP16 vs FP32?

Rule of thumb:

  • Inference: Use FP16 (weights) + FP32 (accumulation)
  • Training: Use FP32
  • Precision-critical: Use FP32


*Last Updated: 2025-04-16 Document Version: v1.1.0*


MIT License | A learning project for the CUDA community