| Metric | Definition | Target |
|---|---|---|
| GFLOPS | Giga floating-point operations per second | Near theoretical peak |
| Memory Bandwidth | Actual throughput / peak throughput | > 80% |
| Arithmetic Intensity | FLOPs / bytes | Varies by matrix size |
| Occupancy | Active warps / maximum warps | > 50% |
```text
Performance (GFLOPS)
^
|      ╭────────── Peak Compute
|     /
|    /  Memory Bound │ Compute Bound
|   /
|__╱________________> Arithmetic Intensity (FLOPs/Byte)
```
```text
GEMM Arithmetic Intensity:
    AI = (2 × M × N × K) / ((M×K + K×N + M×N) × 4 bytes)

Example (1024 × 1024 × 1024):
    AI = 2 × 1024³ / (3 × 1024² × 4) ≈ 170.7 FLOPs/Byte
```
Block Size Selection
Basic Parameters
| Parameter | Description | Typical Range |
|---|---|---|
| BM | Block tile M dimension | 64-256 |
| BN | Block tile N dimension | 64-256 |
| BK | K dimension per iteration | 8-32 |
| TM | Per-thread tile M dimension | 4-8 |
| TN | Per-thread tile N dimension | 4-8 |
Configurations by Matrix Size
```cuda
// Small matrices (< 512)
GemmConfig small = {
    .BLOCK_M = 64, .BLOCK_N = 64, .BLOCK_K = 8,
    .TM = 4, .TN = 4,
    .use_double_buffer = false
};

// Medium matrices (512 - 2048)
GemmConfig medium = {
    .BLOCK_M = 128, .BLOCK_N = 128, .BLOCK_K = 8,
    .TM = 8, .TN = 8,
    .use_double_buffer = true
};

// Large matrices (> 2048)
GemmConfig large = {
    .BLOCK_M = 128, .BLOCK_N = 256, .BLOCK_K = 16,
    .TM = 8, .TN = 8,
    .use_double_buffer = true
};
```
Constraint Validation
```text
Constraints:
═════════════════════════════════════════════════════════════
1. Threads per block ≤ 1024
2. Shared memory ≤ 48 KB by default (more via opt-in on newer
   GPUs, e.g. 96 KB on V100, up to 163 KB on A100)
3. Registers per thread ≤ 255
═════════════════════════════════════════════════════════════
```
Memory Optimization
Coalesced Memory Access
```cuda
// Good: in each iteration, consecutive threads touch consecutive
// float4s, so the warp's accesses coalesce into wide transactions
for (int i = 0; i < N; i += blockDim.x * 4) {
    float4 val = load_float4(&data[i + threadIdx.x * 4]);
}

// Bad: each thread walks its own contiguous chunk, so within a warp
// the addresses are strided by N and the accesses do not coalesce
for (int i = 0; i < N; i++) {
    float val = data[threadIdx.x * N + i];
}
```
Bank Conflict Avoidance
```cuda
// Bank conflict: with a row stride of 32 words, column 0 of every
// row maps to bank 0
__shared__ float smem[32][32];
float val = smem[threadIdx.x][0];  // 32-way conflict: all threads hit bank 0

// No bank conflict: padding the stride to 33 shifts each row by one bank
__shared__ float smem_padded[32][33];
float val2 = smem_padded[threadIdx.x][0];  // rows land in distinct banks
```
Vectorized Loads
```cuda
__device__ __forceinline__ float4 load_float4(const float* ptr) {
    return *reinterpret_cast<const float4*>(ptr);
}
```
GPU Architecture Specific
| Architecture | Compute Capability | Key Features |
|---|---|---|
| Volta | 7.0 | Independent thread scheduling |
| Turing | 7.5 | Tensor Cores |
| Ampere | 8.0 | Async copy (cp.async) |
| Ada | 8.9 | Improved Tensor Cores |
| Hopper | 9.0 | Transformer Engine |
Ampere Async Copy (SM 8.0+)
```cuda
#include <cuda_pipeline.h>

#if __CUDA_ARCH__ >= 800
__device__ void async_copy(float* smem, const float* gmem, int bytes) {
    // Note: each __pipeline_memcpy_async call copies 4, 8, or 16 bytes;
    // larger transfers are issued as a loop of such calls.
    __pipeline_memcpy_async(smem, gmem, bytes);
    __pipeline_commit();
    __pipeline_wait_prior(0);
}
#endif
```
NVIDIA Nsight Compute
```bash
# Detailed kernel analysis
ncu --set full -o report ./benchmark

# Key metrics
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,memory__throughput.avg.pct_of_peak_sustained_elapsed,launch__occupancy ./benchmark
```
Key Metrics
| Metric | Description | Healthy Value |
|---|---|---|
| Occupancy | SM utilization | > 50% |
| Memory Throughput | Bandwidth utilization | > 80% |
| Compute Throughput | Compute-unit utilization | > 70% |
| L1/TEX Hit Rate | L1 cache hit rate | > 90% |
Best Practices
FAQ
Q: Why is GEMM slow for small matrices?
Reason: GPU parallelism is not fully utilized, and kernel launch overhead becomes significant.
Solution: Use batched GEMM, reduce the block size, or fall back to the CPU for matrices smaller than 128.
Q: Why is double buffering sometimes slower?
Reason: It increases shared memory usage, which may reduce occupancy.
Solution: Check shared memory usage; skip double buffering for small matrices.
Q: FP16 or FP32?
Rule of thumb:
- Inference: FP16 weights + FP32 accumulation
- Training: FP32
- Precision-critical workloads: FP32
*Last Updated: 2025-04-16*
*Document Version: v1.1.0*