Kernel 4: Double Buffer
Overlapping memory loads with computation
The Problem: Sequential Execution
In the tiled kernel:
Timeline:
Load Tile 0 ──────────────────▶
Compute Tile 0 ───────────────▶
Load Tile 1 ───▶
Compute Tile 1 ──▶
Problem: the ALUs are idle during loads, and the memory system is idle during compute.
Execution Timeline
Without Double Buffering:
┌────────────────────────────────────────────────────────────┐
│ Cycle: 0 100 200 300 400 500 600 700 │
│ │
│ Load0 [████████████] │
│ ↓ Idle │
│ Comp0 [████████████] │
│ ↓ │
│ Load1 [████████████] │
│ ↓ │
│ Comp1 [████████████] │
│ │
│ ALU Utilization: ████░░░░████░░░░████░░░░ (~50%) │
└────────────────────────────────────────────────────────────┘
Shared Memory Layout
```cpp
// Single buffer (before)
__shared__ float As[TILE_SIZE][TILE_SIZE + 1];
__shared__ float Bs[TILE_SIZE][TILE_SIZE + 1];

// Double buffer (after)
__shared__ float As[2][TILE_SIZE][TILE_SIZE + 1]; // [2] for ping-pong
__shared__ float Bs[2][TILE_SIZE][TILE_SIZE + 1];
```
Trade-off: 2× shared memory usage for latency hiding.
Key Concepts
1. Buffer Indexing
```cpp
int curr = t % 2;       // 0, 1, 0, 1, 0, 1, ...
int next = (t + 1) % 2; // 1, 0, 1, 0, 1, 0, ...
```
The modulo operator creates the ping-pong pattern.
2. Overlap Requirements
For effective overlap, the compute time should be ≈ load time:
Compute Time = TILE_SIZE³ / (threads × FLOPs/cycle)
Load Time = TILE_SIZE² × 2 × sizeof(float) / bandwidth
Ideal: TILE_SIZE chosen so these are balanced.
3. Synchronization
Only one __syncthreads() is needed per iteration:
- Ensures all threads finish computing current tile
- Ensures all threads finish loading next tile
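Putting the pieces together, the main loop might be structured as below. This is a sketch under stated assumptions, not the kernel's exact source: `loadTile` is a hypothetical helper standing in for the per-thread global-to-shared copy, and `numTiles` and the accumulation shape are assumed:

```cuda
__shared__ float As[2][TILE_SIZE][TILE_SIZE + 1];
__shared__ float Bs[2][TILE_SIZE][TILE_SIZE + 1];

// Preload tile 0 into buffer 0 before the loop starts.
loadTile(As[0], Bs[0], /*tile=*/0);  // hypothetical helper
__syncthreads();

float acc = 0.0f;
for (int t = 0; t < numTiles; ++t) {
    int curr = t % 2;
    int next = (t + 1) % 2;

    // Issue loads for the next tile. No barrier needed first: they
    // write buffer `next`, which no thread reads this iteration.
    if (t + 1 < numTiles)
        loadTile(As[next], Bs[next], t + 1);

    // Compute on the current tile while those loads are in flight.
    for (int k = 0; k < TILE_SIZE; ++k)
        acc += As[curr][threadIdx.y][k] * Bs[curr][k][threadIdx.x];

    // The single barrier: all threads are done reading `curr`
    // and done writing `next` before roles swap.
    __syncthreads();
}
```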
Performance Characteristics
| Metric | Bank-Free | Double Buffer | Improvement |
|---|---|---|---|
| GFLOPS (1024³) | 673 | 701 | +4% |
| Shared Memory | 8.4 KB | 16.8 KB | 2× |
| Register Pressure | Low | Moderate | — |
| Occupancy | Higher | Lower | Trade-off |
Note
The performance improvement (~4%) is modest because modern GPUs have effective memory latency hiding through warp scheduling. Double buffering becomes more impactful on memory-bound kernels with minimal compute.
When Double Buffering Helps Most
| Scenario | Benefit |
|---|---|
| Very memory-bound kernels | High |
| Large TILE_SIZE | High (more compute to overlap) |
| Smaller GPUs (fewer warps) | Higher (less natural latency hiding) |
| Compute-heavy kernels | Low (already compute-bound) |
Key Takeaways
- Double Buffer: Two buffers alternate between load and compute roles
- Overlap: Hide memory latency by computing while loading
- Ping-Pong: Use t % 2 to alternate buffer indices
- Trade-off: 2× shared memory for better latency hiding
- Synchronization: Single barrier handles both operations