Kernel 2: Tiled Implementation
Shared memory blocking for better data reuse
The Problem with Naïve
In the naïve kernel:
- To compute one row of C, we read that row of A N times
- To compute one column of C, we read that column of B M times
┌────────────────────────────────────────────────────┐
│ C[row, :]  (one row of C)                          │
├────────────────────────────────────────────────────┤
│ C[row, 0] = A[row, :] × B[:, 0]   ← read A row     │
│ C[row, 1] = A[row, :] × B[:, 1]   ← read SAME row! │
│ C[row, 2] = A[row, :] × B[:, 2]   ← and again!     │
└────────────────────────────────────────────────────┘
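A rough count makes the cost concrete: the naïve kernel issues about 2 × M × N × K global loads (one element of A and one of B per multiply), and tiling cuts that by roughly a factor of TILE_SIZE, because each tile is loaded from global memory once per block instead of once per output element. A small illustrative calculation (example sizes only):

```cpp
#include <cstdio>

// Back-of-the-envelope global-load counts for C = A × B (illustrative only).
int main() {
    const long long M = 4096, N = 4096, K = 4096, TILE = 32;

    long long naive_loads = 2 * M * N * K;        // every product re-reads A and B
    long long tiled_loads = naive_loads / TILE;   // each tile loaded once per block

    printf("naive: %lld float loads (~%.0f GB)\n", naive_loads, naive_loads * 4e-9);
    printf("tiled: %lld float loads (~%.0f GB)\n", tiled_loads, tiled_loads * 4e-9);
    return 0;
}
```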
Implementation

```cpp
// File: src/kernels/tiled_sgemm.cuh
template<int TILE_SIZE = 32>
__global__ void sgemm_tiled_kernel(
    const float* A, const float* B, float* C,
    int M, int N, int K)
{
    // Shared memory tiles
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    // Thread indices within the block
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Global position of this thread's output element
    int row = blockIdx.y * TILE_SIZE + ty;
    int col = blockIdx.x * TILE_SIZE + tx;

    float sum = 0.0f;
    int num_tiles = (K + TILE_SIZE - 1) / TILE_SIZE;

    // Loop over tiles along the K dimension
    for (int t = 0; t < num_tiles; ++t) {
        // Calculate tile positions
        int a_col = t * TILE_SIZE + tx;
        int b_row = t * TILE_SIZE + ty;

        // Load tile from A (coalesced); zero-pad out-of-bounds elements
        if (row < M && a_col < K)
            As[ty][tx] = A[row * K + a_col];
        else
            As[ty][tx] = 0.0f;

        // Load tile from B (coalesced); zero-pad out-of-bounds elements
        if (b_row < K && col < N)
            Bs[ty][tx] = B[b_row * N + col];
        else
            Bs[ty][tx] = 0.0f;

        __syncthreads(); // Wait for all loads to complete

        // Accumulate the partial dot product for this tile
        for (int k = 0; k < TILE_SIZE; ++k) {
            sum += As[ty][k] * Bs[k][tx];
        }

        __syncthreads(); // Wait for all threads to finish computing
    }

    // Write result
    if (row < M && col < N) {
        C[row * N + col] = sum;
    }
}
```
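For completeness, here is a minimal host-side launch sketch (the wrapper and pointer names are illustrative; it assumes d_A, d_B, and d_C are device buffers that have already been allocated and filled):

```cpp
#include "tiled_sgemm.cuh"   // the kernel shown above

// One thread per element of C: TILE×TILE threads per block,
// grid rounded up so every element of the M×N output is covered.
constexpr int TILE = 32;

void launch_sgemm_tiled(const float* d_A, const float* d_B, float* d_C,
                        int M, int N, int K)
{
    dim3 block(TILE, TILE);               // 32 × 32 = 1024 threads
    dim3 grid((N + TILE - 1) / TILE,      // x: tile columns of C
              (M + TILE - 1) / TILE);     // y: tile rows of C
    sgemm_tiled_kernel<TILE><<<grid, block>>>(d_A, d_B, d_C, M, N, K);
}
```

Note that blockIdx.x advances along columns and blockIdx.y along rows, matching how row and col are derived inside the kernel.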
Memory Architecture

┌──────────────┐      ┌─────────────────────────────┐
│    Global    │      │  Shared Memory              │
│    Memory    │─────▶│  (per thread block)         │
│    (slow)    │      │  ┌─────────┐    ┌─────────┐ │
└──────────────┘      │  │ As[][]  │    │ Bs[][]  │ │
                      │  │TILE×TILE│    │TILE×TILE│ │
                      │  └────┬────┘    └────┬────┘ │
                      └───────┼──────────────┼──────┘
                              ▼              ▼
                            Compute (registers)
                                     │
                                     ▼
                              Write to Global

Synchronization Points
Two __syncthreads() barriers are critical:
- After loading tiles: Ensures all data is in shared memory before any thread starts computing
- After computing: Prevents threads from overwriting shared memory while others are still reading
```cpp
__syncthreads(); // Load complete
// Compute phase...
__syncthreads(); // Compute complete, safe to load next tile
```

Common Bug
Missing either __syncthreads() causes a race condition: some threads read garbage data, or start overwriting shared memory before other threads have finished reading it.
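To make the failure mode concrete, here is a deliberately broken sketch with the second barrier removed (illustrative only; it assumes M, N, and K are multiples of TILE_SIZE so the bounds checks can be omitted):

```cpp
// BROKEN on purpose: the barrier after the compute phase is missing.
template<int TILE_SIZE>
__global__ void sgemm_tiled_broken(const float* A, const float* B, float* C,
                                   int M, int N, int K)
{
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < K / TILE_SIZE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * K + t * TILE_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        __syncthreads();   // barrier 1: every tile element is loaded

        for (int k = 0; k < TILE_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];

        // Barrier 2 is missing here: fast threads wrap around and overwrite
        // As/Bs with tile t+1 while slower threads in the same block are
        // still reading tile t, so they accumulate corrupted values.
    }

    C[row * N + col] = sum;
}
```

NVIDIA's Compute Sanitizer (compute-sanitizer --tool racecheck) can flag exactly this kind of shared-memory hazard if you are unsure whether a barrier is missing.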
Tile Size Selection

| TILE_SIZE | Shared Memory per Block | Occupancy | Performance |
|---|---|---|---|
| 16 | 2 KB | High | Lower (less reuse) |
| 32 | 8 KB | Medium | Good balance |
| 64 | 32 KB | N/A | Needs a 64×64 = 4096-thread block, over the 1024-thread limit |
The default TILE_SIZE = 32 uses 8 KB of shared memory per block, which fits comfortably in the typical 48-64 KB available per SM and leaves room for several resident blocks.
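To sanity-check a candidate tile size against your own GPU, compare the per-block footprint (two TILE×TILE float tiles) with the device's shared-memory limit. A small host-side sketch (the helper name is illustrative):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Shared memory used per block: two TILE×TILE tiles of floats.
constexpr size_t tile_smem_bytes(int tile) {
    return 2ull * tile * tile * sizeof(float);   // TILE = 32 -> 8192 bytes (8 KB)
}

int main() {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    printf("shared memory per block (device): %zu KB\n", prop.sharedMemPerBlock / 1024);
    printf("needed for TILE_SIZE = 32:        %zu KB\n", tile_smem_bytes(32) / 1024);
    return 0;
}
```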
Key Takeaways
- Shared Memory is ~100× faster than global memory
- Tiling reduces global memory bandwidth by reusing data
- Coalesced Access is achieved when consecutive threads read consecutive addresses
- Synchronization is required when threads share data
- Template Parameters allow compile-time tile size selection