Kernel 1: Naïve Implementation
The simplest approach — each thread computes one output element
Algorithm
For matrix multiplication C = A × B where:
- A is M × K
- B is K × N
- C is M × N
Each thread computes:
C[row, col] = Σ A[row, k] × B[k, col] for k = 0 to K-1
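As a concrete reference, here is a minimal CPU implementation of the same computation (the function name `sgemm_reference` is hypothetical; it is handy later for validating the kernel's output):

```cpp
// Reference SGEMM on the CPU: C[row, col] = Σ A[row, k] × B[k, col].
// All matrices are row-major, as assumed throughout this walkthrough.
void sgemm_reference(const float* A, const float* B, float* C,
                     int M, int N, int K) {
    for (int row = 0; row < M; ++row) {
        for (int col = 0; col < N; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < K; ++k) {
                sum += A[row * K + k] * B[k * N + col];
            }
            C[row * N + col] = sum;
        }
    }
}
```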
Thread Mapping

Grid: (N + 15) / 16 × (M + 15) / 16 blocks
Block: 16 × 16 threads
Thread (tx, ty) in block (bx, by) computes:
row = by × 16 + ty
col = bx × 16 + tx
C[row, col] if row < M and col < N
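In host code, this mapping corresponds to a launch configuration along these lines (a sketch; `d_A`, `d_B`, `d_C` are assumed device pointers, and the kernel is the one dissected in the Code Walkthrough below):

```cpp
dim3 block(16, 16);                        // 16 × 16 threads per block
dim3 grid((N + block.x - 1) / block.x,     // = (N + 15) / 16 blocks in x (columns)
          (M + block.y - 1) / block.y);    // = (M + 15) / 16 blocks in y (rows)
sgemm_naive_kernel<<<grid, block>>>(d_A, d_B, d_C, M, N, K);
```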
Memory Access Pattern

The Problem: Uncoalesced Access
```
Thread (0,0) reads: A[0,0], A[0,1], A[0,2], ...  → CONSECUTIVE ✓
                    B[0,0], B[1,0], B[2,0], ...  → STRIDE-N ✗  (linear offsets 0, N, 2N)
Thread (1,0) reads: A[0,0], A[0,1], A[0,2], ...  → Same elements as thread (0,0)!
                    B[0,1], B[1,1], B[2,1], ...  → STRIDE-N ✗  (linear offsets 1, N+1, 2N+1)
```

When reading matrix B, each thread's successive loads are separated by N floats (stride-N access). This causes:
- Memory request serialization — GPU must issue separate loads
- Cache inefficiency — loaded data isn't shared between threads
- ~12.5% bandwidth utilization — vs 100% with coalesced access
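To put numbers on this, a small host-side sketch of the per-thread stride (N = 4096 is an assumed example width):

```cpp
#include <cstdio>

int main() {
    // Distance between one thread's successive loads from B (row-major):
    // B[(k + 1) * N + col] - B[k * N + col] = N floats.
    const int N = 4096;                               // assumed example width
    const size_t stride = (size_t)N * sizeof(float);
    printf("per-iteration stride: %zu bytes\n", stride);  // 16384 bytes = 16 KB
    // Every iteration touches a different cache line, so each 32-byte memory
    // transaction delivers only the 4 bytes actually used: 4 / 32 = 12.5%.
    return 0;
}
```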
Visual Representation

```
Memory Layout:
A (row-major): [0,0] [0,1] [0,2] ... [1,0] [1,1] ...
B (row-major): [0,0] [0,1] [0,2] ... [1,0] [1,1] ...
                ↑
Each thread needs B[k, col] for k = 0, 1, 2, ...
Successive k values are N elements apart: NOT consecutive in memory!
```

Code Walkthrough
1. Kernel Signature
```cpp
__global__ void sgemm_naive_kernel(
    const float* A, const float* B, float* C,
    int M, int N, int K)
```

- `__global__` — CUDA kernel launchable from CPU
- Pointers marked `const` for the read-only matrices A and B
- Dimensions passed as `int` (matrices assumed row-major)
2. Thread Indexing
```cpp
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
```

- Uses a 2D grid/block structure for a natural matrix mapping
- `blockIdx` identifies which tile of the output
- `threadIdx` identifies the position within the tile
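A quick worked example of the mapping, plugging in assumed values:

```cpp
// Assumed values: blockDim = (16, 16), blockIdx = (1, 2), threadIdx = (3, 4)
// row = blockIdx.y * blockDim.y + threadIdx.y = 2 * 16 + 4 = 36
// col = blockIdx.x * blockDim.x + threadIdx.x = 1 * 16 + 3 = 19
// → this thread computes C[36, 19]
```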
3. Bounds Checking
```cpp
if (row < M && col < N) {
```

Essential when the matrix dimensions are not evenly divisible by the block size; it prevents out-of-bounds writes.
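For example (assumed sizes):

```cpp
// Assumed sizes: M = N = 1000 with 16 × 16 blocks.
// grid = ((1000 + 15) / 16, (1000 + 15) / 16) = (63, 63) blocks
// → 63 * 16 = 1008 threads per dimension, but C is only 1000 × 1000,
//   so the last 8 threads in each dimension must skip the write.
```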
4. Inner Product Computation
```cpp
float sum = 0.0f;
for (int k = 0; k < K; ++k) {
    sum += A[row * K + k] * B[k * N + col];
}
```

- Accumulator kept in a register (`float sum`)
- Single loop over the K dimension
- Row of A × column of B
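Assembled from the fragments above, the complete kernel is short. Note that the final store to `C[row * N + col]` is implied by the walkthrough but not shown in the fragments, so this is a sketch of the full body:

```cpp
__global__ void sgemm_naive_kernel(
    const float* A, const float* B, float* C,
    int M, int N, int K)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; ++k) {
            sum += A[row * K + k] * B[k * N + col];  // row of A × column of B
        }
        C[row * N + col] = sum;                      // one output element per thread
    }
}
```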
Learning Checkpoints
After understanding this kernel, you should be able to:
- [x] Explain CUDA's thread hierarchy (grid/block/thread)
- [x] Identify uncoalesced memory access patterns
- [x] Calculate arithmetic intensity for matrix multiply
- [x] Understand why this kernel is memory-bound
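For the last two checkpoints, a quick arithmetic-intensity estimate for this kernel (assuming no data reuse, which matches its lack of tiling or shared memory):

- FLOPs per output element: 2K (one multiply and one add per iteration)
- Global memory traffic per output element: 2K floats loaded = 8K bytes
- Arithmetic intensity: 2K / 8K = 0.25 FLOP/byte

Modern GPUs need on the order of 10+ FLOPs per byte of DRAM traffic to reach peak compute, so at 0.25 FLOP/byte this kernel is firmly memory-bound.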