Kernel 1: Naïve Implementation
The simplest approach — each thread computes one output element
Algorithm
For matrix multiplication C = A × B where:
- A is M × K
- B is K × N
- C is M × N
Each thread computes:
C[row, col] = Σ A[row, k] × B[k, col] for k = 0 to K-1
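As a concrete reference, here is a minimal CPU implementation of the same computation (the function name `sgemm_reference` is hypothetical; it is handy later for validating the kernel's output):

```cpp
// Reference SGEMM on the CPU: C[row, col] = Σ A[row, k] × B[k, col].
// All matrices are row-major, as assumed throughout this walkthrough.
void sgemm_reference(const float* A, const float* B, float* C,
                     int M, int N, int K) {
    for (int row = 0; row < M; ++row) {
        for (int col = 0; col < N; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < K; ++k) {
                sum += A[row * K + k] * B[k * N + col];
            }
            C[row * N + col] = sum;
        }
    }
}
```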
Thread Mapping

Grid: (N + 15) / 16 × (M + 15) / 16 blocks
Block: 16 × 16 threads
Thread (tx, ty) in block (bx, by) computes:
row = by × 16 + ty
col = bx × 16 + tx
C[row, col] if row < M and col < N
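In host code, this mapping corresponds to a launch configuration along these lines (a sketch; `d_A`, `d_B`, `d_C` are assumed device pointers, and the kernel is the one dissected in the Code Walkthrough below):

```cpp
dim3 block(16, 16);                        // 16 × 16 threads per block
dim3 grid((N + block.x - 1) / block.x,     // = (N + 15) / 16 blocks in x (columns)
          (M + block.y - 1) / block.y);    // = (M + 15) / 16 blocks in y (rows)
sgemm_naive_kernel<<<grid, block>>>(d_A, d_B, d_C, M, N, K);
```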
Memory Access Pattern

The Problem: Uncoalesced Access
```
Thread (0,0) reads: A[0,0], A[0,1], A[0,2], ...  → CONSECUTIVE ✓
                    B[0,0], B[1,0], B[2,0], ...  → STRIDE-N ✗  (linear offsets 0, N, 2N)
Thread (1,0) reads: A[0,0], A[0,1], A[0,2], ...  → Same elements as thread (0,0)!
                    B[0,1], B[1,1], B[2,1], ...  → STRIDE-N ✗  (linear offsets 1, N+1, 2N+1)
```

When reading matrix B, each thread's successive loads are separated by N floats (stride-N access). This causes:
- Memory request serialization — GPU must issue separate loads
- Cache inefficiency — loaded data isn't shared between threads
- ~12.5% bandwidth utilization — vs 100% with coalesced access
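To put numbers on this, a small host-side sketch of the per-thread stride (N = 4096 is an assumed example width):

```cpp
#include <cstdio>

int main() {
    // Distance between one thread's successive loads from B (row-major):
    // B[(k + 1) * N + col] - B[k * N + col] = N floats.
    const int N = 4096;                               // assumed example width
    const size_t stride = (size_t)N * sizeof(float);
    printf("per-iteration stride: %zu bytes\n", stride);  // 16384 bytes = 16 KB
    // Every iteration touches a different cache line, so each 32-byte memory
    // transaction delivers only the 4 bytes actually used: 4 / 32 = 12.5%.
    return 0;
}
```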
Visual Representation

```
Memory Layout:
A (row-major): [0,0] [0,1] [0,2] ... [1,0] [1,1] ...
B (row-major): [0,0] [0,1] [0,2] ... [1,0] [1,1] ...
                ↑
Each thread needs B[k, col] for k = 0, 1, 2, ...
Successive k values are N elements apart: NOT consecutive in memory!
```

Code Walkthrough
1. Kernel Signature
```cpp
__global__ void sgemm_naive_kernel(
    const float* A, const float* B, float* C,
    int M, int N, int K)
```

- `__global__` — CUDA kernel launchable from CPU
- Pointers marked `const` for the read-only matrices A and B
- Dimensions passed as `int` (matrices assumed row-major)
2. Thread Indexing
```cpp
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
```

- Uses a 2D grid/block structure for a natural matrix mapping
- `blockIdx` identifies which tile of the output
- `threadIdx` identifies the position within the tile
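A quick worked example of the mapping, plugging in assumed values:

```cpp
// Assumed values: blockDim = (16, 16), blockIdx = (1, 2), threadIdx = (3, 4)
// row = blockIdx.y * blockDim.y + threadIdx.y = 2 * 16 + 4 = 36
// col = blockIdx.x * blockDim.x + threadIdx.x = 1 * 16 + 3 = 19
// → this thread computes C[36, 19]
```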
3. Bounds Checking
```cpp
if (row < M && col < N) {
```

Essential when the matrix dimensions are not evenly divisible by the block size; it prevents out-of-bounds writes.
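For example (assumed sizes):

```cpp
// Assumed sizes: M = N = 1000 with 16 × 16 blocks.
// grid = ((1000 + 15) / 16, (1000 + 15) / 16) = (63, 63) blocks
// → 63 * 16 = 1008 threads per dimension, but C is only 1000 × 1000,
//   so the last 8 threads in each dimension must skip the write.
```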
4. Inner Product Computation
```cpp
float sum = 0.0f;
for (int k = 0; k < K; ++k) {
    sum += A[row * K + k] * B[k * N + col];
}
```

- Accumulator kept in a register (`float sum`)
- Single loop over the K dimension
- Row of A × column of B
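Assembled from the fragments above, the complete kernel is short. Note that the final store to `C[row * N + col]` is implied by the walkthrough but not shown in the fragments, so this is a sketch of the full body:

```cpp
__global__ void sgemm_naive_kernel(
    const float* A, const float* B, float* C,
    int M, int N, int K)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; ++k) {
            sum += A[row * K + k] * B[k * N + col];  // row of A × column of B
        }
        C[row * N + col] = sum;                      // one output element per thread
    }
}
```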
Learning Checkpoints
After understanding this kernel, you should be able to:
- [x] Explain CUDA's thread hierarchy (grid/block/thread)
- [x] Identify uncoalesced memory access patterns
- [x] Calculate arithmetic intensity for matrix multiply
- [x] Understand why this kernel is memory-bound
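For the last two checkpoints, a quick arithmetic-intensity estimate for this kernel (assuming no data reuse, which matches its lack of tiling or shared memory):

- FLOPs per output element: 2K (one multiply and one add per iteration)
- Global memory traffic per output element: 2K floats loaded = 8K bytes
- Arithmetic intensity: 2K / 8K = 0.25 FLOP/byte

Modern GPUs need on the order of 10+ FLOPs per byte of DRAM traffic to reach peak compute, so at 0.25 FLOP/byte this kernel is firmly memory-bound.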