The simplest approach — each thread computes one output element
Overview
The naïve kernel is our baseline implementation. It follows the most straightforward approach to matrix multiplication: each CUDA thread is responsible for computing exactly one element of the output matrix C.
Learning Goal
Understand the basic CUDA programming model and identify why this "obvious" approach performs poorly on GPUs.
Algorithm
For matrix multiplication C = A × B where:
A is M × K
B is K × N
C is M × N
Each thread computes:
C[row, col] = Σ A[row, k] × B[k, col] for k = 0 to K-1
Thread Mapping
```
Grid:  (N + 15) / 16  ×  (M + 15) / 16 blocks
Block: 16 × 16 threads

Thread (tx, ty) in block (bx, by) computes:
    row = by × 16 + ty
    col = bx × 16 + tx
    C[row, col]  (only if row < M and col < N)
```
```cpp
// File: src/kernels/naive_sgemm.cuh
__global__ void sgemm_naive_kernel(const float* A, const float* B, float* C,
                                   int M, int N, int K) {
    // Calculate global row and column
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Bounds check
    if (row < M && col < N) {
        float sum = 0.0f;

        // Compute dot product of row and column
        for (int k = 0; k < K; ++k) {
            sum += A[row * K + k] * B[k * N + col];
        }

        // Write result
        C[row * N + col] = sum;
    }
}
```
When reading matrix B, each thread's successive inner-loop iterations access elements separated by N floats: iteration k loads B[k × N + col], one full row of B below the previous load. This stride-N pattern causes:
Poor cache-line utilization: of each line a thread pulls in, it uses only one float on that iteration
No data reuse: without shared-memory staging, every element is fetched from global memory once per use
~12.5% effective bandwidth utilization, versus ~100% for an ideal, fully reused access pattern
Visual Representation
```
Memory Layout (row-major):
A: [0,0] [0,1] [0,2] ... [0,K-1] [1,0] [1,1] ...
B: [0,0] [0,1] [0,2] ... [0,N-1] [1,0] [1,1] ...

A single thread's inner loop needs B[0, col], B[1, col], B[2, col], ...
Those elements are N floats apart: NOT consecutive in memory!
```
Performance Characteristics
| Metric | Value | Analysis |
|--------|-------|----------|
| GFLOPS (1024³) | ~604 | Baseline |
| Memory Access | Strided for B (per thread) | Poor cache-line utilization in the inner loop |
| Data Reuse | None | Each element read from global memory once per use |
| Arithmetic Intensity | 2 FLOPs / 8 bytes (0.25 FLOP/byte) | Memory-bound |
Roofline Position
Located deep in the memory-bound region: performance is limited by memory bandwidth, not compute capability.