NVIDIA Tensor Cores are specialized units that perform mixed-precision matrix multiply-accumulate operations. A single instruction can compute a 4×4×4 or larger matrix operation, achieving roughly 8× the theoretical peak throughput of CUDA cores (about 3-4× in practice).
Key Insight
Tensor Cores use FP16 inputs with FP32 accumulation. This mixed precision provides significant speedup while maintaining accuracy for most deep learning and HPC workloads.
Tensor Core Architecture
Hardware Capabilities
| Generation | Architecture | Operations/Cycle | Precision |
|---|---|---|---|
| Volta (V100) | sm_70 | 64 FMA | FP16/FP32 |
| Turing (RTX 20) | sm_75 | 64 FMA | FP16/INT8/INT32 |
| Ampere (A100/RTX 30) | sm_80/sm_86 | 256 FMA | FP16/BF16/TF32 |
| Ada (RTX 40) | sm_89 | 512 FMA | FP16/BF16/TF32 |
| Hopper (H100) | sm_90 | 1024 FMA | FP8/FP16/BF16 |
WMMA Fragment Size
```
Warp Matrix Multiply Accumulate (WMMA):

Fragment A: 16×16 FP16 matrix (row-major)
Fragment B: 16×16 FP16 matrix (row-major)
Fragment C: 16×16 FP32 matrix (row-major)
          ↓
     D = A × B + C
          ↓
Fragment D: 16×16 FP32 matrix
```
One warp (32 threads) collaborates on one 16×16×16 operation.
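As a point of reference, here is the arithmetic that one such tile operation performs, written as plain scalar code (float throughout for clarity; the hardware takes FP16 inputs and accumulates in FP32):

```cpp
const int T = 16;  // WMMA tile edge

// Scalar reference for one 16×16×16 tile operation: D = A × B + C (row-major)
void mma_reference(const float* A, const float* B, const float* C, float* D) {
    for (int i = 0; i < T; ++i)
        for (int j = 0; j < T; ++j) {
            float acc = C[i * T + j];            // start from the accumulator
            for (int k = 0; k < T; ++k)
                acc += A[i * T + k] * B[k * T + j];
            D[i * T + j] = acc;
        }
}
```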
```cpp
// Initialize accumulator to zero
fill_fragment(c_frag, 0.0f);

// Load data from global/shared memory
load_matrix_sync(a_frag, A_ptr, lda);
load_matrix_sync(b_frag, B_ptr, ldb);

// Perform matrix multiply-accumulate
mma_sync(d_frag, a_frag, b_frag, c_frag);

// Store result
store_matrix_sync(D_ptr, d_frag, ldd, mem_row_major);
```
```cpp
// File: src/kernels/tensor_core_sgemm.cuh
#include <mma.h>
using namespace nvcuda::wmma;

// WMMA uses 16×16 tiles
const int WMMA_M = 16;
const int WMMA_N = 16;
const int WMMA_K = 16;

__global__ void sgemm_tensor_core_kernel(const half* __restrict__ A,
                                         const half* __restrict__ B,
                                         float* __restrict__ C,
                                         int M, int N, int K) {
    // Each warp computes one 16×16 output tile. Warps are 32 threads wide
    // along x, so only the x index is divided by warpSize.
    int warpM = blockIdx.y * blockDim.y + threadIdx.y;
    int warpN = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;

    // Check if this warp has work
    if (warpM * WMMA_M >= M || warpN * WMMA_N >= N) return;

    // Declare fragments
    fragment<matrix_a, WMMA_M, WMMA_N, WMMA_K, half, row_major> a_frag;
    fragment<matrix_b, WMMA_M, WMMA_N, WMMA_K, half, row_major> b_frag;
    fragment<accumulator, WMMA_M, WMMA_N, WMMA_K, float> acc_frag;

    // Initialize accumulator to zero
    fill_fragment(acc_frag, 0.0f);

    // Iterate over K dimension in 16-element chunks
    for (int k = 0; k < K; k += WMMA_K) {
        // Calculate pointers for this tile
        const half* a_ptr = A + warpM * WMMA_M * K + k;
        const half* b_ptr = B + k * N + warpN * WMMA_N;

        // Load fragments
        load_matrix_sync(a_frag, a_ptr, K);
        load_matrix_sync(b_frag, b_ptr, N);

        // Perform MMA
        mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }

    // Store result
    float* c_ptr = C + warpM * WMMA_M * N + warpN * WMMA_N;
    store_matrix_sync(c_ptr, acc_frag, N, mem_row_major);
}
```
FP32 → FP16 Conversion
Since our Tensor Core path takes FP16 inputs, the FP32 matrices must be converted first:
```cpp
// FP32 to FP16 conversion kernel
__global__ void convert_fp32_to_fp16(const float* __restrict__ in,
                                     half* __restrict__ out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = __float2half(in[idx]);
    }
}

// Host function to launch Tensor Core kernel
void launch_tensor_core_sgemm(const float* A_fp32, const float* B_fp32,
                              float* C_fp32, int M, int N, int K,
                              cudaStream_t stream) {
    // Check alignment for WMMA
    bool aligned = (M % 16 == 0) && (N % 16 == 0) && (K % 16 == 0);
    if (!aligned) {
        // Fall back to FP32 tiled kernel for non-aligned sizes
        // (16×16 thread blocks, matching the tiled kernel's tile size)
        dim3 fb_block(16, 16);
        dim3 fb_grid((N + 15) / 16, (M + 15) / 16);
        sgemm_tiled<<<fb_grid, fb_block, 0, stream>>>(A_fp32, B_fp32, C_fp32,
                                                      M, N, K);
        return;
    }

    // Allocate temporary FP16 buffers via RAII wrappers
    DeviceMemory<half> A_fp16(M * K);
    DeviceMemory<half> B_fp16(K * N);

    // Convert inputs
    int threads = 256;
    int blocks_A = (M * K + threads - 1) / threads;
    int blocks_B = (K * N + threads - 1) / threads;
    convert_fp32_to_fp16<<<blocks_A, threads, 0, stream>>>(
        A_fp32, A_fp16.get(), M * K);
    convert_fp32_to_fp16<<<blocks_B, threads, 0, stream>>>(
        B_fp32, B_fp16.get(), K * N);

    // Launch WMMA kernel: 64 threads = 2 warps per block; each warp owns one
    // 16×16 output tile (2 tiles along M per block, 1 tile along N)
    dim3 block(32, 2);
    dim3 grid((N + WMMA_N - 1) / WMMA_N,
              (M + 2 * WMMA_M - 1) / (2 * WMMA_M));
    sgemm_tensor_core_kernel<<<grid, block, 0, stream>>>(
        A_fp16.get(), B_fp16.get(), C_fp32, M, N, K);

    // No manual cleanup: the RAII wrappers release device memory automatically
}
```
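The alignment check and grid sizing above boil down to two bits of integer arithmetic, isolated here as a host-side sketch (the helper names are ours, not part of any API):

```cpp
const int WMMA_M = 16, WMMA_N = 16, WMMA_K = 16;

// Round-up integer division, as used for block and grid counts
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// WMMA needs every GEMM dimension to be a multiple of the 16-wide tile
bool wmma_aligned(int M, int N, int K) {
    return M % WMMA_M == 0 && N % WMMA_N == 0 && K % WMMA_K == 0;
}
```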
Mixed Precision Considerations
Accuracy Trade-off
```
FP32 precision: ~7 decimal digits
FP16 precision: ~3 decimal digits

Conversion FP32 → FP16 introduces quantization error,
but FP32 accumulation maintains precision for the sum.
```
Each warp’s 32 threads hold the 16×16 matrix collaboratively:
```
16×16 matrix distributed across 32 threads:

Thread Layout (8 rows × 4 columns of threads):
┌───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │  Thread row 0
├───┼───┼───┼───┤
│ 4 │ 5 │ 6 │ 7 │  Thread row 1
├───┼───┼───┼───┤
│...│   │   │   │
└───┴───┴───┴───┘

Each thread holds 8 FP16 values (2 rows × 4 columns of the matrix).
```
The exact mapping is hardware-defined and managed by WMMA APIs.
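The layout in the diagram can be sanity-checked with a few lines of host code. Note this is the *illustrative* mapping only; as stated above, the real mapping is hardware-defined and hidden behind the WMMA API:

```cpp
// Illustrative ownership map (NOT the real hardware layout): threads arranged
// 8×4, each owning a 2-row × 4-column patch of the 16×16 fragment.
int owner_of(int row, int col) {
    int thread_row = row / 2;   // 16 rows / 2 rows per thread = 8 thread rows
    int thread_col = col / 4;   // 16 cols / 4 cols per thread = 4 thread cols
    return thread_row * 4 + thread_col;   // lane id in [0, 32)
}
```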
Architecture Guards
Use conditional compilation for different GPU generations:
```cpp
#if __CUDA_ARCH__ >= 700
  // WMMA available (Volta and newer)
  #include <mma.h>
  // Tensor Core implementation
#else
  // Fallback to FP32 kernel:
  // compute capability < 7.0 has no Tensor Cores
#endif
```
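On the host side, the same decision can be made at runtime from the device's compute capability. A minimal sketch of the dispatch logic, with the capability passed in as plain integers so it runs anywhere; `pick_gemm_kernel` and the returned names are ours, not part of any API:

```cpp
#include <string>

// Tensor Cores arrived with Volta (compute capability 7.0)
std::string pick_gemm_kernel(int cc_major, int cc_minor) {
    (void)cc_minor;  // the major version alone decides availability here
    return cc_major >= 7 ? "tensor_core_fp16" : "fp32_tiled_fallback";
}
```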
Optimization Opportunities
Our Tensor Core kernel achieves only 40% of cuBLAS performance. The gap comes from: