Requirements Document: Mini-Inference Engine
Introduction
This project implements a lightweight neural network inference engine focused on GEMM (General Matrix Multiply) optimization techniques. The goal is to reach 70%-80% of cuBLAS single-precision GEMM throughput.
Glossary
| Term | Definition |
|---|---|
| Inference Engine | Neural network inference engine responsible for loading model weights and executing forward propagation |
| GEMM | General Matrix Multiply, C = α·A·B + β·C |
| Kernel | CUDA kernel function executed in parallel on GPU |
| Tiling | Blocking technique that divides large matrices into smaller tiles to utilize shared memory |
| Memory Coalescing | Access pattern in which threads of a warp touch contiguous addresses, letting the hardware combine them into few wide transactions and maximize bandwidth utilization |
| Double Buffering | Use of two alternating buffers so the next tile is prefetched while the current one is computed, hiding memory latency |
| Register Blocking | Having each thread accumulate a small sub-tile of outputs in registers, raising compute density (arithmetic per memory access) |
| Kernel Fusion | Operator fusion that combines multiple operations into a single kernel |
Requirements
R1: Naive Matrix Multiplication
User Story: As a developer, I want to implement a basic matrix multiplication kernel to establish a performance baseline.
| ID | Acceptance Criteria |
|---|---|
| R1.1 | Support arbitrary size matrix multiplication C = A × B |
| R1.2 | Given input matrices A(M×K) and B(K×N), correctly output C(M×N) |
| R1.3 | Each thread computes one output element |
| R1.4 | Result matches CPU reference implementation with error < 1e-5 |
| R1.5 | Record execution time and GFLOPS |
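A minimal sketch of what the R1 baseline could look like. The kernel name, block shape, and row-major layout are assumptions of this sketch, not requirements:

```cuda
// R1 sketch: one thread per output element of C = A * B (R1.3).
// Matrices are row-major; the kernel name and launch shape are illustrative.
__global__ void gemm_naive(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {            // R1.1/R1.2: arbitrary sizes
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Example launch: 16x16 thread blocks covering the MxN output.
// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (M + 15) / 16);
// gemm_naive<<<grid, block>>>(dA, dB, dC, M, N, K);
```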
R2: Tiling Optimization
User Story: As a developer, I want to use the tiling technique to optimize matrix multiplication and reduce global memory accesses.
| ID | Acceptance Criteria |
|---|---|
| R2.1 | Divide input matrices into 32×32 tiles |
| R2.2 | Load tile data from global memory to shared memory |
| R2.3 | Compute using shared memory data |
| R2.4 | Correctly handle boundary conditions |
| R2.5 | Achieve ≥ 5x speedup over the naive kernel (R1) |
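A minimal sketch of the tiled kernel, assuming row-major storage and the 32×32 tiles of R2.1 (names are illustrative):

```cuda
// R2 sketch: stage 32x32 tiles of A and B in shared memory (R2.2),
// multiply out of shared memory (R2.3), zero-pad at the edges (R2.4).
#define TILE 32

__global__ void gemm_tiled(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int ka = t * TILE + threadIdx.x;   // column of A this thread loads
        int kb = t * TILE + threadIdx.y;   // row of B this thread loads
        As[threadIdx.y][threadIdx.x] = (row < M && ka < K) ? A[row * K + ka] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (kb < K && col < N) ? B[kb * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

Each input element is now read from global memory once per tile pass instead of once per output element, which is where the R2.5 speedup comes from.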
R3: Memory Coalescing
User Story: As a developer, I want to optimize memory access patterns to maximize bandwidth utilization.
| ID | Acceptance Criteria |
|---|---|
| R3.1 | Threads in the same warp access contiguous memory addresses |
| R3.2 | Matrix A uses row-major access pattern |
| R3.3 | Matrix B accesses are coalesced (consecutive threads in a warp read consecutive addresses) |
| R3.4 | Achieve ≥ 20% performance improvement over the tiled kernel (R2) |
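The access-pattern difference R3 targets can be shown in isolation. A hypothetical pair of copy kernels, one coalesced and one strided; both are illustrations, not part of the engine:

```cuda
// R3 illustration: the index expression alone decides coalescing.
// 'rows'/'cols' describe a row-major matrix.
__global__ void copy_coalesced(const float* __restrict__ in,
                               float* __restrict__ out, int rows, int cols) {
    int r = blockIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows && c < cols)
        // Consecutive threadIdx.x -> consecutive addresses: a warp's 32
        // loads combine into one contiguous 128-byte segment (R3.1).
        out[r * cols + c] = in[r * cols + c];
}

__global__ void copy_strided(const float* __restrict__ in,
                             float* __restrict__ out, int rows, int cols) {
    int r = blockIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows && c < cols)
        // Adjacent threads are 'rows' floats apart, so one warp access
        // splits into many separate memory transactions.
        out[c * rows + r] = in[c * rows + r];
}
```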
R4: Double Buffering
User Story: As a developer, I want to implement double buffering to hide memory latency.
| ID | Acceptance Criteria |
|---|---|
| R4.1 | Use two sets of shared memory buffers |
| R4.2 | Prefetch next tile while computing current tile |
| R4.3 | Achieve overlap between computation and data transfer |
| R4.4 | Achieve ≥ 15% performance improvement over the memory-coalesced kernel (R3) |
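A minimal double-buffered variant of the R2 sketch, assuming the same 32×32 tiles; one `__syncthreads()` per iteration, names illustrative:

```cuda
// R4 sketch: two shared-memory buffers (R4.1); tile t+1 is prefetched
// while tile t is consumed (R4.2/R4.3).
#define TILE 32

__global__ void gemm_double_buffered(const float* A, const float* B, float* C,
                                     int M, int N, int K) {
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int numTiles = (K + TILE - 1) / TILE;
    float acc = 0.0f;

    // Preload tile 0 into buffer 0.
    As[0][threadIdx.y][threadIdx.x] =
        (row < M && threadIdx.x < K) ? A[row * K + threadIdx.x] : 0.0f;
    Bs[0][threadIdx.y][threadIdx.x] =
        (threadIdx.y < K && col < N) ? B[threadIdx.y * N + col] : 0.0f;
    __syncthreads();

    for (int t = 0; t < numTiles; ++t) {
        int cur = t & 1, nxt = 1 - cur;
        // Prefetch tile t+1 into the idle buffer; safe without an extra
        // sync because no thread is still reading that buffer.
        if (t + 1 < numTiles) {
            int ka = (t + 1) * TILE + threadIdx.x;
            int kb = (t + 1) * TILE + threadIdx.y;
            As[nxt][threadIdx.y][threadIdx.x] = (row < M && ka < K) ? A[row * K + ka] : 0.0f;
            Bs[nxt][threadIdx.y][threadIdx.x] = (kb < K && col < N) ? B[kb * N + col] : 0.0f;
        }
        // Compute on the current buffer while the prefetch is in flight.
        for (int k = 0; k < TILE; ++k)
            acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];
        __syncthreads();   // tile t consumed and tile t+1 fully written
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}
```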
R5: Register Blocking
User Story: As a developer, I want to optimize register usage to improve compute density.
| ID | Acceptance Criteria |
|---|---|
| R5.1 | Use registers to store intermediate results |
| R5.2 | Avoid shared memory bank conflicts |
| R5.3 | Use vectorized loads (float4) |
| R5.4 | Achieve 70%-80% of cuBLAS performance |
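A simplified sketch of register blocking with float4 loads. Real implementations tune the tile sizes; this sketch assumes M, N, K are multiples of its (illustrative) 64×64×16 tiling, and all names are illustrative:

```cuda
// R5 sketch: 16x16 threads per block; each thread owns a 4x4 register
// micro-tile (R5.1) and the A/B tiles are filled with float4 loads (R5.3).
// Launch: dim3 block(16, 16); dim3 grid(N / BN, M / BM);
#define BM 64
#define BN 64
#define BK 16
#define TM 4
#define TN 4

__global__ void gemm_regblock(const float* A, const float* B, float* C,
                              int M, int N, int K) {
    __shared__ float As[BK][BM];   // A tile stored k-major (transposed), so
                                   // warp reads of As[k][...] below stay
                                   // conflict-free (R5.2)
    __shared__ float Bs[BK][BN];
    int tid  = threadIdx.y * blockDim.x + threadIdx.x;
    int cRow = blockIdx.y * BM + threadIdx.y * TM;
    int cCol = blockIdx.x * BN + threadIdx.x * TN;
    float acc[TM][TN] = {};        // R5.1: intermediate results in registers

    for (int t = 0; t < K; t += BK) {
        // 256 threads x one float4 each fill the 64x16 and 16x64 tiles.
        int ar = tid / (BK / 4), ac = (tid % (BK / 4)) * 4;
        float4 a4 = *reinterpret_cast<const float4*>(
            &A[(blockIdx.y * BM + ar) * K + t + ac]);
        As[ac + 0][ar] = a4.x;  As[ac + 1][ar] = a4.y;
        As[ac + 2][ar] = a4.z;  As[ac + 3][ar] = a4.w;

        int br = tid / (BN / 4), bc = (tid % (BN / 4)) * 4;
        *reinterpret_cast<float4*>(&Bs[br][bc]) =
            *reinterpret_cast<const float4*>(&B[(t + br) * N + blockIdx.x * BN + bc]);
        __syncthreads();

        for (int k = 0; k < BK; ++k) {
            float a[TM], b[TN];
            for (int i = 0; i < TM; ++i) a[i] = As[k][threadIdx.y * TM + i];
            for (int j = 0; j < TN; ++j) b[j] = Bs[k][threadIdx.x * TN + j];
            for (int i = 0; i < TM; ++i)
                for (int j = 0; j < TN; ++j)
                    acc[i][j] += a[i] * b[j];   // 16 FMAs per 8 shared loads
        }
        __syncthreads();
    }
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            C[(cRow + i) * N + cCol + j] = acc[i][j];
}
```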
R6: Kernel Fusion
User Story: As a developer, I want to fuse MatMul + Bias + ReLU into a single kernel to reduce memory reads/writes.
| ID | Acceptance Criteria |
|---|---|
| R6.1 | Single kernel completes Y = ReLU(X × W + bias) |
| R6.2 | Add bias directly in registers |
| R6.3 | Apply ReLU directly in registers |
| R6.4 | Reduce time by ≥ 30% compared to separate kernels |
| R6.5 | Support optional bias and activation configuration |
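A minimal sketch of the fused epilogue, shown on the naive loop for clarity; in practice the same register epilogue attaches to the optimized R5 kernel. Names and flags are illustrative:

```cuda
// R6 sketch: Y = ReLU(X * W + bias) in one kernel (R6.1). The epilogue
// operates on the register accumulator, so bias and ReLU cost no extra
// global-memory round trip (R6.2/R6.3).
__global__ void gemm_bias_relu(const float* X, const float* W,
                               const float* bias, float* Y,
                               int M, int N, int K,
                               bool use_bias, bool use_relu) {   // R6.5
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += X[row * K + k] * W[k * N + col];
    if (use_bias) acc += bias[col];          // R6.2: bias added in a register
    if (use_relu) acc = fmaxf(acc, 0.0f);    // R6.3: ReLU applied in a register
    Y[row * N + col] = acc;
}
```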
R7: Weight Loading and Inference
User Story: As a developer, I want to load pretrained weights and execute inference to validate engine correctness.
| ID | Acceptance Criteria |
|---|---|
| R7.1 | Support loading weights from binary files |
| R7.2 | Support a simple fully-connected network (e.g., an MNIST classifier) |
| R7.3 | Correctly execute multi-layer forward propagation |
| R7.4 | MNIST classification accuracy matches the reference implementation |
| R7.5 | Report per-layer execution time |
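A minimal host-side loader sketch. The file layout (row-major float32 weights followed by the bias vector) is an assumption; this document does not fix a serialization format:

```cuda
// R7 sketch: load one dense layer's weights from a raw binary file into
// preallocated device buffers d_W and d_b.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

bool load_layer(const char* path, float* d_W, float* d_b,
                int in_dim, int out_dim) {
    std::vector<float> W(static_cast<size_t>(in_dim) * out_dim), b(out_dim);
    FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    bool ok = std::fread(W.data(), sizeof(float), W.size(), f) == W.size()
           && std::fread(b.data(), sizeof(float), b.size(), f) == b.size();
    std::fclose(f);
    if (!ok) return false;   // truncated or malformed file (see R9)
    return cudaMemcpy(d_W, W.data(), W.size() * sizeof(float),
                      cudaMemcpyHostToDevice) == cudaSuccess
        && cudaMemcpy(d_b, b.data(), b.size() * sizeof(float),
                      cudaMemcpyHostToDevice) == cudaSuccess;
}
```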
R8: Performance Benchmarking
User Story: As a developer, I want to run performance benchmarks to quantify optimization results.
| ID | Acceptance Criteria |
|---|---|
| R8.1 | Test square matrices (M = N = K) of size 256, 512, 1024, 2048, 4096 |
| R8.2 | Record GFLOPS and cuBLAS performance ratio |
| R8.3 | Generate performance comparison report |
| R8.4 | Validate numerical correctness for all versions |
| R8.5 | Report mean and standard deviation over multiple iterations |
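A minimal timing sketch with CUDA events; warm-up and iteration counts are illustrative, and the GFLOPS conversion follows from the 2·M·N·K flops of one GEMM:

```cuda
// R8 sketch: time any kernel-launching callable over 'iters' runs and
// report mean and standard deviation in milliseconds (R8.5).
#include <cmath>
#include <cuda_runtime.h>

template <typename F>
float time_kernel_ms(F launch, int iters, float* stddev_ms) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    launch();                              // warm-up run, not timed
    cudaDeviceSynchronize();
    float sum = 0.f, sumsq = 0.f;
    for (int i = 0; i < iters; ++i) {
        cudaEventRecord(start);
        launch();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);
        sum += ms;
        sumsq += ms * ms;
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    float mean = sum / iters;
    *stddev_ms = std::sqrt(std::fmax(sumsq / iters - mean * mean, 0.f));
    return mean;
}

// An MxNxK GEMM performs 2*M*N*K flops, so (R8.2):
// double gflops = 2.0 * M * N * K / (mean_ms * 1e6);
```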
R9: Error Handling and Resource Management
User Story: As a developer, I want robust error handling to safely manage exceptional conditions.
| ID | Acceptance Criteria |
|---|---|
| R9.1 | CUDA API failures return descriptive errors |
| R9.2 | Safely release resources on memory allocation failures |
| R9.3 | Correctly release all GPU resources on program exit |
| R9.4 | Detect and report dimension mismatches |
| R9.5 | Convert CUDA error codes to human-readable messages |
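A minimal sketch of the two usual building blocks: a checking macro that converts CUDA error codes to readable messages (R9.5), and an RAII wrapper that releases device memory on every exit path (R9.2/R9.3). Names are illustrative:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// R9.1/R9.5: report where and why a CUDA call failed, in readable form.
// A production version would also propagate the error to the caller.
#define CUDA_CHECK(call)                                                 \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error %s at %s:%d: %s\n",         \
                         cudaGetErrorName(err_), __FILE__, __LINE__,     \
                         cudaGetErrorString(err_));                      \
        }                                                                \
    } while (0)

// R9.2/R9.3: the destructor frees device memory even on early returns
// or exceptions, so cleanup cannot be skipped.
struct DeviceBuffer {
    float* ptr = nullptr;
    explicit DeviceBuffer(size_t n) { CUDA_CHECK(cudaMalloc(&ptr, n * sizeof(float))); }
    ~DeviceBuffer() { if (ptr) cudaFree(ptr); }
    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;
};
```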
Correctness Properties
P1: Matrix Multiplication Correctness
For any matrices A(M×K) and B(K×N), the GPU result C matches the CPU reference within a maximum absolute error of 1e-5.
Validates: R1.1, R1.2, R1.4
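A minimal sketch of the P1 check; the inner CPU loop doubles as the R1.4 reference implementation:

```cuda
// P1 sketch: compare a GPU result (copied back to the host) against a
// CPU reference with the 1e-5 maximum-error bound.
#include <cmath>

bool verify(const float* gpu_C, const float* A, const float* B,
            int M, int N, int K, float tol = 1e-5f) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k)
                ref += A[i * K + k] * B[k * N + j];
            if (std::fabs(gpu_C[i * N + j] - ref) > tol)
                return false;
        }
    return true;
}
```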
P2: Optimized GEMM Equivalence
For any matrices A and B, all optimized implementations produce results numerically equivalent to the naive kernel.
Validates: R2.4, R8.4
P3: Kernel Fusion Correctness
For any input X, weight W, and bias b, the fused output equals that of sequentially executed MatMul, Bias, and ReLU kernels.
Validates: R6.1, R6.5
P4: Weight Serialization Round-Trip
For any valid weights, save and load produces bit-identical results.
Validates: R7.1
P5: Multi-Layer Forward Pass Consistency
For any input batch, forward pass produces equivalent results to sequential layer execution.
Validates: R7.3
P6: Dimension Mismatch Detection
For matrices where A.cols != B.rows, the engine detects and reports the error before launching any computation.
Validates: R9.4