Performance Benchmarking
Learn how to measure and compare the performance of operators and model configurations using Tiny-DL-Inference's benchmarking utilities.
Overview
The performance benchmarking example (`examples/benchmark-demo.ts`) demonstrates:
- Benchmarking kernel fusion benefits
- Comparing NCHW vs. NHWC memory layouts
- Measuring execution time across multiple iterations
- Calculating speedup ratios and memory traffic reduction
Setup
Imports
```typescript
import {
  GPUContext,
  Tensor,
  Conv2dOperator,
  ReLUOperator,
  Conv2dBiasReLUOperator,
  Benchmark
} from 'tiny-dl-inference';
```
Initialize GPU Context
```typescript
const context = new GPUContext();
await context.initialize();
console.log('GPU Context initialized');
```
Benchmark 1: Kernel Fusion
Kernel fusion combines multiple operations into a single GPU shader pass, reducing memory traffic and improving performance.
Create Test Tensors
```typescript
const inputShape = [1, 32, 56, 56]; // Typical intermediate layer size
const weightShape = [64, 32, 3, 3]; // 64 output channels, 3x3 kernel
const biasShape = [64];

const input = Tensor.zeros(context, inputShape);
const weight = Tensor.zeros(context, weightShape);
const bias = Tensor.zeros(context, biasShape);
```
Define Convolution Parameters
```typescript
const params = {
  kernelSize: [3, 3] as [number, number],
  stride: [1, 1] as [number, number],
  padding: [1, 1] as [number, number],
  useBias: true
};
```
Measure Separate Execution
```typescript
const conv2dOp = new Conv2dOperator(context);
const reluOp = new ReLUOperator(context);

const separateStart = performance.now();
for (let i = 0; i < 50; i++) {
  const convOut = await conv2dOp.forward([input, weight, bias], params);
  const reluOut = await reluOp.forward([convOut]);
  convOut.destroy();
  reluOut.destroy();
}
const separateTime = (performance.now() - separateStart) / 50;
```
Measure Fused Execution
```typescript
const fusedOp = new Conv2dBiasReLUOperator(context);

const fusedStart = performance.now();
for (let i = 0; i < 50; i++) {
  const fusedOut = await fusedOp.forward([input, weight, bias], params);
  fusedOut.destroy();
}
const fusedTime = (performance.now() - fusedStart) / 50;
```
Calculate Speedup
```typescript
const speedup = separateTime / fusedTime;
console.log(`Separate execution: ${separateTime.toFixed(2)}ms`);
console.log(`Fused execution: ${fusedTime.toFixed(2)}ms`);
console.log(`Speedup: ${speedup.toFixed(2)}x`);
console.log(`Memory traffic reduction: ~${((1 - 1 / 3) * 100).toFixed(0)}%`);
```
Why Fusion Helps
| Metric | Separate | Fused | Improvement |
|---|---|---|---|
| GPU Passes | 2 | 1 | 50% reduction |
| Memory Traffic | High | Low | ~67% reduction |
| Intermediate Tensors | Allocated | None | Zero allocation |
Fused operators write the intermediate result directly to the output buffer instead of creating a temporary tensor, reducing both memory allocation and memory bandwidth usage.
Benchmark 2: Memory Layout Comparison
Compare performance between NCHW and NHWC memory layouts for convolution operations.
Create Tensors in Both Layouts
```typescript
const inputNCHW = Tensor.zeros(context, [1, 32, 56, 56], { layout: 'NCHW' });
const inputNHWC = Tensor.zeros(context, [1, 56, 56, 32], { layout: 'NHWC' });
```
Measure NCHW Performance
```typescript
const nchwStart = performance.now();
for (let i = 0; i < 50; i++) {
  const out = await conv2dOp.forward([inputNCHW, weight, bias], params);
  out.destroy();
}
const nchwTime = (performance.now() - nchwStart) / 50;
```
Measure NHWC Performance
```typescript
const nhwcStart = performance.now();
for (let i = 0; i < 50; i++) {
  const out = await conv2dOp.forward([inputNHWC, weight, bias], params);
  out.destroy();
}
const nhwcTime = (performance.now() - nhwcStart) / 50;
```
Compare Results
```typescript
console.log(`NCHW layout: ${nchwTime.toFixed(2)}ms`);
console.log(`NHWC layout: ${nhwcTime.toFixed(2)}ms`);
console.log(`NHWC advantage: ${((nchwTime / nhwcTime - 1) * 100).toFixed(1)}%`);
console.log('(NHWC provides better memory coalescing for spatial operations)');
```
Layout Performance Notes
- NCHW - Natural for convolution operations, better cache locality for channel-wise operations
- NHWC - Better memory coalescing on GPUs, more efficient for spatial operations
Note
Conv2d and MaxPool currently execute in NCHW. NHWC inputs are automatically converted internally.
General Benchmarking Pattern
Use this pattern for benchmarking any operator or model:
```typescript
async function benchmark(
  fn: () => Promise<void>,
  iterations: number = 50,
  warmup: number = 5
): Promise<{ meanMs: number; minMs: number; maxMs: number }> {
  // Warmup
  for (let i = 0; i < warmup; i++) {
    await fn();
  }

  // Measure
  const times: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    await fn();
    const end = performance.now();
    times.push(end - start);
  }

  const mean = times.reduce((a, b) => a + b) / times.length;
  const min = Math.min(...times);
  const max = Math.max(...times);

  console.log(`Mean: ${mean.toFixed(3)}ms`);
  console.log(`Min: ${min.toFixed(3)}ms`);
  console.log(`Max: ${max.toFixed(3)}ms`);

  return { meanMs: mean, minMs: min, maxMs: max };
}
```
Usage
```typescript
const result = await benchmark(async () => {
  const output = await engine.infer(input);
  output.destroy();
});
```
Benchmark Utilities
Benchmark Class
The Benchmark class provides utilities for measuring operator performance:
```typescript
import { Benchmark } from 'tiny-dl-inference';

const benchmark = new Benchmark();
```
Key Methods
| Method | Description |
|---|---|
| `measureOperator()` | Measures operator execution time over multiple iterations |
| `warmup()` | Runs warmup iterations to stabilize GPU state |
Interpreting Results
Speedup Ratio
```typescript
const speedup = baselineTime / optimizedTime;
```
- `speedup > 1` - Optimization is faster
- `speedup < 1` - Baseline is faster
Memory Traffic Reduction
Fused operators reduce memory traffic by eliminating intermediate tensor writes:
```typescript
const reduction = (1 - 1 / numOperations) * 100;
// For 3 operations (conv + bias + relu): ~67% reduction
```
Statistical Significance
Always run enough iterations to get stable results:
- Minimum: 10 iterations
- Recommended: 50-100 iterations
- Production: 500+ iterations for stable averages
Cleanup
Always destroy tensors and context after benchmarking:
```typescript
input.destroy();
weight.destroy();
bias.destroy();
inputNCHW.destroy();
inputNHWC.destroy();
context.destroy();
```
Running the Benchmark
```bash
npx ts-node examples/benchmark-demo.ts
```
The demo will output a performance summary:
```
=== Tiny-DL-Inference Performance Benchmark ===

--- Benchmark 1: Kernel Fusion ---
Separate execution: X.XXms
Fused execution: X.XXms
Speedup: X.XXx
Memory traffic reduction: ~67%

--- Benchmark 2: Memory Layout Comparison ---
NCHW layout: X.XXms
NHWC layout: X.XXms
NHWC advantage: X.X%

--- Performance Summary ---
Kernel fusion provides X.XXx speedup
Memory layout optimization improves performance
Combined optimizations significantly reduce inference time
```
Next Steps
- See MNIST Example for a complete inference pipeline
- Learn about Memory Layout for optimization strategies
- Read the API Reference for detailed operator documentation