Performance Benchmarking
Learn how to measure and compare the performance of operators and model configurations using Tiny-DL-Inference's benchmarking utilities.
Overview
The performance benchmarking example (`examples/benchmark-demo.ts`) demonstrates:
- Benchmarking kernel fusion benefits
- Comparing NCHW vs. NHWC memory layouts
- Measuring execution time across multiple iterations
- Calculating speedup ratios and memory traffic reduction
Setup
Imports
```typescript
import {
  GPUContext,
  Tensor,
  Conv2dOperator,
  ReLUOperator,
  Conv2dBiasReLUOperator,
  Benchmark
} from 'tiny-dl-inference';
```
Initialize GPU Context
```typescript
const context = new GPUContext();
await context.initialize();
console.log('GPU Context initialized');
```
Benchmark 1: Kernel Fusion
Kernel fusion combines multiple operations into a single GPU shader pass, reducing memory traffic and improving performance.
Create Test Tensors
```typescript
const inputShape = [1, 32, 56, 56]; // Typical intermediate layer size
const weightShape = [64, 32, 3, 3]; // 64 output channels, 3x3 kernel
const biasShape = [64];

const input = Tensor.zeros(context, inputShape);
const weight = Tensor.zeros(context, weightShape);
const bias = Tensor.zeros(context, biasShape);
```
Define Convolution Parameters
```typescript
const params = {
  kernelSize: [3, 3] as [number, number],
  stride: [1, 1] as [number, number],
  padding: [1, 1] as [number, number],
  useBias: true
};
```
Measure Separate Execution
```typescript
const conv2dOp = new Conv2dOperator(context);
const reluOp = new ReLUOperator(context);

const separateStart = performance.now();
for (let i = 0; i < 50; i++) {
  const convOut = await conv2dOp.forward([input, weight, bias], params);
  const reluOut = await reluOp.forward([convOut]);
  convOut.destroy();
  reluOut.destroy();
}
const separateTime = (performance.now() - separateStart) / 50;
```
Measure Fused Execution
```typescript
const fusedOp = new Conv2dBiasReLUOperator(context);

const fusedStart = performance.now();
for (let i = 0; i < 50; i++) {
  const fusedOut = await fusedOp.forward([input, weight, bias], params);
  fusedOut.destroy();
}
const fusedTime = (performance.now() - fusedStart) / 50;
```
Calculate Speedup
```typescript
const speedup = separateTime / fusedTime;
console.log(`Separate execution: ${separateTime.toFixed(2)}ms`);
console.log(`Fused execution: ${fusedTime.toFixed(2)}ms`);
console.log(`Speedup: ${speedup.toFixed(2)}x`);
console.log(`Memory traffic reduction: ~${((1 - 1 / 3) * 100).toFixed(0)}%`);
```
Why Fusion Helps
| Metric | Separate | Fused | Improvement |
|---|---|---|---|
| GPU Passes | 2 | 1 | 50% reduction |
| Memory Traffic | High | Low | ~67% reduction |
| Intermediate Tensors | Allocated | None | Zero allocation |
Fused operators write the intermediate result directly to the output buffer instead of creating a temporary tensor, reducing both memory allocation and memory bandwidth usage.
Benchmark 2: Memory Layout Comparison
Compare performance between NCHW and NHWC memory layouts for convolution operations.
Create Tensors in Both Layouts
```typescript
const inputNCHW = Tensor.zeros(context, [1, 32, 56, 56], { layout: 'NCHW' });
const inputNHWC = Tensor.zeros(context, [1, 56, 56, 32], { layout: 'NHWC' });
```
Measure NCHW Performance
```typescript
const nchwStart = performance.now();
for (let i = 0; i < 50; i++) {
  const out = await conv2dOp.forward([inputNCHW, weight, bias], params);
  out.destroy();
}
const nchwTime = (performance.now() - nchwStart) / 50;
```
Measure NHWC Performance
```typescript
const nhwcStart = performance.now();
for (let i = 0; i < 50; i++) {
  const out = await conv2dOp.forward([inputNHWC, weight, bias], params);
  out.destroy();
}
const nhwcTime = (performance.now() - nhwcStart) / 50;
```
Compare Results
```typescript
console.log(`NCHW layout: ${nchwTime.toFixed(2)}ms`);
console.log(`NHWC layout: ${nhwcTime.toFixed(2)}ms`);
console.log(`NHWC advantage: ${((nchwTime / nhwcTime - 1) * 100).toFixed(1)}%`);
console.log('(NHWC provides better memory coalescing for spatial operations)');
```
Layout Performance Notes
- NCHW - Natural for convolution operations, better cache locality for channel-wise operations
- NHWC - Better memory coalescing on GPUs, more efficient for spatial operations
Note
Conv2d and MaxPool currently execute in NCHW. NHWC inputs are automatically converted internally.
General Benchmarking Pattern
Use this pattern for benchmarking any operator or model:
```typescript
async function benchmark(
  fn: () => Promise<void>,
  iterations: number = 50,
  warmup: number = 5
): Promise<{ meanMs: number; minMs: number; maxMs: number }> {
  // Warmup
  for (let i = 0; i < warmup; i++) {
    await fn();
  }

  // Measure
  const times: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    await fn();
    const end = performance.now();
    times.push(end - start);
  }

  const mean = times.reduce((a, b) => a + b) / times.length;
  const min = Math.min(...times);
  const max = Math.max(...times);

  console.log(`Mean: ${mean.toFixed(3)}ms`);
  console.log(`Min: ${min.toFixed(3)}ms`);
  console.log(`Max: ${max.toFixed(3)}ms`);

  return { meanMs: mean, minMs: min, maxMs: max };
}
```
Usage
```typescript
const result = await benchmark(async () => {
  const output = await engine.infer(input);
  output.destroy();
});
```
Benchmark Utilities
Benchmark Class
The Benchmark class provides utilities for measuring operator performance:
```typescript
import { Benchmark } from 'tiny-dl-inference';

const benchmark = new Benchmark();
```
Key Methods
| Method | Description |
|---|---|
| `measureOperator()` | Measures operator execution time over multiple iterations |
| `warmup()` | Runs warmup iterations to stabilize GPU state |
Interpreting Results
Speedup Ratio
```typescript
const speedup = baselineTime / optimizedTime;
```
- `speedup > 1` - Optimization is faster
- `speedup < 1` - Baseline is faster
Memory Traffic Reduction
Fused operators reduce memory traffic by eliminating intermediate tensor writes:
```typescript
const reduction = (1 - 1 / numOperations) * 100;
// For 3 operations (conv + bias + relu): ~67% reduction
```
Statistical Significance
Always run enough iterations to get stable results:
- Minimum: 10 iterations
- Recommended: 50-100 iterations
- Production: 500+ iterations for stable averages
Cleanup
Always destroy tensors and context after benchmarking:
```typescript
input.destroy();
weight.destroy();
bias.destroy();
inputNCHW.destroy();
inputNHWC.destroy();
context.destroy();
```
Running the Benchmark
```bash
npx ts-node examples/benchmark-demo.ts
```
The demo will output a performance summary:
```
=== Tiny-DL-Inference Performance Benchmark ===

--- Benchmark 1: Kernel Fusion ---
Separate execution: X.XXms
Fused execution: X.XXms
Speedup: X.XXx
Memory traffic reduction: ~67%

--- Benchmark 2: Memory Layout Comparison ---
NCHW layout: X.XXms
NHWC layout: X.XXms
NHWC advantage: X.X%

--- Performance Summary ---
Kernel fusion provides X.XXx speedup
Memory layout optimization improves performance
Combined optimizations significantly reduce inference time
```
Next Steps
- See MNIST Example for a complete inference pipeline
- Learn about Memory Layout for optimization strategies
- Read the API Reference for detailed operator documentation