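Performance Benchmarking

Learn how to use the Tiny-DL-Inference benchmarking tools to measure and compare the performance of operators and model configurations.

Overview

The performance benchmark example (/examples/benchmark-demo.ts) demonstrates how to:

Benchmark the gains from kernel fusion
Compare the NCHW and NHWC memory layouts
Measure execution time over multiple iterations
Compute speedup and memory traffic reduction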

Setup

Imports

typescript

import {
  GPUContext,
  Tensor,
  Conv2dOperator,
  ReLUOperator,
  Conv2dBiasReLUOperator,
  Benchmark
} from 'tiny-dl-inference';

Initializing the GPU Context

typescript

const context = new GPUContext();
await context.initialize();
console.log('GPU context initialized');

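Benchmark 1: Kernel Fusion

Kernel fusion merges multiple operations into a single GPU shader pass, which reduces memory traffic and improves performance.

Creating the Test Tensors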

typescript

const inputShape = [1, 32, 56, 56];  // typical intermediate-layer size (NCHW)
const weightShape = [64, 32, 3, 3];  // 64 output channels, 3x3 kernel
const biasShape = [64];

const input = Tensor.zeros(context, inputShape);
const weight = Tensor.zeros(context, weightShape);
const bias = Tensor.zeros(context, biasShape);

Defining the Convolution Parameters

typescript

const params = {
  kernelSize: [3, 3] as [number, number],
  stride: [1, 1] as [number, number],
  padding: [1, 1] as [number, number],
  useBias: true
};
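
As a quick check on these parameters, the sketch below works out the output shape they produce for the input above. It uses the standard convolution size formula, out = floor((in + 2 * padding - kernel) / stride) + 1, and assumes a dilation of 1. The `conv2dOutputShape` helper and the `ConvGeometry` type are hypothetical names made up for this illustration and are not part of the tiny-dl-inference API.

```typescript
// Hypothetical helper for illustration; not part of the tiny-dl-inference API.
type Pair = [number, number];

interface ConvGeometry {
  kernelSize: Pair;
  stride: Pair;
  padding: Pair;
}

// Computes the NCHW output shape of a 2D convolution (dilation of 1 assumed).
function conv2dOutputShape(
  inputShape: [number, number, number, number],
  outChannels: number,
  g: ConvGeometry
): [number, number, number, number] {
  const [n, , h, w] = inputShape;
  const outH = Math.floor((h + 2 * g.padding[0] - g.kernelSize[0]) / g.stride[0]) + 1;
  const outW = Math.floor((w + 2 * g.padding[1] - g.kernelSize[1]) / g.stride[1]) + 1;
  return [n, outChannels, outH, outW];
}

const outShape = conv2dOutputShape([1, 32, 56, 56], 64, {
  kernelSize: [3, 3],
  stride: [1, 1],
  padding: [1, 1],
});
console.log(outShape); // [ 1, 64, 56, 56 ]
```

With a 3x3 kernel, stride 1, and padding 1, the spatial size is preserved, so only the channel count changes from 32 to 64.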

Measuring Separate Execution

typescript

const conv2dOp = new Conv2dOperator(context);
const reluOp = new ReLUOperator(context);

const separateStart = performance.now();
for (let i = 0; i < 50; i++) {
  const convOut = await conv2dOp.forward([input, weight, bias], params);
  const reluOut = await reluOp.forward([convOut]);
  convOut.destroy();
  reluOut.destroy();
}
const separateTime = (performance.now() - separateStart) / 50;

Measuring Fused Execution

typescript

const fusedOp = new Conv2dBiasReLUOperator(context);

const fusedStart = performance.now();
for (let i = 0; i < 50; i++) {
  const fusedOut = await fusedOp.forward([input, weight, bias], params);
  fusedOut.destroy();
}
const fusedTime = (performance.now() - fusedStart) / 50;

Computing the Speedup

typescript

const speedup = separateTime / fusedTime;
console.log(`Separate: ${separateTime.toFixed(2)}ms`);
console.log(`Fused: ${fusedTime.toFixed(2)}ms`);
console.log(`Speedup: ${speedup.toFixed(2)}x`);
// Theoretical estimate (1 - 1/3), not a measured value
console.log(`Memory traffic reduction: ~${((1 - 1/3) * 100).toFixed(0)}%`);

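Why Fusion Helps

Metric	Separate	Fused	Improvement
GPU passes	2	1	50% fewer
Memory traffic	High	Low	~67% less
Intermediate tensor	Allocated	None	No allocation

The fused operator applies the bias and ReLU in the same shader pass as the convolution and writes only the final result to the output buffer. No temporary tensor is created, which reduces both memory allocation and memory bandwidth use.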
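
One simplified way to arrive at the ~67% figure is to count how many times the convolution's output tensor crosses the memory bus. The sketch below assumes float32 data and deliberately ignores the input and weight reads, which are the same in both variants. The function names are hypothetical and the byte counts are arithmetic, not measurements.

```typescript
// Simplified traffic model (an assumption for illustration): count only reads and
// writes of the activation tensor produced by the convolution.
const BYTES_PER_F32 = 4;

function elementCount(shape: number[]): number {
  return shape.reduce((acc, dim) => acc * dim, 1);
}

function activationTrafficBytes(outputShape: number[], fused: boolean): number {
  const tensorBytes = elementCount(outputShape) * BYTES_PER_F32;
  // Separate: conv writes convOut, ReLU reads convOut, ReLU writes reluOut (3 transfers).
  // Fused: the final result is written once.
  return fused ? tensorBytes : 3 * tensorBytes;
}

const convOutputShape = [1, 64, 56, 56];
const separateBytes = activationTrafficBytes(convOutputShape, false);
const fusedBytes = activationTrafficBytes(convOutputShape, true);
const reductionPct = (1 - fusedBytes / separateBytes) * 100;

console.log(`separate: ${separateBytes} bytes`); // separate: 2408448 bytes
console.log(`fused: ${fusedBytes} bytes`);       // fused: 802816 bytes
console.log(`reduction: ~${reductionPct.toFixed(0)}%`); // reduction: ~67%
```

Real savings depend on caching, tensor sizes, and the hardware, so treat this as an upper-level estimate rather than a prediction.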

Benchmark 2: Memory Layout Comparison

Compare the performance of the NCHW and NHWC memory layouts in a convolution.

Creating Tensors in Both Layouts

typescript

const inputNCHW = Tensor.zeros(context, [1, 32, 56, 56], { layout: 'NCHW' });
const inputNHWC = Tensor.zeros(context, [1, 56, 56, 32], { layout: 'NHWC' });

Measuring NCHW Performance

typescript

const nchwStart = performance.now();
for (let i = 0; i < 50; i++) {
  const out = await conv2dOp.forward([inputNCHW, weight, bias], params);
  out.destroy();
}
const nchwTime = (performance.now() - nchwStart) / 50;

Measuring NHWC Performance

typescript

const nhwcStart = performance.now();
for (let i = 0; i < 50; i++) {
  const out = await conv2dOp.forward([inputNHWC, weight, bias], params);
  out.destroy();
}
const nhwcTime = (performance.now() - nhwcStart) / 50;

Comparing the Results

typescript

console.log(`NCHW layout: ${nchwTime.toFixed(2)}ms`);
console.log(`NHWC layout: ${nhwcTime.toFixed(2)}ms`);
console.log(`NHWC advantage: ${((nchwTime / nhwcTime - 1) * 100).toFixed(1)}%`);
console.log('(NHWC provides better memory coalescing for spatial operations)');

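Layout Performance Notes

NCHW - a natural fit for convolution, with better cache locality for per-channel operations
NHWC - better memory coalescing on the GPU, more efficient for spatial operations

Note

Conv2d and MaxPool currently execute in NCHW. NHWC inputs are converted automatically and internally, so the NHWC timing in this benchmark includes the cost of that conversion.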
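
To make the difference between the two layouts concrete, the sketch below computes flat buffer offsets for each and converts a small NHWC buffer to NCHW on the CPU. It operates on a plain Float32Array and is illustrative only: the `Dims` type and the function names are made up here, and the library's internal conversion may work differently.

```typescript
// Illustrative only: plain Float32Array code, not the library's internal conversion.
type Dims = { n: number; c: number; h: number; w: number };

// Flat offset of element (n, c, h, w) in NCHW: width varies fastest.
function offsetNCHW(d: Dims, n: number, c: number, h: number, w: number): number {
  return ((n * d.c + c) * d.h + h) * d.w + w;
}

// Flat offset of the same element in NHWC: channel varies fastest.
function offsetNHWC(d: Dims, n: number, c: number, h: number, w: number): number {
  return ((n * d.h + h) * d.w + w) * d.c + c;
}

// Copies NHWC data into a new NCHW buffer.
function nhwcToNchw(src: Float32Array, d: Dims): Float32Array {
  const dst = new Float32Array(src.length);
  for (let n = 0; n < d.n; n++) {
    for (let h = 0; h < d.h; h++) {
      for (let w = 0; w < d.w; w++) {
        for (let c = 0; c < d.c; c++) {
          dst[offsetNCHW(d, n, c, h, w)] = src[offsetNHWC(d, n, c, h, w)];
        }
      }
    }
  }
  return dst;
}

const dims: Dims = { n: 1, c: 2, h: 2, w: 2 };
// NHWC: the two channel values of each pixel sit next to each other.
const nhwc = Float32Array.from([0, 10, 1, 11, 2, 12, 3, 13]);
console.log(Array.from(nhwcToNchw(nhwc, dims))); // [ 0, 1, 2, 3, 10, 11, 12, 13 ]
```

In NCHW all values of channel 0 come first, followed by all values of channel 1, which is why the converted buffer groups 0-3 and 10-13 together.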

General Benchmarking Pattern

Use this pattern to benchmark any operator or model:

typescript

async function benchmark(
  fn: () => Promise<void>,
  iterations: number = 50,
  warmup: number = 5
): Promise<{ meanMs: number; minMs: number; maxMs: number }> {
  // Warm-up: run a few untimed iterations first
  for (let i = 0; i < warmup; i++) {
    await fn();
  }

  // Measurement: time each iteration individually
  const times: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    await fn();
    const end = performance.now();
    times.push(end - start);
  }

  // The initial value 0 keeps reduce() from throwing when iterations is 0
  const mean = times.reduce((a, b) => a + b, 0) / times.length;
  const min = Math.min(...times);
  const max = Math.max(...times);

  console.log(`Mean: ${mean.toFixed(3)}ms`);
  console.log(`Min: ${min.toFixed(3)}ms`);
  console.log(`Max: ${max.toFixed(3)}ms`);

  return { meanMs: mean, minMs: min, maxMs: max };
}

Usage Example

typescript

// `engine` is assumed to be an inference engine instance created elsewhere
const result = await benchmark(async () => {
  const output = await engine.infer(input);
  output.destroy();
});

Benchmarking Utilities

The Benchmark Class

The Benchmark class provides utilities for measuring operator performance:

typescript

import { Benchmark } from 'tiny-dl-inference';

const benchmark = new Benchmark();

Key Methods

Method	Description
`measureOperator()`	Measures operator execution time over multiple iterations
`warmup()`	Runs warm-up iterations to stabilize GPU state

Interpreting the Results

Speedup

typescript

const speedup = baselineTime / optimizedTime;

speedup > 1 - the optimized version is faster
speedup < 1 - the baseline is faster
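
A speedup factor is easy to confuse with a percentage, so the sketch below derives both from the same pair of timings. The `compareTimings` helper is hypothetical, and the timings passed to it are made-up example values rather than measurements.

```typescript
// Hypothetical helper; the timings passed in below are made-up example values.
function compareTimings(baselineMs: number, optimizedMs: number) {
  if (baselineMs <= 0 || optimizedMs <= 0) {
    throw new RangeError('timings must be positive');
  }
  const speedup = baselineMs / optimizedMs;
  return {
    speedup,
    // Gain relative to the baseline, the same form as the "advantage" percentage in Benchmark 2.
    gainPct: (speedup - 1) * 100,
    // Share of the baseline time that was eliminated.
    timeSavedPct: (1 - optimizedMs / baselineMs) * 100,
  };
}

const summary = compareTimings(4, 2);
console.log(summary); // { speedup: 2, gainPct: 100, timeSavedPct: 50 }
```

A 2x speedup therefore corresponds to a 100% gain but only 50% less time, so it helps to state which of the two a reported percentage refers to.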

Memory Traffic Reduction

Fused operators reduce memory traffic by eliminating writes of intermediate tensors:

typescript

const reduction = (1 - 1 / numOperations) * 100;
// For 3 operations (conv + bias + relu): ~67% reduction

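Statistical Significance

Always run enough iterations to get stable results:

Minimum: 10 iterations
Recommended: 50-100 iterations
Production: 500+ iterations for a stable mean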
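
One way to judge whether a run was long enough is to look at the spread of the per-iteration timings, not only their mean. The sketch below is a hypothetical helper, not part of the Benchmark class. It computes the mean, median, sample standard deviation, and coefficient of variation for an array of timings such as the `times` array collected in the general benchmarking pattern.

```typescript
interface TimingStats {
  mean: number;
  median: number;
  stdDev: number;
  cv: number; // coefficient of variation = stdDev / mean
}

// Hypothetical helper for summarizing per-iteration timings in milliseconds.
function summarize(samplesMs: number[]): TimingStats {
  if (samplesMs.length === 0) throw new RangeError('need at least one sample');
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const mean = sorted.reduce((acc, t) => acc + t, 0) / sorted.length;
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Sample standard deviation (n - 1 in the denominator); 0 for a single sample.
  const variance =
    sorted.length > 1
      ? sorted.reduce((acc, t) => acc + (t - mean) ** 2, 0) / (sorted.length - 1)
      : 0;
  const stdDev = Math.sqrt(variance);
  return { mean, median, stdDev, cv: stdDev / mean };
}

// Made-up timings with one outlier, for illustration only.
const stats = summarize([2, 2, 2, 2, 7]);
console.log(stats.mean, stats.median); // 3 2
```

Here a single slow iteration pulls the mean up to 3 ms while the median stays at 2 ms. A large gap between the two, or a high coefficient of variation, suggests running more iterations or reporting the median.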

Cleanup

Always destroy the tensors and the context once benchmarking is finished:

typescript

input.destroy();
weight.destroy();
bias.destroy();
inputNCHW.destroy();
inputNHWC.destroy();
context.destroy();
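
If a benchmark throws partway through, the destroy calls above are never reached. A try/finally wrapper is one way to guard against that. The sketch below uses a hypothetical `withCleanup` helper and stand-in objects in place of real tensors and contexts. It assumes only that each resource exposes a `destroy()` method, as the tensors and context in this guide do.

```typescript
// Any object with a destroy() method; stands in for tensors or a GPU context here.
interface Destroyable {
  destroy(): void;
}

// Hypothetical helper: runs fn, then destroys every resource in reverse order,
// even if fn throws.
async function withCleanup<T>(
  resources: Destroyable[],
  fn: () => Promise<T>
): Promise<T> {
  try {
    return await fn();
  } finally {
    for (const r of [...resources].reverse()) {
      r.destroy();
    }
  }
}

// Stand-in resources that record the order in which they are destroyed.
const destroyed: string[] = [];
const fake = (label: string): Destroyable => ({
  destroy: () => {
    destroyed.push(label);
  },
});

withCleanup([fake('context'), fake('input')], async () => {
  throw new Error('benchmark failed');
}).catch(() => {
  console.log(destroyed); // [ 'input', 'context' ]
});
```

Listing the context first means it is destroyed last, after the tensors that depend on it.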

Running the Benchmark

bash

npx ts-node examples/benchmark-demo.ts

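The demo prints a performance summary:

=== Tiny-DL-Inference Performance Benchmark ===

--- Benchmark 1: Kernel Fusion ---
Separate: X.XXms
Fused: X.XXms
Speedup: X.XXx
Memory traffic reduction: ~67%

--- Benchmark 2: Memory Layout Comparison ---
NCHW layout: X.XXms
NHWC layout: X.XXms
NHWC advantage: X.X%

--- Performance Summary ---
Kernel fusion provides a X.XXx speedup
Memory layout optimization improves performance
Combined optimizations significantly reduce inference time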

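Next Steps

See the MNIST example for a complete inference pipeline
Read about memory layouts to learn optimization strategies
Consult the API reference for detailed operator documentation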
