# Optimization Guide

Performance optimization techniques for Tiny-DL-Inference.

## Overview

Tiny-DL-Inference implements several optimization strategies to maximize inference performance on WebGPU:

- **Kernel Fusion** - reducing memory traffic
- **Memory Layout Optimization** - NCHW vs NHWC
- **Zero-copy Operations** - efficient tensor views
- **Im2Col Algorithm** - convolution optimization
- **Workgroup Tuning** - GPU occupancy optimization
## Kernel Fusion

### Concept

Kernel fusion combines multiple operations into a single GPU kernel, eliminating intermediate memory reads and writes.

### Example: Conv2d + Bias + ReLU
**Without fusion:**

```
Memory: Read → Conv2d → Write  (1)
        Read → Bias   → Write  (2)
        Read → ReLU   → Write  (3)

Total: 6 memory operations
```

**With fusion:**

```
Memory: Read → Conv2d+Bias+ReLU → Write

Total: 2 memory operations
```

**Result:** 3× memory bandwidth reduction
### When to Use Fused Operators
| Scenario | Recommendation |
|---|---|
| Conv → Bias → ReLU | ✅ Use Conv2dBiasReLUOperator |
| Production inference | ✅ Always use fusion |
| Memory-constrained | ✅ Critical optimization |
| Debugging | ❌ Use separate operators |
| Custom activations | ❌ Fuse manually if possible |
### Implementation

```typescript
// ❌ Inefficient: 3 separate kernels
const conv = new Conv2dOperator(context);
const bias = ...; // manual bias addition
const relu = new ReLUOperator(context);

// ✅ Efficient: 1 fused kernel
const fused = new Conv2dBiasReLUOperator(context);
const output = await fused.forward([input, weight, bias], params);
```
## Memory Layout

### NCHW vs NHWC

Two common memory layouts for image data:

| Layout | Format | Typical Framework | Cache-Friendly For |
|---|---|---|---|
| NCHW | [N, C, H, W] | PyTorch | Channel operations |
| NHWC | [N, H, W, C] | TensorFlow | Spatial operations |
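
The difference is easiest to see in the flat-offset arithmetic. A minimal sketch (these helpers are illustrative, not part of the library API):

```typescript
// NCHW: all H*W values of one channel are contiguous, so iterating
// over a single channel is cache-friendly.
function offsetNCHW(n: number, c: number, h: number, w: number,
                    C: number, H: number, W: number): number {
  return ((n * C + c) * H + h) * W + w;
}

// NHWC: all C values of one pixel are contiguous, so iterating over
// the channels of a single spatial position is cache-friendly.
function offsetNHWC(n: number, c: number, h: number, w: number,
                    C: number, H: number, W: number): number {
  return ((n * H + h) * W + w) * C + c;
}
```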
### Current Implementation

Conv2d and MaxPool: **NCHW only**. This choice optimizes for:

- Sequential channel access in convolution
- Cache locality for filter operations
### Layout Conversion

When working with different formats:

```typescript
// Convert NHWC to NCHW for processing
const nchwTensor = await nhwcTensor.convertLayout('NCHW');

// Process...
const output = await conv2d.forward([nchwTensor, weight], params);

// Convert back if needed
const result = await output.convertLayout('NHWC');
```
### Performance Impact
| Operation | Cost |
|---|---|
| Layout conversion | High (full tensor copy) |
| NCHW Conv2d | Optimal |
| NHWC Conv2d | Not supported (convert first) |
**Recommendation:** Keep data in NCHW throughout the pipeline when possible.
## Zero-copy Operations

### Tensor Views

The `reshape()` method creates a view that shares the underlying GPU buffer:

```typescript
const tensor = new Tensor(context, [1, 3, 224, 224]);

// Zero-copy reshape
const flat = tensor.reshape([1, 150528]);
// flat.buffer === tensor.buffer (same GPU buffer)
```
### Benefits
| Metric | Copy | View |
|---|---|---|
| Time | O(N) | O(1) |
| Memory | 2× | 1× |
| GPU Overhead | High | None |
### Use Cases

**Flatten Operation**

```typescript
// Flatten layer (zero overhead)
const flat = convOutput.reshape([batch, -1]);
```

**Shape Adaptation**

```typescript
// Adapt tensor for different layer expectations
const adapted = tensor.reshape([newBatch, newChannels, newH, newW]);
```

**Batch Manipulation**

```typescript
// Reshape batch dimensions
const batched = tensor.reshape([batchSize, -1, channels]);
```
### Limitations

- View tensors cannot be resized beyond the original buffer
- Views share their lifecycle with the parent (destroying the parent invalidates the view; see the sketch below)
- Layout conversions require actual data movement
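
A sketch of the lifecycle pitfall (this assumes tensors expose a `destroy()` method; the exact call may differ):

```typescript
const parent = new Tensor(context, [1, 3, 224, 224]);
const view = parent.reshape([1, 150528]); // shares parent's GPU buffer

parent.destroy(); // the shared buffer is gone...
// `view` now points at a destroyed buffer; any further use of it is invalid
```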
## Im2Col Algorithm

### Concept

Im2Col transforms convolution into matrix multiplication (GEMM), which can be highly optimized.

### Transformation

```
Input:  [N, C, H, W]
   ↓ Im2Col
Matrix: [N*outH*outW, C*kH*kW]

Weight: [K, C, kH, kW]
   ↓ Reshape
Matrix: [K, C*kH*kW]

Output = GEMM(Weight_matrix, Im2Col(Input))
       = [K, N*outH*outW]
   ↓ Reshape
Output: [N, K, outH, outW]
```
### Implementation Utilities

```typescript
import { im2col } from 'tiny-dl-inference';

// Convert image to column format
const col = im2col(input, {
  kernelHeight: 3,
  kernelWidth: 3,
  stride: 1,
  padding: 1
});
```
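
As a worked example of the shapes involved (plain arithmetic, not library code), the 3×3 / stride 1 / padding 1 configuration above on a [1, 3, 224, 224] input yields:

```typescript
const [N, C, H, W] = [1, 3, 224, 224];
const kH = 3, kW = 3, stride = 1, pad = 1;

// Standard convolution output-size formula
const outH = Math.floor((H + 2 * pad - kH) / stride) + 1; // 224
const outW = Math.floor((W + 2 * pad - kW) / stride) + 1; // 224

// Column matrix: [N*outH*outW, C*kH*kW] = [50176, 27]
console.log([N * outH * outW, C * kH * kW]);

// Memory cost: each input element is replicated once per kernel position,
// so the column buffer is roughly kH*kW = 9× the input size here.
```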
### When to Use
| Scenario | Recommendation |
|---|---|
| Large kernels (5×5, 7×7) | Consider Im2Col |
| Custom GEMM implementation | Im2Col required |
| Standard 3×3 convolutions | Direct convolution (current) |
| Grouped convolutions | Specialized algorithm |
### Trade-offs
| Factor | Direct Conv | Im2Col |
|---|---|---|
| Memory | Lower | Higher (column buffer) |
| Small kernels | Faster | Overhead |
| Large kernels | Slower | Faster |
| Implementation | Complex | Simpler |
## Deferred Resource Destruction

### Problem

Premature buffer destruction causes crashes:

```typescript
// ❌ Dangerous: buffer destroyed before the GPU uses it
commandEncoder.copyBufferToBuffer(buffer, ...);
device.queue.submit([commandEncoder.finish()]);
buffer.destroy(); // May crash!
```
### Solution

`deferDestroy()` schedules destruction after GPU work completes:

```typescript
// ✅ Safe: destruction deferred until GPU work completes
commandEncoder.copyBufferToBuffer(buffer, ...);
device.queue.submit([commandEncoder.finish()]);
context.deferDestroy(buffer); // Safe!
```
### Implementation

```typescript
class GPUContext {
  private pendingCleanup: Set<Promise<void>> = new Set();

  deferDestroy(buffer: GPUBuffer | null): void {
    if (!buffer) return;
    const cleanup = this.waitForSubmittedWork()
      .then(() => buffer.destroy())
      .catch(() => { /* ignore */ });
    this.pendingCleanup.add(cleanup);
    cleanup.finally(() => this.pendingCleanup.delete(cleanup));
  }

  async sync(): Promise<void> {
    await this.waitForSubmittedWork();
    await Promise.allSettled([...this.pendingCleanup]);
    this.pendingCleanup.clear();
  }
}
```
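
The implementation above relies on `waitForSubmittedWork()`, which is not shown. Assuming the context holds a `GPUDevice`, a minimal sketch could wrap the standard WebGPU `onSubmittedWorkDone()` promise:

```typescript
// Sketch only: how waitForSubmittedWork() might be implemented.
class GPUContextBase {
  constructor(private device: GPUDevice) {}

  // Resolves once all work submitted to the queue so far has completed.
  protected waitForSubmittedWork(): Promise<void> {
    return this.device.queue.onSubmittedWorkDone();
  }
}
```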
## Benchmarking

### Built-in Benchmark Utility

```typescript
import { Benchmark } from 'tiny-dl-inference';

const benchmark = new Benchmark();
const result = await benchmark.measureOperator(
  operator, // Operator instance
  inputs,   // Input tensors
  params,   // Operator parameters
  100       // Number of iterations
);

console.log({
  mean: result.meanMs,     // Average execution time
  stdDev: result.stdDevMs, // Standard deviation
  min: result.minMs,       // Minimum time
  max: result.maxMs        // Maximum time
});
```
### Interpreting Results
| Metric | Healthy Range | Action if High |
|---|---|---|
| Mean | Consistent | Baseline |
| StdDev | < 10% of mean | Check for variations |
| Min | Close to mean | Good |
| Max | < 2× mean | Check for outliers |
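
As a rough automated check against the table above (a sketch using the `result` fields from the snippet earlier; the thresholds are the table's, not hard limits):

```typescript
// Flag results that fall outside the healthy ranges above
if (result.stdDevMs > 0.1 * result.meanMs) {
  console.warn('High variance (>10% of mean): check for background GPU load or throttling');
}
if (result.maxMs > 2 * result.meanMs) {
  console.warn('Outliers (max > 2× mean): consider more warmup iterations');
}
```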
### Profiling Tips

**Warmup:** Run a few iterations before measuring:

```typescript
for (let i = 0; i < 5; i++) {
  await operator.forward(inputs, params);
}
// Now measure
```

**Isolate Variables:** Test one operator at a time.

**Multiple Runs:** Average across multiple benchmark sessions (see the sketch after this list).

**Vary Input Sizes:** Test with realistic data sizes.
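
For the multiple-runs tip, a sketch that reuses the Benchmark utility shown above:

```typescript
// Average the mean latency across several independent sessions
const sessions = 3;
let sum = 0;
for (let s = 0; s < sessions; s++) {
  const r = await benchmark.measureOperator(operator, inputs, params, 100);
  sum += r.meanMs;
}
console.log(`Mean over ${sessions} sessions: ${(sum / sessions).toFixed(2)} ms`);
```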
## Performance Checklist

### Before Deployment

- [ ] Use fused operators where applicable
- [ ] Minimize layout conversions
- [ ] Profile with realistic inputs
- [ ] Test on target hardware
- [ ] Verify memory cleanup
- [ ] Benchmark against baselines

### Runtime Optimizations

- [ ] Reuse tensors when possible
- [ ] Batch multiple inferences
- [ ] Use `deferDestroy()` for temporaries
- [ ] Avoid unnecessary downloads
- [ ] Pre-allocate buffers
### Code Patterns

```typescript
// ✅ Good: Pre-allocate output once, reuse across iterations
const output = new Tensor(context, outputShape);
for (const input of inputs) {
  await operator.forward([input, output], params);
}

// ❌ Bad: Allocate in loop
for (const input of inputs) {
  const output = new Tensor(context, outputShape); // Slow!
  await operator.forward([input, output], params);
}
```
## Hardware-Specific Tuning

### Workgroup Sizes

The default workgroup size is 256. The optimal size varies by GPU:
| GPU Architecture | Optimal Workgroup Size |
|---|---|
| NVIDIA (Compute 7.0+) | 256 or 512 |
| AMD RDNA | 128 or 256 |
| Intel Xe | 256 |
| Apple M1/M2 | 256 or 512 |
### Testing Workgroup Performance

```typescript
// Benchmark different workgroup sizes
for (const wgSize of [128, 256, 512]) {
  // Modify shader to use wgSize
  const result = await benchmarkWithWorkgroupSize(wgSize);
  console.log(`Workgroup ${wgSize}: ${result.meanMs}ms`);
}
```
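
The `// Modify shader to use wgSize` step above is not spelled out. Since WGSL fixes the workgroup size when the shader module is created, one common approach (sketched here; `makeShaderSource` is a hypothetical helper) is string templating:

```typescript
// Sketch: generate WGSL with the workgroup size baked in at creation time
function makeShaderSource(wgSize: number): string {
  return `
    @compute @workgroup_size(${wgSize})
    fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
      // ... kernel body ...
    }
  `;
}

// Each candidate size then needs its own shader module and pipeline:
// device.createShaderModule({ code: makeShaderSource(wgSize) })
```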
## Optimization Summary
| Technique | Impact | Effort |
|---|---|---|
| Kernel Fusion | High | Low (use fused ops) |
| Zero-copy Views | High | Low (reshape vs copy) |
| Layout Consistency | Medium | Medium |
| Deferred Cleanup | High | Low (automatic) |
| Buffer Reuse | Medium | Medium |
| Workgroup Tuning | Medium-High | High |
**Priority Order:**

1. Use kernel fusion
2. Leverage zero-copy views
3. Keep consistent layouts
4. Profile and tune workgroups