# Optimization Guide

Performance optimization techniques for Tiny-DL-Inference.

## Overview

Tiny-DL-Inference implements several optimization strategies to maximize inference performance on WebGPU:

- **Kernel Fusion** - reducing memory traffic
- **Memory Layout Optimization** - NCHW vs NHWC
- **Zero-copy Operations** - efficient tensor views
- **Im2Col Algorithm** - convolution optimization
- **Workgroup Tuning** - GPU occupancy optimization
## Kernel Fusion

### Concept

Kernel fusion combines multiple operations into a single GPU kernel, eliminating intermediate memory reads and writes.

### Example: Conv2d + Bias + ReLU
**Without fusion:**

```
Memory: Read → Conv2d → Write  (1)
        Read → Bias   → Write  (2)
        Read → ReLU   → Write  (3)

Total: 6 memory operations
```

**With fusion:**

```
Memory: Read → Conv2d+Bias+ReLU → Write

Total: 2 memory operations
```

**Result:** 3× memory bandwidth reduction
### When to Use Fused Operators
| Scenario | Recommendation |
|---|---|
| Conv → Bias → ReLU | ✅ Use Conv2dBiasReLUOperator |
| Production inference | ✅ Always use fusion |
| Memory-constrained | ✅ Critical optimization |
| Debugging | ❌ Use separate operators |
| Custom activations | ❌ Fuse manually if possible |
### Implementation

```typescript
// ❌ Inefficient: 3 separate kernels
const conv = new Conv2dOperator(context);
const bias = ...; // manual bias addition
const relu = new ReLUOperator(context);

// ✅ Efficient: 1 fused kernel
const fused = new Conv2dBiasReLUOperator(context);
const output = await fused.forward([input, weight, bias], params);
```
## Memory Layout

### NCHW vs NHWC

Two common memory layouts for image data:

| Layout | Format | Typical Framework | Cache-Friendly For |
|---|---|---|---|
| NCHW | [N, C, H, W] | PyTorch | Channel operations |
| NHWC | [N, H, W, C] | TensorFlow | Spatial operations |
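
The difference is easiest to see in the flat-offset arithmetic. A minimal sketch (these helpers are illustrative, not part of the library API):

```typescript
// NCHW: all H*W values of one channel are contiguous, so iterating
// over a single channel is cache-friendly.
function offsetNCHW(n: number, c: number, h: number, w: number,
                    C: number, H: number, W: number): number {
  return ((n * C + c) * H + h) * W + w;
}

// NHWC: all C values of one pixel are contiguous, so iterating over
// the channels of a single spatial position is cache-friendly.
function offsetNHWC(n: number, c: number, h: number, w: number,
                    C: number, H: number, W: number): number {
  return ((n * H + h) * W + w) * C + c;
}
```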
### Current Implementation

Conv2d and MaxPool: **NCHW only**. This choice optimizes for:

- Sequential channel access in convolution
- Cache locality for filter operations
### Layout Conversion

When working with different formats:

```typescript
// Convert NHWC to NCHW for processing
const nchwTensor = await nhwcTensor.convertLayout('NCHW');

// Process...
const output = await conv2d.forward([nchwTensor, weight], params);

// Convert back if needed
const result = await output.convertLayout('NHWC');
```
### Performance Impact
| Operation | Cost |
|---|---|
| Layout conversion | High (full tensor copy) |
| NCHW Conv2d | Optimal |
| NHWC Conv2d | Not supported (convert first) |
**Recommendation:** Keep data in NCHW throughout the pipeline when possible.
## Zero-copy Operations

### Tensor Views

The `reshape()` method creates a view that shares the underlying GPU buffer:

```typescript
const tensor = new Tensor(context, [1, 3, 224, 224]);

// Zero-copy reshape
const flat = tensor.reshape([1, 150528]);
// flat.buffer === tensor.buffer (same GPU buffer)
```
### Benefits
| Metric | Copy | View |
|---|---|---|
| Time | O(N) | O(1) |
| Memory | 2× | 1× |
| GPU Overhead | High | None |
### Use Cases

**Flatten Operation**

```typescript
// Flatten layer (zero overhead)
const flat = convOutput.reshape([batch, -1]);
```

**Shape Adaptation**

```typescript
// Adapt tensor for different layer expectations
const adapted = tensor.reshape([newBatch, newChannels, newH, newW]);
```

**Batch Manipulation**

```typescript
// Reshape batch dimensions
const batched = tensor.reshape([batchSize, -1, channels]);
```
### Limitations

- View tensors cannot be resized beyond the original buffer
- Views share their lifecycle with the parent (destroying the parent invalidates the view; see the sketch below)
- Layout conversions require actual data movement
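
A sketch of the lifecycle pitfall (this assumes tensors expose a `destroy()` method; the exact call may differ):

```typescript
const parent = new Tensor(context, [1, 3, 224, 224]);
const view = parent.reshape([1, 150528]); // shares parent's GPU buffer

parent.destroy(); // the shared buffer is gone...
// `view` now points at a destroyed buffer; any further use of it is invalid
```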
## Im2Col Algorithm

### Concept

Im2Col transforms convolution into matrix multiplication (GEMM), which can be highly optimized.

### Transformation

```
Input:  [N, C, H, W]
   ↓ Im2Col
Matrix: [N*outH*outW, C*kH*kW]

Weight: [K, C, kH, kW]
   ↓ Reshape
Matrix: [K, C*kH*kW]

Output = GEMM(Weight_matrix, Im2Col(Input))
       = [K, N*outH*outW]
   ↓ Reshape
Output: [N, K, outH, outW]
```
### Implementation Utilities

```typescript
import { im2col } from 'tiny-dl-inference';

// Convert image to column format
const col = im2col(input, {
  kernelHeight: 3,
  kernelWidth: 3,
  stride: 1,
  padding: 1
});
```
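
As a worked example of the shapes involved (plain arithmetic, not library code), the 3×3 / stride 1 / padding 1 configuration above on a [1, 3, 224, 224] input yields:

```typescript
const [N, C, H, W] = [1, 3, 224, 224];
const kH = 3, kW = 3, stride = 1, pad = 1;

// Standard convolution output-size formula
const outH = Math.floor((H + 2 * pad - kH) / stride) + 1; // 224
const outW = Math.floor((W + 2 * pad - kW) / stride) + 1; // 224

// Column matrix: [N*outH*outW, C*kH*kW] = [50176, 27]
console.log([N * outH * outW, C * kH * kW]);

// Memory cost: each input element is replicated once per kernel position,
// so the column buffer is roughly kH*kW = 9× the input size here.
```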
### When to Use
| Scenario | Recommendation |
|---|---|
| Large kernels (5×5, 7×7) | Consider Im2Col |
| Custom GEMM implementation | Im2Col required |
| Standard 3×3 convolutions | Direct convolution (current) |
| Grouped convolutions | Specialized algorithm |
### Trade-offs
| Factor | Direct Conv | Im2Col |
|---|---|---|
| Memory | Lower | Higher (column buffer) |
| Small kernels | Faster | Overhead |
| Large kernels | Slower | Faster |
| Implementation | Complex | Simpler |
## Deferred Resource Destruction

### Problem

Premature buffer destruction causes crashes:

```typescript
// ❌ Dangerous: buffer destroyed before the GPU uses it
commandEncoder.copyBufferToBuffer(buffer, ...);
device.queue.submit([commandEncoder.finish()]);
buffer.destroy(); // May crash!
```
### Solution

`deferDestroy()` schedules destruction after GPU work completes:

```typescript
// ✅ Safe: destruction deferred until GPU work completes
commandEncoder.copyBufferToBuffer(buffer, ...);
device.queue.submit([commandEncoder.finish()]);
context.deferDestroy(buffer); // Safe!
```
### Implementation

```typescript
class GPUContext {
  private pendingCleanup: Set<Promise<void>> = new Set();

  deferDestroy(buffer: GPUBuffer | null): void {
    if (!buffer) return;
    const cleanup = this.waitForSubmittedWork()
      .then(() => buffer.destroy())
      .catch(() => { /* ignore */ });
    this.pendingCleanup.add(cleanup);
    cleanup.finally(() => this.pendingCleanup.delete(cleanup));
  }

  async sync(): Promise<void> {
    await this.waitForSubmittedWork();
    await Promise.allSettled([...this.pendingCleanup]);
    this.pendingCleanup.clear();
  }
}
```
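
The implementation above relies on `waitForSubmittedWork()`, which is not shown. Assuming the context holds a `GPUDevice`, a minimal sketch could wrap the standard WebGPU `onSubmittedWorkDone()` promise:

```typescript
// Sketch only: how waitForSubmittedWork() might be implemented.
class GPUContextBase {
  constructor(private device: GPUDevice) {}

  // Resolves once all work submitted to the queue so far has completed.
  protected waitForSubmittedWork(): Promise<void> {
    return this.device.queue.onSubmittedWorkDone();
  }
}
```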
## Benchmarking

### Built-in Benchmark Utility

```typescript
import { Benchmark } from 'tiny-dl-inference';

const benchmark = new Benchmark();
const result = await benchmark.measureOperator(
  operator, // Operator instance
  inputs,   // Input tensors
  params,   // Operator parameters
  100       // Number of iterations
);

console.log({
  mean: result.meanMs,     // Average execution time
  stdDev: result.stdDevMs, // Standard deviation
  min: result.minMs,       // Minimum time
  max: result.maxMs        // Maximum time
});
```
### Interpreting Results
| Metric | Healthy Range | Action if High |
|---|---|---|
| Mean | Consistent | Baseline |
| StdDev | < 10% of mean | Check for variations |
| Min | Close to mean | Good |
| Max | < 2× mean | Check for outliers |
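
As a rough automated check against the table above (a sketch using the `result` fields from the snippet earlier; the thresholds are the table's, not hard limits):

```typescript
// Flag results that fall outside the healthy ranges above
if (result.stdDevMs > 0.1 * result.meanMs) {
  console.warn('High variance (>10% of mean): check for background GPU load or throttling');
}
if (result.maxMs > 2 * result.meanMs) {
  console.warn('Outliers (max > 2× mean): consider more warmup iterations');
}
```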
### Profiling Tips

**Warmup:** Run a few iterations before measuring:

```typescript
for (let i = 0; i < 5; i++) {
  await operator.forward(inputs, params);
}
// Now measure
```

**Isolate Variables:** Test one operator at a time.

**Multiple Runs:** Average across multiple benchmark sessions (see the sketch after this list).

**Vary Input Sizes:** Test with realistic data sizes.
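
For the multiple-runs tip, a sketch that reuses the Benchmark utility shown above:

```typescript
// Average the mean latency across several independent sessions
const sessions = 3;
let sum = 0;
for (let s = 0; s < sessions; s++) {
  const r = await benchmark.measureOperator(operator, inputs, params, 100);
  sum += r.meanMs;
}
console.log(`Mean over ${sessions} sessions: ${(sum / sessions).toFixed(2)} ms`);
```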
## Performance Checklist

### Before Deployment

- [ ] Use fused operators where applicable
- [ ] Minimize layout conversions
- [ ] Profile with realistic inputs
- [ ] Test on target hardware
- [ ] Verify memory cleanup
- [ ] Benchmark against baselines

### Runtime Optimizations

- [ ] Reuse tensors when possible
- [ ] Batch multiple inferences
- [ ] Use `deferDestroy()` for temporaries
- [ ] Avoid unnecessary downloads
- [ ] Pre-allocate buffers
### Code Patterns

```typescript
// ✅ Good: Pre-allocate output once, reuse across iterations
const output = new Tensor(context, outputShape);
for (const input of inputs) {
  await operator.forward([input, output], params);
}

// ❌ Bad: Allocate in loop
for (const input of inputs) {
  const output = new Tensor(context, outputShape); // Slow!
  await operator.forward([input, output], params);
}
```
## Hardware-Specific Tuning

### Workgroup Sizes

The default workgroup size is 256. The optimal size varies by GPU:
| GPU Architecture | Optimal Workgroup Size |
|---|---|
| NVIDIA (Compute 7.0+) | 256 or 512 |
| AMD RDNA | 128 or 256 |
| Intel Xe | 256 |
| Apple M1/M2 | 256 or 512 |
### Testing Workgroup Performance

```typescript
// Benchmark different workgroup sizes
for (const wgSize of [128, 256, 512]) {
  // Modify shader to use wgSize
  const result = await benchmarkWithWorkgroupSize(wgSize);
  console.log(`Workgroup ${wgSize}: ${result.meanMs}ms`);
}
```
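
The `// Modify shader to use wgSize` step above is not spelled out. Since WGSL fixes the workgroup size when the shader module is created, one common approach (sketched here; `makeShaderSource` is a hypothetical helper) is string templating:

```typescript
// Sketch: generate WGSL with the workgroup size baked in at creation time
function makeShaderSource(wgSize: number): string {
  return `
    @compute @workgroup_size(${wgSize})
    fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
      // ... kernel body ...
    }
  `;
}

// Each candidate size then needs its own shader module and pipeline:
// device.createShaderModule({ code: makeShaderSource(wgSize) })
```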
## Optimization Summary
| Technique | Impact | Effort |
|---|---|---|
| Kernel Fusion | High | Low (use fused ops) |
| Zero-copy Views | High | Low (reshape vs copy) |
| Layout Consistency | Medium | Medium |
| Deferred Cleanup | High | Low (automatic) |
| Buffer Reuse | Medium | Medium |
| Workgroup Tuning | Medium-High | High |
**Priority Order:**

1. Use kernel fusion
2. Leverage zero-copy views
3. Keep consistent layouts
4. Profile and tune workgroups