# Kernel Fusion

Understanding and using kernel fusion for performance optimization.

## What is Kernel Fusion?

Kernel fusion combines multiple operations into a single GPU kernel, reducing memory traffic and improving performance.
## Why Fusion Matters

### Without Fusion

```
Conv2d → Write → Read → Bias → Write → Read → ReLU → Write
```

6 memory operations: the initial input read, a write and read around each of the two intermediate results, and the final output write.

### With Fusion

```
Conv2d+Bias+ReLU → Write
```

2 memory operations (read input, write output).

Result: 3× memory bandwidth reduction!
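The bandwidth claim above can be checked with quick arithmetic. Below is a hypothetical back-of-envelope estimate (float32 tensors; weight and bias traffic ignored for simplicity) rather than a measurement of the actual library:

```typescript
// Back-of-envelope memory-traffic estimate for a float32 tensor
// with `elems` elements. Weight/bias traffic is omitted.
const BYTES_PER_FLOAT32 = 4;

// Sequential Conv2d → Bias → ReLU: each of the three ops reads its
// input and writes its output (3 reads + 3 writes).
function sequentialTraffic(elems: number): number {
  return 6 * elems * BYTES_PER_FLOAT32;
}

// Fused Conv2d+Bias+ReLU: one input read, one output write.
function fusedTraffic(elems: number): number {
  return 2 * elems * BYTES_PER_FLOAT32;
}

const elems = 64 * 56 * 56; // e.g. a 64×56×56 feature map
console.log(sequentialTraffic(elems) / fusedTraffic(elems)); // → 3
```

The ratio is exactly the 3× reduction described above, independent of tensor size, because fusion removes two of the three read/write pairs.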
## Available Fused Operators

### Conv2dBiasReLU

Combines Conv2d, Bias addition, and ReLU activation:
```typescript
import { Conv2dBiasReLU } from 'tiny-dl-inference';

const fusedOp = new Conv2dBiasReLU(context);
const output = await fusedOp.forward(
  [input, weights, bias],
  {
    stride: [1, 1],
    padding: [1, 1]
  }
);
```
### Equivalence
```typescript
// Fused operation
const fusedOutput = await fusedOp.forward([input, weights, bias], params);

// Sequential operations (equivalent)
const conv = await conv2d.forward([input, weights], params);
const biased = await bias.forward([conv, biasTensor], params);
const activated = await relu.forward([biased], params);

// fusedOutput ≈ activated
```
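The `≈` above reflects floating-point behavior: the fused kernel may evaluate operations in a different order, so the two results match only within a small tolerance. A minimal sketch of such a check, assuming both outputs can be read back as `Float32Array`s (the read-back API itself is not shown here):

```typescript
// Element-wise approximate equality for two read-back tensors.
// `atol` is the absolute tolerance per element.
function allClose(a: Float32Array, b: Float32Array, atol = 1e-5): boolean {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (Math.abs(a[i] - b[i]) > atol) return false;
  }
  return true;
}
```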
## When to Use Fusion

### Benefits

| Scenario | Improvement |
|---|---|
| Memory-bound models | 2-3× speedup |
| Sequential Conv+Bias+ReLU chains | Eliminates two intermediate write/read round trips |
| Large feature maps | Greatest benefit, since the intermediate buffers are largest |
### When NOT to Fuse

- Operations with different data dependencies
- When intermediate results are needed elsewhere
- When operators don't naturally sequence
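The last two points can be made concrete with a toy eligibility check. This is a hypothetical sketch over an invented `FusionNode` shape, not the library's actual graph representation: an intermediate result consumed by more than one downstream op must be materialized in memory, so the chain cannot be fused.

```typescript
// Minimal op-graph node: the op name and how many downstream
// consumers read its output.
interface FusionNode {
  op: string;
  consumers: number;
}

// A Conv2d → Bias → ReLU chain is fusable only when every
// intermediate result feeds exactly one consumer.
function canFuseChain(chain: FusionNode[]): boolean {
  const pattern = ['Conv2d', 'Bias', 'ReLU'];
  if (chain.length !== pattern.length) return false;
  if (!chain.every((n, i) => n.op === pattern[i])) return false;
  // All but the last node must have a single consumer; otherwise the
  // intermediate result is needed elsewhere and must be written out.
  return chain.slice(0, -1).every(n => n.consumers === 1);
}
```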
## Performance Comparison
```typescript
import { Benchmark } from 'tiny-dl-inference';

const bench = new Benchmark(context);

// Measure fused
const fusedTime = await bench.measureOperation(async () => {
  return await fusedOp.forward([input, weights, bias], params);
});

// Measure sequential
const seqTime = await bench.measureOperation(async () => {
  const c = await conv2d.forward([input, weights], params);
  const b = await bias.forward([c, biasTensor], params);
  return await relu.forward([b], params);
});

console.log(`Fused: ${fusedTime}ms, Sequential: ${seqTime}ms`);
console.log(`Speedup: ${seqTime / fusedTime}×`);
```
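If the `Benchmark` utility is unavailable, a timing helper in the spirit of `measureOperation` can be sketched with `performance.now()`. This is a simplified stand-in, not the library's implementation; note the warm-up run, since a first GPU invocation often includes shader compilation:

```typescript
// Hypothetical timing helper: run the async op once to warm up,
// then average the wall-clock time over `runs` iterations.
async function measure(
  fn: () => Promise<unknown>,
  runs = 10
): Promise<number> {
  await fn(); // warm-up (excludes one-time compilation cost)
  const start = performance.now();
  for (let i = 0; i < runs; i++) {
    await fn();
  }
  return (performance.now() - start) / runs; // mean ms per run
}
```

Averaging over several runs smooths out scheduling noise; for GPU work a real harness would also wait for queue completion before stopping the clock.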
## Implementation Details

### Shader Structure
```wgsl
@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    // Conv2d computation
    let convResult = computeConv2d(id);

    // Bias addition
    let biasedResult = convResult + bias[channel];

    // ReLU activation
    let output = max(0.0, biasedResult);

    // Single write to memory
    outputBuffer[id.x] = output;
}
```
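The shader's epilogue (bias add plus ReLU, one write per element) can be mirrored on the CPU as a reference for testing. This is a hypothetical sketch; the channel indexing assumes a contiguous per-channel layout, which may not match the library's actual tensor format:

```typescript
// CPU reference for the fused epilogue: given per-element conv results,
// add the per-channel bias and apply ReLU in a single pass, writing
// each output element exactly once (as the shader does).
function fusedEpilogue(
  conv: Float32Array,
  bias: Float32Array,
  channels: number
): Float32Array {
  const out = new Float32Array(conv.length);
  const perChannel = conv.length / channels; // assumes contiguous channels
  for (let i = 0; i < conv.length; i++) {
    const c = Math.floor(i / perChannel); // channel of element i
    out[i] = Math.max(0, conv[i] + bias[c]);
  }
  return out;
}
```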
### Memory Savings

| Metric | Sequential | Fused |
|---|---|---|
| Memory reads | 3 | 1 |
| Memory writes | 3 | 1 |
| Intermediate buffers | 2 | 0 |
## Best Practices

- Use fused operators when available - they are almost always faster for memory-bound workloads
- Structure models for fusion - place Conv2d → Bias → ReLU together so the fused path can apply
- Profile your model - measure the actual speedup on your target hardware
## API Reference

See the Operators API Reference for fused operator documentation.