# Operators

Complete guide to all neural network operators in Tiny-DL-Inference.
## Operator Overview
| Operator | Description | WGSL Implementation |
|---|---|---|
| ReLU | Rectified Linear Unit | Element-wise max(0, x) |
| Softmax | Normalized exponential | Numerically stable softmax |
| MaxPool | 2D max pooling | Sliding window maximum |
| Conv2d | 2D convolution | Direct convolution algorithm |
| Conv2dBiasReLU | Fused Conv+Bias+ReLU | Single-kernel optimization |
| Flatten | Tensor reshaping | Zero-copy view operation |
| Dense | Fully connected | Matrix multiplication |
## ReLUOperator

Rectified Linear Unit activation function.

### Description

Sets all negative values to zero, keeping positive values unchanged.

### Formula

f(x) = max(0, x)

### Usage
```typescript
import { ReLUOperator } from 'tiny-dl-inference';

const relu = new ReLUOperator(context);
const output = await relu.forward([input]);
```
### WGSL Implementation
```wgsl
@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
  let idx = global_id.x;
  if (idx >= numElements) { return; }
  let x = input[idx];
  output[idx] = select(0.0, x, x > 0.0);
}
```
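When testing, it can help to compare GPU output against a CPU baseline. A minimal reference in plain TypeScript (a sketch, not part of the library API):

```typescript
// CPU reference for ReLU: element-wise max(0, x).
// Useful for validating GPU output in tests.
function reluReference(input: Float32Array): Float32Array {
  const out = new Float32Array(input.length);
  for (let i = 0; i < input.length; i++) {
    out[i] = Math.max(0, input[i]);
  }
  return out;
}
```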
### Performance
- Memory: 1 read + 1 write per element
- Compute: 1 comparison per element
- Optimal for: All element sizes
## SoftmaxOperator

Converts logits to a probability distribution.

### Description

Applies exponential normalization along a specified axis.

### Formula
softmax(x_i) = exp(x_i - max(x)) / sum(exp(x_j - max(x)))

Subtracting max(x) ensures numerical stability: the largest exponent becomes exp(0) = 1, so exp() cannot overflow.
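To see why, a small TypeScript illustration (values chosen only to force overflow):

```typescript
// Without the max subtraction, exp() overflows for large logits
const logits = [1000, 1001, 1002];
console.log(logits.map(Math.exp)); // [Infinity, Infinity, Infinity]

// Subtracting max(x) shifts the largest logit to 0; the ratios are unchanged
const m = Math.max(...logits);
const exps = logits.map((x) => Math.exp(x - m));
const sum = exps.reduce((a, b) => a + b, 0);
console.log(exps.map((e) => e / sum)); // ≈ [0.090, 0.245, 0.665]
```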
### Parameters

| Name | Type | Default | Description |
|---|---|---|---|
| axis | number | -1 (last) | Axis along which to compute softmax |
### Usage
```typescript
import { SoftmaxOperator } from 'tiny-dl-inference';

const softmax = new SoftmaxOperator(context);

// Softmax along the last dimension
const probs = await softmax.forward([logits], { axis: -1 });

// Softmax along a specific dimension
const probsAxis1 = await softmax.forward([logits], { axis: 1 });
```
### WGSL Implementation
```wgsl
// Three-pass algorithm for numerical stability
// Pass 1: find the maximum value
var maxVal = -3.402823e+38f;
for (var i = 0u; i < size; i = i + 1u) {
  maxVal = max(maxVal, input[i]);
}

// Pass 2: compute exp(x - max) and the running sum
var sum = 0.0;
for (var i = 0u; i < size; i = i + 1u) {
  let expVal = exp(input[i] - maxVal);
  output[i] = expVal;
  sum = sum + expVal;
}

// Pass 3: normalize by the sum
for (var i = 0u; i < size; i = i + 1u) {
  output[i] = output[i] / sum;
}
```
### Output Properties
- All values in range [0, 1]
- Sum of values equals 1.0
- Preserves relative ordering
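These properties make a convenient sanity check in tests. A sketch in plain TypeScript, assuming the GPU result has been copied back to a Float32Array (the read-back step is outside this snippet):

```typescript
// Verify softmax output ranges and normalization on the host
function checkSoftmax(probs: Float32Array): void {
  let sum = 0;
  for (const p of probs) {
    if (p < 0 || p > 1) throw new Error(`value ${p} outside [0, 1]`);
    sum += p;
  }
  // Allow a small tolerance for floating-point error
  if (Math.abs(sum - 1) > 1e-5) throw new Error(`probabilities sum to ${sum}`);
}
```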
## MaxPoolOperator

2D max pooling layer for downsampling.

### Description

Partitions the input into pooling regions and outputs the maximum value of each region.

### Parameters

| Name | Type | Default | Description |
|---|---|---|---|
| poolSize | number | 2 | Size of the pooling window (square) |
| stride | number | poolSize | Step size between pools |
### Output Shape

For input [N, C, H, W]:

```text
output_height = floor((H - poolSize) / stride) + 1
output_width  = floor((W - poolSize) / stride) + 1
output_shape  = [N, C, output_height, output_width]
```
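The same formula as a host-side helper (illustrative, not part of the library):

```typescript
// Output shape for max pooling, matching the formula above
function maxPoolOutputShape(
  [n, c, h, w]: [number, number, number, number],
  poolSize: number,
  stride: number = poolSize
): [number, number, number, number] {
  const outH = Math.floor((h - poolSize) / stride) + 1;
  const outW = Math.floor((w - poolSize) / stride) + 1;
  return [n, c, outH, outW];
}

maxPoolOutputShape([1, 16, 28, 28], 2, 2); // → [1, 16, 14, 14]
```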
### Usage
```typescript
import { MaxPoolOperator } from 'tiny-dl-inference';

const maxpool = new MaxPoolOperator(context);

// 2×2 pooling with stride 2
const halved = await maxpool.forward([input], { poolSize: 2, stride: 2 });

// 3×3 pooling with stride 1
const overlapped = await maxpool.forward([input], { poolSize: 3, stride: 1 });
```
### WGSL Implementation
```wgsl
for (var ph = 0u; ph < outH; ph = ph + 1u) {
  for (var pw = 0u; pw < outW; pw = pw + 1u) {
    var maxVal = -3.402823e+38f;
    for (var kh = 0u; kh < poolSize; kh = kh + 1u) {
      for (var kw = 0u; kw < poolSize; kw = kw + 1u) {
        let h = ph * stride + kh;
        let w = pw * stride + kw;
        let idx = ((n * C + c) * H + h) * W + w;
        maxVal = max(maxVal, input[idx]);
      }
    }
    let outIdx = ((n * C + c) * outH + ph) * outW + pw;
    output[outIdx] = maxVal;
  }
}
```
### Performance
- Memory: Sequential access pattern
- Compute: O(poolSize²) comparisons per output
- Optimal for: small pool sizes such as 2 or 3
## Conv2dOperator

2D convolution layer for feature extraction.

### Description

Applies sliding dot products between kernels and input regions.

### Parameters

| Name | Type | Default | Description |
|---|---|---|---|
| channels | number | required | Number of output channels (K) |
| kernelSize | number | 3 | Size of the convolution kernel (square) |
| stride | number | 1 | Step size between convolutions |
| padding | number | 0 | Zero-padding size |
### Input/Output Shapes

```text
Input:           [N, C, H, W]
Weight:          [K, C, kH, kW]
Bias (optional): [K]
Output:          [N, K, outH, outW]

where:
outH = floor((H + 2*padding - kH) / stride) + 1
outW = floor((W + 2*padding - kW) / stride) + 1
```
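The same calculation as a host-side helper, with a "same"-padding example (illustrative, not part of the library):

```typescript
// Convolution output shape, matching the formula above
function conv2dOutputShape(
  [n, _c, h, w]: [number, number, number, number],
  k: number, // output channels
  kernelSize: number,
  stride: number = 1,
  padding: number = 0
): [number, number, number, number] {
  const outH = Math.floor((h + 2 * padding - kernelSize) / stride) + 1;
  const outW = Math.floor((w + 2 * padding - kernelSize) / stride) + 1;
  return [n, k, outH, outW];
}

// A 3×3 kernel with stride 1 and padding 1 preserves the spatial size:
conv2dOutputShape([1, 3, 28, 28], 32, 3, 1, 1); // → [1, 32, 28, 28]
```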
### Usage
```typescript
import { Conv2dOperator } from 'tiny-dl-inference';

const conv2d = new Conv2dOperator(context);
const output = await conv2d.forward([input, weight, bias], {
  channels: 32,
  kernelSize: 3,
  stride: 1,
  padding: 1
});
```
### WGSL Implementation
```wgsl
// For each output position
for (var oh = 0u; oh < outH; oh = oh + 1u) {
  for (var ow = 0u; ow < outW; ow = ow + 1u) {
    var sum = 0.0;
    // Convolve over all input channels and kernel positions
    for (var ic = 0u; ic < C; ic = ic + 1u) {
      for (var kh = 0u; kh < kH; kh = kh + 1u) {
        for (var kw = 0u; kw < kW; kw = kw + 1u) {
          // Use signed arithmetic: with padding > 0 these indices can go
          // negative, which would wrap around in u32
          let ih = i32(oh * stride + kh) - i32(padding);
          let iw = i32(ow * stride + kw) - i32(padding);
          // Skip positions that fall in the zero padding
          if (ih >= 0 && ih < i32(H) && iw >= 0 && iw < i32(W)) {
            let inputVal = input[((n * C + ic) * H + u32(ih)) * W + u32(iw)];
            let weightVal = weight[((oc * C + ic) * kH + kh) * kW + kw];
            sum = sum + inputVal * weightVal;
          }
        }
      }
    }
    // Add bias
    sum = sum + bias[oc];
    let outIdx = ((n * K + oc) * outH + oh) * outW + ow;
    output[outIdx] = sum;
  }
}
```
### Performance Characteristics

| Kernel Size | Relative Speed | Use Cases |
|---|---|---|
| 1×1 | Fast | Point-wise transformations |
| 3×3 | Standard | Most convolutional layers |
| 5×5 | Slower | Larger receptive fields |
| 7×7 | Slowest | Initial layers |
## Conv2dBiasReLUOperator

Fused Conv2d + Bias + ReLU for optimal performance.

### Description

Combines three operations into a single kernel, eliminating intermediate memory traffic.

### Formula

```text
output = ReLU(Conv2d(input, weight) + bias)
       = max(0, Conv2d(input, weight) + bias)
```
### Performance Improvement

| Metric | Separate Ops | Fused | Improvement |
|---|---|---|---|
| Memory Operations | 6 | 2 | 3× reduction |
| Kernel Launches | 3 | 1 | 3× reduction |
| Memory Traffic | High | Low | Significant |

The memory-operation count treats each op as one read and one write of its activation tensor: three separate kernels touch memory six times, while the fused kernel reads the input once and writes the result once.
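For a sense of what fusion saves, compare the unfused pipeline built from the operators documented above with the fused call. This is a sketch: variable names are illustrative, the operator instances are assumed to be constructed as in the earlier Usage sections, and since Conv2dOperator already applies the bias, the unfused path here is two launches rather than the three counted in the table:

```typescript
const opts = { channels: 32, kernelSize: 3, stride: 1, padding: 1 };

// Unfused: two dispatches plus an intermediate tensor held in GPU memory
const convOut = await conv2d.forward([input, weight, bias], opts);
const activated = await relu.forward([convOut]);

// Fused: one dispatch, no intermediate tensor
const fusedOut = await fused.forward([input, weight, bias], opts);
```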
### Usage
```typescript
import { Conv2dBiasReLUOperator } from 'tiny-dl-inference';

const fused = new Conv2dBiasReLUOperator(context);
const output = await fused.forward([input, weight, bias], {
  channels: 32,
  kernelSize: 3,
  stride: 1,
  padding: 1
});
```
### WGSL Implementation

The key difference from the separate Conv2d kernel:

```wgsl
// ... convolution computation ...
var sum = 0.0;
// [convolution loops]

// Add bias and apply ReLU in the same kernel
sum = sum + bias[oc];
sum = select(0.0, sum, sum > 0.0); // ReLU
output[outIdx] = sum;
```
### When to Use

✅ Use the fused operator for:

- Conv→Bias→ReLU sequences
- Production inference
- Memory-constrained environments

❌ Don't use it when:

- You need the intermediate Conv2d output
- Debugging individual operations
- Exploring different activation functions
## FlattenOperator

Reshapes a tensor while preserving the batch dimension.

### Description

Flattens all dimensions except the first (batch) dimension.

### Transformation

```text
Input:  [N, C, H, W]
Output: [N, C*H*W]

Example:
[1, 3, 224, 224] → [1, 150528]
```
### Implementation Detail

Uses a zero-copy view, so no GPU memory is moved:

```typescript
// Internally uses reshape()
const flat = input.reshape([N, C * H * W]);
```
### Usage
```typescript
import { FlattenOperator } from 'tiny-dl-inference';

const flatten = new FlattenOperator(context);

// Flatten conv output for a dense layer
const output = await flatten.forward([convOutput]);
// Shape: [batch, channels*height*width]
```
### Performance
- Memory: Zero additional allocation
- Compute: No GPU computation
- Time: O(1) - instant
## DenseOperator

Fully connected (dense) layer.

### Description

Matrix multiplication with learned weights and an optional bias.

### Formula

```text
output = input @ weight.T + bias

where:
- input:  [N, in_features]
- weight: [out_features, in_features]
- bias:   [out_features]
- output: [N, out_features]
```
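A CPU reference of the same computation, handy for testing (plain TypeScript, not part of the library API):

```typescript
// y[n][o] = Σ_i x[n][i] * weight[o][i] + bias[o]
function denseReference(
  x: Float32Array,      // [N, inFeatures], row-major
  weight: Float32Array, // [outFeatures, inFeatures], row-major
  bias: Float32Array,   // [outFeatures]
  n: number,
  inFeatures: number,
  outFeatures: number
): Float32Array {
  const y = new Float32Array(n * outFeatures);
  for (let row = 0; row < n; row++) {
    for (let o = 0; o < outFeatures; o++) {
      let sum = bias[o];
      for (let i = 0; i < inFeatures; i++) {
        sum += x[row * inFeatures + i] * weight[o * inFeatures + i];
      }
      y[row * outFeatures + o] = sum;
    }
  }
  return y;
}
```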
### Parameters

| Name | Type | Default | Description |
|---|---|---|---|
| units | number | required | Number of output features |
### Usage
```typescript
import { DenseOperator } from 'tiny-dl-inference';

const dense = new DenseOperator(context);
const output = await dense.forward([input, weight, bias], {
  units: 128
});
```
### WGSL Implementation
```wgsl
for (var n = 0u; n < N; n = n + 1u) {
  for (var outIdx = 0u; outIdx < outFeatures; outIdx = outIdx + 1u) {
    var sum = 0.0;
    // Dot product of the input row and a weight row
    for (var inIdx = 0u; inIdx < inFeatures; inIdx = inIdx + 1u) {
      let inputVal = input[n * inFeatures + inIdx];
      let weightVal = weight[outIdx * inFeatures + inIdx];
      sum = sum + inputVal * weightVal;
    }
    // Add bias and store
    output[n * outFeatures + outIdx] = sum + bias[outIdx];
  }
}
```
### Performance Tips
- Large weight matrices benefit from workgroup-level tiling
- Consider quantization for production inference
- Batch multiple inputs for better GPU utilization
## Operator Comparison
| Operator | FLOPs | Memory | Typical Use |
|---|---|---|---|
| ReLU | O(N) | O(N) | Activation |
| Softmax | O(N) | O(N) | Final classification |
| MaxPool | O(N×k²) | O(N) | Downsampling |
| Conv2d | O(N×C×K×k²) | O(N×K) | Feature extraction |
| Conv2dBiasReLU | O(N×C×K×k²) | O(N×K) | Conv layers |
| Flatten | O(1) | O(1) | Shape transformation |
| Dense | O(N×I×O) | O(N×O) | Classification |
Legend: N = batch, C = channels, K = output channels, k = kernel size, I = input features, O = output features
## Custom Operators

### Template for a Custom Operator
```typescript
import { Operator, OperatorParams, Tensor, TensorShape } from 'tiny-dl-inference';

class CustomOperator extends Operator {
  protected compileShader(): string {
    return `
      @group(0) @binding(0) var<storage, read_write> output: array<f32>;
      @group(0) @binding(1) var<storage, read> input: array<f32>;

      @compute @workgroup_size(256)
      fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
        let idx = global_id.x;
        if (idx >= arrayLength(&input)) { return; }
        // Your computation here
        output[idx] = input[idx] * 2.0;
      }
    `;
  }

  computeOutputShape(inputShape: TensorShape): TensorShape {
    // Return the output shape
    return inputShape;
  }

  async forward(inputs: Tensor[], params?: OperatorParams): Promise<Tensor> {
    const input = inputs[0];
    const outputShape = this.computeOutputShape(input.shape);
    const output = new Tensor(this.context, outputShape, { layout: input.layout });

    this.ensureInitialized();

    const encoder = this.context.createCommandEncoder();
    const pass = encoder.beginComputePass();
    pass.setPipeline(this.pipeline!);

    const bindGroup = this.context.createBindGroup({
      layout: this.bindGroupLayout!,
      entries: [
        { binding: 0, resource: { buffer: output.buffer } },
        { binding: 1, resource: { buffer: input.buffer } }
      ]
    });

    pass.setBindGroup(0, bindGroup);
    pass.dispatchWorkgroups(Math.ceil(output.size / 256));
    pass.end();

    this.context.submit([encoder.finish()]);
    return output;
  }
}
```
### Best Practices
- Use appropriate workgroup sizes (256 is a good default)
- Check bounds to avoid out-of-bounds access
- Minimize divergence within workgroups
- Reuse buffers when possible
- Profile your custom operators