# Operators

Complete guide to all neural network operators in Tiny-DL-Inference.
## Operator Overview
| Operator | Description | WGSL Implementation |
|---|---|---|
| ReLU | Rectified Linear Unit | Element-wise max(0, x) |
| Softmax | Normalized exponential | Numerically stable softmax |
| MaxPool | 2D max pooling | Sliding window maximum |
| Conv2d | 2D convolution | Direct convolution algorithm |
| Conv2dBiasReLU | Fused Conv+Bias+ReLU | Single-kernel optimization |
| Flatten | Tensor reshaping | Zero-copy view operation |
| Dense | Fully connected | Matrix multiplication |
## ReLUOperator

Rectified Linear Unit activation function.

### Description

Sets all negative values to zero, keeping positive values unchanged.

### Formula

f(x) = max(0, x)

### Usage
```typescript
import { ReLUOperator } from 'tiny-dl-inference';

const relu = new ReLUOperator(context);
const output = await relu.forward([input]);
```
### WGSL Implementation
```wgsl
@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
  let idx = global_id.x;
  if (idx >= numElements) { return; }
  let x = input[idx];
  output[idx] = select(0.0, x, x > 0.0);
}
```
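When testing, it can help to compare GPU output against a CPU baseline. A minimal reference in plain TypeScript (a sketch, not part of the library API):

```typescript
// CPU reference for ReLU: element-wise max(0, x).
// Useful for validating GPU output in tests.
function reluReference(input: Float32Array): Float32Array {
  const out = new Float32Array(input.length);
  for (let i = 0; i < input.length; i++) {
    out[i] = Math.max(0, input[i]);
  }
  return out;
}
```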
### Performance
- Memory: 1 read + 1 write per element
- Compute: 1 comparison per element
- Optimal for: All element sizes
## SoftmaxOperator

Converts logits to a probability distribution.

### Description

Applies exponential normalization along a specified axis.

### Formula
softmax(x_i) = exp(x_i - max(x)) / sum(exp(x_j - max(x)))

Subtracting max(x) ensures numerical stability: the largest exponent becomes exp(0) = 1, so exp() cannot overflow.
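To see why, a small TypeScript illustration (values chosen only to force overflow):

```typescript
// Without the max subtraction, exp() overflows for large logits
const logits = [1000, 1001, 1002];
console.log(logits.map(Math.exp)); // [Infinity, Infinity, Infinity]

// Subtracting max(x) shifts the largest logit to 0; the ratios are unchanged
const m = Math.max(...logits);
const exps = logits.map((x) => Math.exp(x - m));
const sum = exps.reduce((a, b) => a + b, 0);
console.log(exps.map((e) => e / sum)); // ≈ [0.090, 0.245, 0.665]
```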
### Parameters

| Name | Type | Default | Description |
|---|---|---|---|
| axis | number | -1 (last) | Axis along which to compute softmax |
### Usage
```typescript
import { SoftmaxOperator } from 'tiny-dl-inference';

const softmax = new SoftmaxOperator(context);

// Softmax along the last dimension
const probs = await softmax.forward([logits], { axis: -1 });

// Softmax along a specific dimension
const probsAxis1 = await softmax.forward([logits], { axis: 1 });
```
### WGSL Implementation
```wgsl
// Three-pass algorithm for numerical stability
// Pass 1: find the maximum value
var maxVal = -3.402823e+38f;
for (var i = 0u; i < size; i = i + 1u) {
  maxVal = max(maxVal, input[i]);
}

// Pass 2: compute exp(x - max) and the running sum
var sum = 0.0;
for (var i = 0u; i < size; i = i + 1u) {
  let expVal = exp(input[i] - maxVal);
  output[i] = expVal;
  sum = sum + expVal;
}

// Pass 3: normalize by the sum
for (var i = 0u; i < size; i = i + 1u) {
  output[i] = output[i] / sum;
}
```
### Output Properties
- All values in range [0, 1]
- Sum of values equals 1.0
- Preserves relative ordering
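These properties make a convenient sanity check in tests. A sketch in plain TypeScript, assuming the GPU result has been copied back to a Float32Array (the read-back step is outside this snippet):

```typescript
// Verify softmax output ranges and normalization on the host
function checkSoftmax(probs: Float32Array): void {
  let sum = 0;
  for (const p of probs) {
    if (p < 0 || p > 1) throw new Error(`value ${p} outside [0, 1]`);
    sum += p;
  }
  // Allow a small tolerance for floating-point error
  if (Math.abs(sum - 1) > 1e-5) throw new Error(`probabilities sum to ${sum}`);
}
```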
## MaxPoolOperator

2D max pooling layer for downsampling.

### Description

Partitions the input into pooling regions and outputs the maximum value of each region.

### Parameters

| Name | Type | Default | Description |
|---|---|---|---|
| poolSize | number | 2 | Size of the pooling window (square) |
| stride | number | poolSize | Step size between pools |
### Output Shape

For input [N, C, H, W]:

```text
output_height = floor((H - poolSize) / stride) + 1
output_width  = floor((W - poolSize) / stride) + 1
output_shape  = [N, C, output_height, output_width]
```
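The same formula as a host-side helper (illustrative, not part of the library):

```typescript
// Output shape for max pooling, matching the formula above
function maxPoolOutputShape(
  [n, c, h, w]: [number, number, number, number],
  poolSize: number,
  stride: number = poolSize
): [number, number, number, number] {
  const outH = Math.floor((h - poolSize) / stride) + 1;
  const outW = Math.floor((w - poolSize) / stride) + 1;
  return [n, c, outH, outW];
}

maxPoolOutputShape([1, 16, 28, 28], 2, 2); // → [1, 16, 14, 14]
```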
### Usage
```typescript
import { MaxPoolOperator } from 'tiny-dl-inference';

const maxpool = new MaxPoolOperator(context);

// 2×2 pooling with stride 2
const halved = await maxpool.forward([input], { poolSize: 2, stride: 2 });

// 3×3 pooling with stride 1
const overlapped = await maxpool.forward([input], { poolSize: 3, stride: 1 });
```
### WGSL Implementation
```wgsl
for (var ph = 0u; ph < outH; ph = ph + 1u) {
  for (var pw = 0u; pw < outW; pw = pw + 1u) {
    var maxVal = -3.402823e+38f;
    for (var kh = 0u; kh < poolSize; kh = kh + 1u) {
      for (var kw = 0u; kw < poolSize; kw = kw + 1u) {
        let h = ph * stride + kh;
        let w = pw * stride + kw;
        let idx = ((n * C + c) * H + h) * W + w;
        maxVal = max(maxVal, input[idx]);
      }
    }
    let outIdx = ((n * C + c) * outH + ph) * outW + pw;
    output[outIdx] = maxVal;
  }
}
```
### Performance
- Memory: Sequential access pattern
- Compute: O(poolSize²) comparisons per output
- Optimal for: small pool sizes such as 2 or 3
## Conv2dOperator

2D convolution layer for feature extraction.

### Description

Applies sliding dot products between kernels and input regions.

### Parameters

| Name | Type | Default | Description |
|---|---|---|---|
| channels | number | required | Number of output channels (K) |
| kernelSize | number | 3 | Size of the convolution kernel (square) |
| stride | number | 1 | Step size between convolutions |
| padding | number | 0 | Zero-padding size |
### Input/Output Shapes

```text
Input:           [N, C, H, W]
Weight:          [K, C, kH, kW]
Bias (optional): [K]
Output:          [N, K, outH, outW]

where:
outH = floor((H + 2*padding - kH) / stride) + 1
outW = floor((W + 2*padding - kW) / stride) + 1
```
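The same calculation as a host-side helper, with a "same"-padding example (illustrative, not part of the library):

```typescript
// Convolution output shape, matching the formula above
function conv2dOutputShape(
  [n, _c, h, w]: [number, number, number, number],
  k: number, // output channels
  kernelSize: number,
  stride: number = 1,
  padding: number = 0
): [number, number, number, number] {
  const outH = Math.floor((h + 2 * padding - kernelSize) / stride) + 1;
  const outW = Math.floor((w + 2 * padding - kernelSize) / stride) + 1;
  return [n, k, outH, outW];
}

// A 3×3 kernel with stride 1 and padding 1 preserves the spatial size:
conv2dOutputShape([1, 3, 28, 28], 32, 3, 1, 1); // → [1, 32, 28, 28]
```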
### Usage
```typescript
import { Conv2dOperator } from 'tiny-dl-inference';

const conv2d = new Conv2dOperator(context);
const output = await conv2d.forward([input, weight, bias], {
  channels: 32,
  kernelSize: 3,
  stride: 1,
  padding: 1
});
```
### WGSL Implementation
```wgsl
// For each output position
for (var oh = 0u; oh < outH; oh = oh + 1u) {
  for (var ow = 0u; ow < outW; ow = ow + 1u) {
    var sum = 0.0;
    // Convolve over all input channels and kernel positions
    for (var ic = 0u; ic < C; ic = ic + 1u) {
      for (var kh = 0u; kh < kH; kh = kh + 1u) {
        for (var kw = 0u; kw < kW; kw = kw + 1u) {
          // Use signed arithmetic: with padding > 0 these indices can go
          // negative, which would wrap around in u32
          let ih = i32(oh * stride + kh) - i32(padding);
          let iw = i32(ow * stride + kw) - i32(padding);
          // Skip positions that fall in the zero padding
          if (ih >= 0 && ih < i32(H) && iw >= 0 && iw < i32(W)) {
            let inputVal = input[((n * C + ic) * H + u32(ih)) * W + u32(iw)];
            let weightVal = weight[((oc * C + ic) * kH + kh) * kW + kw];
            sum = sum + inputVal * weightVal;
          }
        }
      }
    }
    // Add bias
    sum = sum + bias[oc];
    let outIdx = ((n * K + oc) * outH + oh) * outW + ow;
    output[outIdx] = sum;
  }
}
```
### Performance Characteristics

| Kernel Size | Relative Speed | Use Cases |
|---|---|---|
| 1×1 | Fast | Point-wise transformations |
| 3×3 | Standard | Most convolutional layers |
| 5×5 | Slower | Larger receptive fields |
| 7×7 | Slowest | Initial layers |
## Conv2dBiasReLUOperator

Fused Conv2d + Bias + ReLU for optimal performance.

### Description

Combines three operations into a single kernel, eliminating intermediate memory traffic.

### Formula

```text
output = ReLU(Conv2d(input, weight) + bias)
       = max(0, Conv2d(input, weight) + bias)
```
### Performance Improvement

| Metric | Separate Ops | Fused | Improvement |
|---|---|---|---|
| Memory Operations | 6 | 2 | 3× reduction |
| Kernel Launches | 3 | 1 | 3× reduction |
| Memory Traffic | High | Low | Significant |

The memory-operation count treats each op as one read and one write of its activation tensor: three separate kernels touch memory six times, while the fused kernel reads the input once and writes the result once.
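For a sense of what fusion saves, compare the unfused pipeline built from the operators documented above with the fused call. This is a sketch: variable names are illustrative, the operator instances are assumed to be constructed as in the earlier Usage sections, and since Conv2dOperator already applies the bias, the unfused path here is two launches rather than the three counted in the table:

```typescript
const opts = { channels: 32, kernelSize: 3, stride: 1, padding: 1 };

// Unfused: two dispatches plus an intermediate tensor held in GPU memory
const convOut = await conv2d.forward([input, weight, bias], opts);
const activated = await relu.forward([convOut]);

// Fused: one dispatch, no intermediate tensor
const fusedOut = await fused.forward([input, weight, bias], opts);
```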
### Usage
```typescript
import { Conv2dBiasReLUOperator } from 'tiny-dl-inference';

const fused = new Conv2dBiasReLUOperator(context);
const output = await fused.forward([input, weight, bias], {
  channels: 32,
  kernelSize: 3,
  stride: 1,
  padding: 1
});
```
### WGSL Implementation

The key difference from the separate Conv2d kernel:

```wgsl
// ... convolution computation ...
var sum = 0.0;
// [convolution loops]

// Add bias and apply ReLU in the same kernel
sum = sum + bias[oc];
sum = select(0.0, sum, sum > 0.0); // ReLU
output[outIdx] = sum;
```
### When to Use

✅ Use the fused operator for:

- Conv→Bias→ReLU sequences
- Production inference
- Memory-constrained environments

❌ Don't use it when:

- You need the intermediate Conv2d output
- Debugging individual operations
- Exploring different activation functions
## FlattenOperator

Reshapes a tensor while preserving the batch dimension.

### Description

Flattens all dimensions except the first (batch) dimension.

### Transformation

```text
Input:  [N, C, H, W]
Output: [N, C*H*W]

Example:
[1, 3, 224, 224] → [1, 150528]
```
### Implementation Detail

Uses a zero-copy view, so no GPU memory is moved:

```typescript
// Internally uses reshape()
const flat = input.reshape([N, C * H * W]);
```
### Usage
```typescript
import { FlattenOperator } from 'tiny-dl-inference';

const flatten = new FlattenOperator(context);

// Flatten conv output for a dense layer
const output = await flatten.forward([convOutput]);
// Shape: [batch, channels*height*width]
```
### Performance
- Memory: Zero additional allocation
- Compute: No GPU computation
- Time: O(1) - instant
## DenseOperator

Fully connected (dense) layer.

### Description

Matrix multiplication with learned weights and an optional bias.

### Formula

```text
output = input @ weight.T + bias

where:
- input:  [N, in_features]
- weight: [out_features, in_features]
- bias:   [out_features]
- output: [N, out_features]
```
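A CPU reference of the same computation, handy for testing (plain TypeScript, not part of the library API):

```typescript
// y[n][o] = Σ_i x[n][i] * weight[o][i] + bias[o]
function denseReference(
  x: Float32Array,      // [N, inFeatures], row-major
  weight: Float32Array, // [outFeatures, inFeatures], row-major
  bias: Float32Array,   // [outFeatures]
  n: number,
  inFeatures: number,
  outFeatures: number
): Float32Array {
  const y = new Float32Array(n * outFeatures);
  for (let row = 0; row < n; row++) {
    for (let o = 0; o < outFeatures; o++) {
      let sum = bias[o];
      for (let i = 0; i < inFeatures; i++) {
        sum += x[row * inFeatures + i] * weight[o * inFeatures + i];
      }
      y[row * outFeatures + o] = sum;
    }
  }
  return y;
}
```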
### Parameters

| Name | Type | Default | Description |
|---|---|---|---|
| units | number | required | Number of output features |
### Usage
```typescript
import { DenseOperator } from 'tiny-dl-inference';

const dense = new DenseOperator(context);
const output = await dense.forward([input, weight, bias], {
  units: 128
});
```
### WGSL Implementation
```wgsl
for (var n = 0u; n < N; n = n + 1u) {
  for (var outIdx = 0u; outIdx < outFeatures; outIdx = outIdx + 1u) {
    var sum = 0.0;
    // Dot product of the input row and a weight row
    for (var inIdx = 0u; inIdx < inFeatures; inIdx = inIdx + 1u) {
      let inputVal = input[n * inFeatures + inIdx];
      let weightVal = weight[outIdx * inFeatures + inIdx];
      sum = sum + inputVal * weightVal;
    }
    // Add bias and store
    output[n * outFeatures + outIdx] = sum + bias[outIdx];
  }
}
```
### Performance Tips
- Large weight matrices benefit from workgroup-level tiling
- Consider quantization for production inference
- Batch multiple inputs for better GPU utilization
## Operator Comparison
| Operator | FLOPs | Memory | Typical Use |
|---|---|---|---|
| ReLU | O(N) | O(N) | Activation |
| Softmax | O(N) | O(N) | Final classification |
| MaxPool | O(N×k²) | O(N) | Downsampling |
| Conv2d | O(N×C×K×k²) | O(N×K) | Feature extraction |
| Conv2dBiasReLU | O(N×C×K×k²) | O(N×K) | Conv layers |
| Flatten | O(1) | O(1) | Shape transformation |
| Dense | O(N×I×O) | O(N×O) | Classification |
Legend: N = batch, C = channels, K = output channels, k = kernel size, I = input features, O = output features
## Custom Operators

### Template for a Custom Operator
```typescript
import { Operator, OperatorParams, Tensor, TensorShape } from 'tiny-dl-inference';

class CustomOperator extends Operator {
  protected compileShader(): string {
    return `
      @group(0) @binding(0) var<storage, read_write> output: array<f32>;
      @group(0) @binding(1) var<storage, read> input: array<f32>;

      @compute @workgroup_size(256)
      fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
        let idx = global_id.x;
        if (idx >= arrayLength(&input)) { return; }
        // Your computation here
        output[idx] = input[idx] * 2.0;
      }
    `;
  }

  computeOutputShape(inputShape: TensorShape): TensorShape {
    // Return the output shape
    return inputShape;
  }

  async forward(inputs: Tensor[], params?: OperatorParams): Promise<Tensor> {
    const input = inputs[0];
    const outputShape = this.computeOutputShape(input.shape);
    const output = new Tensor(this.context, outputShape, { layout: input.layout });

    this.ensureInitialized();

    const encoder = this.context.createCommandEncoder();
    const pass = encoder.beginComputePass();
    pass.setPipeline(this.pipeline!);

    const bindGroup = this.context.createBindGroup({
      layout: this.bindGroupLayout!,
      entries: [
        { binding: 0, resource: { buffer: output.buffer } },
        { binding: 1, resource: { buffer: input.buffer } }
      ]
    });

    pass.setBindGroup(0, bindGroup);
    pass.dispatchWorkgroups(Math.ceil(output.size / 256));
    pass.end();

    this.context.submit([encoder.finish()]);
    return output;
  }
}
```
### Best Practices
- Use appropriate workgroup sizes (256 is a good default)
- Check bounds to avoid out-of-bounds access
- Minimize divergence within workgroups
- Reuse buffers when possible
- Profile your custom operators