Architecture
System architecture and design principles of Tiny-DL-Inference.
Overview
Tiny-DL-Inference follows a layered architecture that separates concerns between device management, tensor operations, neural network operators, and high-level inference execution.
┌──────────────────────────────────────────────────────────────────┐
│                        Application Layer                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐   │
│  │ Model Loader│  │  Benchmark  │  │    Inference Runner     │   │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘   │
├──────────────────────────────────────────────────────────────────┤
│                          Operator Layer                          │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────────────┐   │
│ │ Conv2d │ │MaxPool │ │  ReLU  │ │Softmax │ │Conv2d+Bias+ReLU│   │
│ └────────┘ └────────┘ └────────┘ └────────┘ └────────────────┘   │
├──────────────────────────────────────────────────────────────────┤
│                            Core Layer                            │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐   │
│  │     Tensor      │  │  ShaderManager  │  │   GPUContext    │   │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘   │
├──────────────────────────────────────────────────────────────────┤
│                           Device Layer                           │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │                    WebGPU Adapter/Device                     │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
Layer Descriptions
Device Layer
The foundation that interfaces with the GPU hardware.
Components:
- GPUAdapter - Physical GPU selection
- GPUDevice - Logical device with command queues
- GPUQueue - Command submission
Responsibilities:
- Hardware abstraction
- Command buffer submission
- Synchronization primitives
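At this level the library only touches the standard WebGPU entry points. A minimal sketch of acquiring a device and submitting work (error handling kept to a single check):

// Physical GPU selection, then a logical device with its default queue
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error('No suitable GPU adapter found');
const device = await adapter.requestDevice();

// Command buffers are recorded with an encoder and handed to the queue
const encoder = device.createCommandEncoder();
device.queue.submit([encoder.finish()]);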
Core Layer
Provides fundamental abstractions for GPU programming.
GPUContext
Centralized resource management:
- Device initialization and configuration
- Buffer allocation and lifecycle
- Command encoding and submission
- Deferred resource destruction
class GPUContext {
  private adapter: GPUAdapter;
  private device: GPUDevice;
  private pendingCleanup: Set<Promise<void>>;

  // Lazy initialization
  async initialize(config?: GPUContextConfig): Promise<void>;

  // Resource management
  deferDestroy(buffer: GPUBuffer): void;
  async sync(): Promise<void>;

  // Command submission
  createCommandEncoder(): GPUCommandEncoder;
  submit(buffers: GPUCommandBuffer[]): void;
}
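A sketch of how the higher layers use this interface (the config argument and buffer creation details are omitted; temporaryBuffer stands in for any scratch GPUBuffer):

const context = new GPUContext();
await context.initialize();

// Record and submit GPU work
const encoder = context.createCommandEncoder();
// ... encode compute passes here ...
context.submit([encoder.finish()]);

// Temporary buffers are queued for destruction and reclaimed once the GPU is done
context.deferDestroy(temporaryBuffer);
await context.sync();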
Tensor
Multi-dimensional array abstraction:
- GPU buffer ownership
- Layout metadata (NCHW/NHWC)
- Zero-copy reshape via views
class Tensor {
  readonly shape: number[];
  readonly layout: DataLayout;
  readonly buffer: GPUBuffer;

  // Data transfer
  upload(data: Float32Array): Promise<void>;
  download(): Promise<Float32Array>;

  // View operations (zero-copy)
  reshape(newShape: number[]): Tensor;

  // Layout conversion
  convertLayout(target: DataLayout): Promise<Tensor>;
}
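A typical round trip through this interface (the shape is arbitrary; the default layout is assumed here):

const t = new Tensor(context, [1, 3, 4, 4]);
await t.upload(new Float32Array(1 * 3 * 4 * 4));  // CPU → GPU

const flat = t.reshape([1, 48]);                  // zero-copy view over the same buffer
const result = await t.download();                // GPU → CPU

flat.destroy();                                   // no-op: the view does not own the buffer
t.destroy();                                      // releases the GPU buffer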
View System:
- Views share the same GPU buffer
- destroy() is a no-op for views
- Enables efficient Flatten operations
Operator Layer
Neural network operations implemented as compute shaders.
Operator Base Class
abstract class Operator {
  protected context: GPUContext;
  protected pipeline: GPUComputePipeline;
  protected shaderModule: GPUShaderModule;

  // Shader compilation
  protected abstract compileShader(): string;

  // Shape inference
  abstract computeOutputShape(
    inputShape: TensorShape,
    params?: OperatorParams
  ): TensorShape;

  // Execution
  abstract forward(
    inputs: Tensor[],
    params?: OperatorParams
  ): Promise<Tensor>;

  // Lazy initialization
  protected ensureInitialized(): void;
}
Shader Compilation Flow
Operator.forward()
        ↓
ensureInitialized() ──first call──→ compileShader()
        ↓                                 ↓
        ←←←←←←←←← pipeline created ←←←←←←←←
        ↓
encode commands → submit → return output tensor
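The first-call branch of this flow amounts to compiling the WGSL once and caching the resulting pipeline. A sketch, written as a standalone helper rather than the library's actual method:

// Compile the shader and build the pipeline on first use, reuse it afterwards
function ensurePipeline(
  device: GPUDevice,
  cache: { pipeline?: GPUComputePipeline },
  wgsl: string
): GPUComputePipeline {
  if (!cache.pipeline) {
    const module = device.createShaderModule({ code: wgsl });
    cache.pipeline = device.createComputePipeline({
      layout: 'auto',                          // derive bind group layouts from the shader
      compute: { module, entryPoint: 'main' }
    });
  }
  return cache.pipeline;
}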
Application Layer
High-level APIs for model execution.
InferenceEngine
Orchestrates the entire inference pipeline:
- Operator registration and management
- Model loading and weight initialization
- Layer-by-layer execution
- Memory cleanup
class InferenceEngine {
  private context: GPUContext;
  private operators: Map<string, Operator>;
  private weights: Map<string, Tensor>;

  async infer(input: Tensor): Promise<Tensor> {
    // 1. Validate input
    // 2. Execute layers in topological order
    // 3. Cleanup intermediate tensors
    // 4. Return output
  }
}
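Steps 2 and 3 boil down to a loop over the layer graph. A sketch under the assumption that each layer record carries an operator type, weight keys, and parameters (those field names are illustrative, not the actual model format):

async function runLayers(
  operators: Map<string, Operator>,
  weights: Map<string, Tensor>,
  layers: { type: string; weightKeys: string[]; params?: OperatorParams }[],
  input: Tensor
): Promise<Tensor> {
  let current = input;
  for (const layer of layers) {
    const op = operators.get(layer.type)!;
    const layerWeights = layer.weightKeys.map((key) => weights.get(key)!);
    const next = await op.forward([current, ...layerWeights], layer.params);
    if (current !== input) current.destroy();  // step 3: free intermediate tensors
    current = next;
  }
  return current;                              // step 4: final layer output
}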
Execution Flow:
input ──┐
        ├→ [conv2d] → [relu] → [maxpool] → [flatten] → [dense] → [softmax] ──→ output
weights─┘
Memory Management
Buffer Lifecycle
Create → Use → Destroy
   │      │       │
   │      │       └── deferDestroy() queues destruction for after GPU work
   │      └────────── encode commands, submit to queue
   └───────────────── GPUContext.createBuffer()
Deferred Destruction
Prevents use-after-free by delaying buffer destruction:
// Safe pattern
const tempBuffer = context.createBuffer({...});
context.submit(commandsUsingBuffer);
context.deferDestroy(tempBuffer);
// Buffer destroyed after GPU work completes
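One way to realize this pattern is to park buffers until GPUQueue.onSubmittedWorkDone() resolves. A sketch, not necessarily the library's actual mechanism:

class DeferredCleanup {
  private pending: GPUBuffer[] = [];

  constructor(private device: GPUDevice) {}

  // Queue the buffer instead of destroying it while the GPU may still read it
  deferDestroy(buffer: GPUBuffer): void {
    this.pending.push(buffer);
  }

  // Destroy queued buffers only after all previously submitted work has finished
  async sync(): Promise<void> {
    await this.device.queue.onSubmittedWorkDone();
    for (const buffer of this.pending) buffer.destroy();
    this.pending = [];
  }
}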
View Semantics
// Original tensor owns the buffer
const original = new Tensor(context, [1, 3, 224, 224]);
// View shares the buffer (zero overhead)
const view = original.reshape([1, 3 * 224 * 224]);
// Cleanup
view.destroy(); // No-op: doesn't own buffer
original.destroy(); // Actually destroys GPU buffer
Data Flow
Forward Pass
Input Tensor (CPU memory)
    ↓ upload()
Input Tensor (GPU buffer)
    ↓ operator.forward()
Command Encoder
    ↓ encode compute pass
Command Buffer
    ↓ submit()
GPU Queue
    ↓ execute shader
Output Tensor (GPU buffer)
    ↓ download()
Output Array (CPU memory)
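The same flow for a single operator, using the Tensor and Operator interfaces above (the ReLU operator's class name and constructor are assumed for illustration):

const relu = new ReLUOperator(context);   // hypothetical concrete Operator
const x = new Tensor(context, [1, 1024]);
await x.upload(new Float32Array(1024));   // CPU → GPU

const y = await relu.forward([x]);        // encode, submit, run the shader
const result = await y.download();        // GPU → CPU

x.destroy();
y.destroy();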
Inference Pipeline
// 1. Initialize
const engine = new InferenceEngine();
await engine.initialize();
// 2. Load model
await engine.loadModel(modelDef);
// - Allocate weight tensors
// - Register operators
// 3. Prepare input
const input = engine.tensorFromArray(data, shape);
// 4. Execute
const output = await engine.infer(input);
// For each layer:
// - Get operator
// - Execute forward pass
// - Store output
// - Destroy intermediate tensors
// 5. Retrieve results
const predictions = await output.download();
// 6. Cleanup
engine.destroy();
WebGPU Integration
Pipeline Creation
// 1. Compile WGSL shader
const shaderModule = device.createShaderModule({
  code: wgslCode,
  label: 'operator-shader'
});

// 2. Create bind group layout
const bindGroupLayout = device.createBindGroupLayout({
  entries: [
    { binding: 0, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'storage' } },
    { binding: 1, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'read-only-storage' } }
  ]
});

// 3. Create pipeline layout
const pipelineLayout = device.createPipelineLayout({
  bindGroupLayouts: [bindGroupLayout]
});

// 4. Create compute pipeline
const pipeline = device.createComputePipeline({
  layout: pipelineLayout,
  compute: { module: shaderModule, entryPoint: 'main' }
});
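Running the pipeline then requires binding the operator's buffers and encoding a compute pass, roughly as follows (inputBuffer, outputBuffer, and numElements are placeholders):

// 5. Bind buffers to the layout declared above
const bindGroup = device.createBindGroup({
  layout: bindGroupLayout,
  entries: [
    { binding: 0, resource: { buffer: outputBuffer } },
    { binding: 1, resource: { buffer: inputBuffer } }
  ]
});

// 6. Encode, dispatch, and submit a compute pass
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(Math.ceil(numElements / 256));
pass.end();
device.queue.submit([encoder.finish()]);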
Dispatch Strategy
Workgroup size optimized for GPU occupancy:
@group(0) @binding(0) var<storage, read_write> output: array<f32>;
@group(0) @binding(1) var<storage, read> input: array<f32>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
  let idx = global_id.x;
  // Guard against the extra threads in the final workgroup
  if (idx >= arrayLength(&input)) { return; }
  // Process element at idx (here: ReLU as the activation)
  output[idx] = max(input[idx], 0.0);
}
Dispatch size calculation:
const workgroupSize = 256;
const numWorkgroups = Math.ceil(numElements / workgroupSize);
pass.dispatchWorkgroups(numWorkgroups);
Optimization Architecture
Kernel Fusion
Combines multiple operations into a single GPU kernel:
Without fusion:
  Conv2d → Bias → ReLU
  6 memory operations (each op reads its input and writes its output)

With fusion:
  Conv2d+Bias+ReLU (single kernel)
  2 memory operations

Result: roughly 3x less memory traffic
Memory Layout Strategy
NCHW (Channel-first):
- Natural for convolution operations
- Better cache locality for channel-wise ops
- PyTorch-compatible
NHWC (Channel-last):
- Better memory coalescing on GPU
- TensorFlow-compatible
- Conversion utilities provided
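The practical difference is just where the channel stride sits in the flat index. Illustrative helpers (not necessarily the library's own):

// Flat offset of element (n, c, h, w) in a [N, C, H, W] tensor
function indexNCHW(n: number, c: number, h: number, w: number,
                   C: number, H: number, W: number): number {
  return ((n * C + c) * H + h) * W + w;   // a whole H×W plane per channel
}

// Same element when the data is stored channel-last ([N, H, W, C])
function indexNHWC(n: number, c: number, h: number, w: number,
                   C: number, H: number, W: number): number {
  return ((n * H + h) * W + w) * C + c;   // all channels of one pixel are contiguous
}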
Error Handling
Error Hierarchy
Error
├── WebGPUNotSupportedError
├── DeviceInitializationError
├── InvalidShapeError
└── BufferSizeError
Recovery Strategies
| Error Type | Recovery |
|---|---|
| WebGPUNotSupported | Fallback to CPU or show message |
| DeviceInitialization | Retry with lower power preference |
| InvalidShape | Validate shapes before operation |
| BufferSize | Check data length matches tensor size |
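The first two rows of this table might look like the following in practice (a sketch built on the standard WebGPU calls rather than the library's error classes):

async function initWithFallback(): Promise<GPUDevice | null> {
  // WebGPUNotSupported: bail out so the caller can fall back to CPU or show a message
  if (!('gpu' in navigator)) {
    console.warn('WebGPU is not supported in this browser');
    return null;
  }

  // DeviceInitialization: retry with a lower power preference before giving up
  let adapter = await navigator.gpu.requestAdapter({ powerPreference: 'high-performance' });
  if (!adapter) {
    adapter = await navigator.gpu.requestAdapter({ powerPreference: 'low-power' });
  }
  return adapter ? adapter.requestDevice() : null;
}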
Extension Points
Adding Custom Operators
class CustomOperator extends Operator {
  protected compileShader(): string {
    return `
      @compute @workgroup_size(256)
      fn main(@builtin(global_invocation_id) id: vec3<u32>) {
        // Custom computation
      }
    `;
  }

  computeOutputShape(input: TensorShape): TensorShape {
    // Return output shape
  }

  async forward(inputs: Tensor[], params: OperatorParams): Promise<Tensor> {
    // Implementation
  }
}
Custom Model Formats
Extend ModelLoader to support different formats:
class ONNXLoader extends ModelLoader {
  async loadFromONNX(path: string): Promise<ModelDefinition> {
    // Parse ONNX and convert to internal format
  }
}
Performance Considerations
Bottlenecks
- CPU-GPU Transfer: Minimize upload/download frequency
- Shader Compilation: Cache pipelines when possible
- Memory Allocation: Reuse buffers for variable-sized inputs
- Synchronization: Batch operations to reduce sync points
Best Practices
- Use fused operators where available
- Process batches of inputs together
- Pre-allocate tensors for repeated inference
- Use deferDestroy() for temporary buffers
- Call sync() only when necessary
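For example, the pre-allocation advice can look like this across repeated inferences (frames is a hypothetical source of Float32Array inputs):

const input = engine.tensorFromArray(new Float32Array(1 * 3 * 224 * 224), [1, 3, 224, 224]);

for (const frame of frames) {
  await input.upload(frame);                // overwrite the existing GPU buffer
  const output = await engine.infer(input);
  const scores = await output.download();   // use scores here
  output.destroy();                         // release the per-inference output
}

input.destroy();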