Architecture
System architecture and design principles of Tiny-DL-Inference.
Overview
Tiny-DL-Inference follows a layered architecture that separates concerns between device management, tensor operations, neural network operators, and high-level inference execution.
┌──────────────────────────────────────────────────────────────────┐
│                        Application Layer                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐   │
│  │ Model Loader│  │  Benchmark  │  │    Inference Runner     │   │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘   │
├──────────────────────────────────────────────────────────────────┤
│                          Operator Layer                          │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────────────┐   │
│ │ Conv2d │ │MaxPool │ │  ReLU  │ │Softmax │ │Conv2d+Bias+ReLU│   │
│ └────────┘ └────────┘ └────────┘ └────────┘ └────────────────┘   │
├──────────────────────────────────────────────────────────────────┤
│                            Core Layer                            │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐   │
│  │     Tensor      │  │  ShaderManager  │  │   GPUContext    │   │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘   │
├──────────────────────────────────────────────────────────────────┤
│                           Device Layer                           │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │                    WebGPU Adapter/Device                     │ │
│ └──────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
Layer Descriptions
Device Layer
The foundation that interfaces with the GPU hardware.
Components:
- GPUAdapter - Physical GPU selection
- GPUDevice - Logical device with command queues
- GPUQueue - Command submission
Responsibilities:
- Hardware abstraction
- Command buffer submission
- Synchronization primitives
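At this level the library only touches the standard WebGPU entry points. A minimal sketch of acquiring a device and submitting work (error handling kept to a single check):

// Physical GPU selection, then a logical device with its default queue
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error('No suitable GPU adapter found');
const device = await adapter.requestDevice();

// Command buffers are recorded with an encoder and handed to the queue
const encoder = device.createCommandEncoder();
device.queue.submit([encoder.finish()]);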
Core Layer
Provides fundamental abstractions for GPU programming.
GPUContext
Centralized resource management:
- Device initialization and configuration
- Buffer allocation and lifecycle
- Command encoding and submission
- Deferred resource destruction
class GPUContext {
  private adapter: GPUAdapter;
  private device: GPUDevice;
  private pendingCleanup: Set<Promise<void>>;

  // Lazy initialization
  async initialize(config?: GPUContextConfig): Promise<void>;

  // Resource management
  deferDestroy(buffer: GPUBuffer): void;
  async sync(): Promise<void>;

  // Command submission
  createCommandEncoder(): GPUCommandEncoder;
  submit(buffers: GPUCommandBuffer[]): void;
}
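A sketch of how the higher layers use this interface (the config argument and buffer creation details are omitted; temporaryBuffer stands in for any scratch GPUBuffer):

const context = new GPUContext();
await context.initialize();

// Record and submit GPU work
const encoder = context.createCommandEncoder();
// ... encode compute passes here ...
context.submit([encoder.finish()]);

// Temporary buffers are queued for destruction and reclaimed once the GPU is done
context.deferDestroy(temporaryBuffer);
await context.sync();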
Tensor
Multi-dimensional array abstraction:
- GPU buffer ownership
- Layout metadata (NCHW/NHWC)
- Zero-copy reshape via views
class Tensor {
  readonly shape: number[];
  readonly layout: DataLayout;
  readonly buffer: GPUBuffer;

  // Data transfer
  upload(data: Float32Array): Promise<void>;
  download(): Promise<Float32Array>;

  // View operations (zero-copy)
  reshape(newShape: number[]): Tensor;

  // Layout conversion
  convertLayout(target: DataLayout): Promise<Tensor>;
}
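A typical round trip through this interface (the shape is arbitrary; the default layout is assumed here):

const t = new Tensor(context, [1, 3, 4, 4]);
await t.upload(new Float32Array(1 * 3 * 4 * 4));  // CPU → GPU

const flat = t.reshape([1, 48]);                  // zero-copy view over the same buffer
const result = await t.download();                // GPU → CPU

flat.destroy();                                   // no-op: the view does not own the buffer
t.destroy();                                      // releases the GPU buffer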
View System:
- Views share the same GPU buffer
- destroy() is a no-op for views
- Enables efficient Flatten operations
Operator Layer
Neural network operations implemented as compute shaders.
Operator Base Class
abstract class Operator {
  protected context: GPUContext;
  protected pipeline: GPUComputePipeline;
  protected shaderModule: GPUShaderModule;

  // Shader compilation
  protected abstract compileShader(): string;

  // Shape inference
  abstract computeOutputShape(
    inputShape: TensorShape,
    params?: OperatorParams
  ): TensorShape;

  // Execution
  abstract forward(
    inputs: Tensor[],
    params?: OperatorParams
  ): Promise<Tensor>;

  // Lazy initialization
  protected ensureInitialized(): void;
}
Shader Compilation Flow
Operator.forward()
        ↓
ensureInitialized() ──first call──→ compileShader()
        ↓                                 ↓
        ←←←←←←←←← pipeline created ←←←←←←←←
        ↓
encode commands → submit → return output tensor
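The first-call branch of this flow amounts to compiling the WGSL once and caching the resulting pipeline. A sketch, written as a standalone helper rather than the library's actual method:

// Compile the shader and build the pipeline on first use, reuse it afterwards
function ensurePipeline(
  device: GPUDevice,
  cache: { pipeline?: GPUComputePipeline },
  wgsl: string
): GPUComputePipeline {
  if (!cache.pipeline) {
    const module = device.createShaderModule({ code: wgsl });
    cache.pipeline = device.createComputePipeline({
      layout: 'auto',                          // derive bind group layouts from the shader
      compute: { module, entryPoint: 'main' }
    });
  }
  return cache.pipeline;
}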
Application Layer
High-level APIs for model execution.
InferenceEngine
Orchestrates the entire inference pipeline:
- Operator registration and management
- Model loading and weight initialization
- Layer-by-layer execution
- Memory cleanup
class InferenceEngine {
  private context: GPUContext;
  private operators: Map<string, Operator>;
  private weights: Map<string, Tensor>;

  async infer(input: Tensor): Promise<Tensor> {
    // 1. Validate input
    // 2. Execute layers in topological order
    // 3. Cleanup intermediate tensors
    // 4. Return output
  }
}
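Steps 2 and 3 boil down to a loop over the layer graph. A sketch under the assumption that each layer record carries an operator type, weight keys, and parameters (those field names are illustrative, not the actual model format):

async function runLayers(
  operators: Map<string, Operator>,
  weights: Map<string, Tensor>,
  layers: { type: string; weightKeys: string[]; params?: OperatorParams }[],
  input: Tensor
): Promise<Tensor> {
  let current = input;
  for (const layer of layers) {
    const op = operators.get(layer.type)!;
    const layerWeights = layer.weightKeys.map((key) => weights.get(key)!);
    const next = await op.forward([current, ...layerWeights], layer.params);
    if (current !== input) current.destroy();  // step 3: free intermediate tensors
    current = next;
  }
  return current;                              // step 4: final layer output
}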
Execution Flow:
input ──┐
        ├→ [conv2d] → [relu] → [maxpool] → [flatten] → [dense] → [softmax] ──→ output
weights─┘
Memory Management
Buffer Lifecycle
Create → Use → Destroy
   │      │       │
   │      │       └── deferDestroy() queues destruction for after GPU work
   │      └────────── encode commands, submit to queue
   └───────────────── GPUContext.createBuffer()
Deferred Destruction
Prevents use-after-free by delaying buffer destruction:
// Safe pattern
const tempBuffer = context.createBuffer({...});
context.submit(commandsUsingBuffer);
context.deferDestroy(tempBuffer);
// Buffer destroyed after GPU work completes
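One way to realize this pattern is to park buffers until GPUQueue.onSubmittedWorkDone() resolves. A sketch, not necessarily the library's actual mechanism:

class DeferredCleanup {
  private pending: GPUBuffer[] = [];

  constructor(private device: GPUDevice) {}

  // Queue the buffer instead of destroying it while the GPU may still read it
  deferDestroy(buffer: GPUBuffer): void {
    this.pending.push(buffer);
  }

  // Destroy queued buffers only after all previously submitted work has finished
  async sync(): Promise<void> {
    await this.device.queue.onSubmittedWorkDone();
    for (const buffer of this.pending) buffer.destroy();
    this.pending = [];
  }
}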
View Semantics
// Original tensor owns the buffer
const original = new Tensor(context, [1, 3, 224, 224]);
// View shares the buffer (zero overhead)
const view = original.reshape([1, 3 * 224 * 224]);
// Cleanup
view.destroy(); // No-op: doesn't own buffer
original.destroy(); // Actually destroys GPU buffer
Data Flow
Forward Pass
Input Tensor (CPU memory)
    ↓ upload()
Input Tensor (GPU buffer)
    ↓ operator.forward()
Command Encoder
    ↓ encode compute pass
Command Buffer
    ↓ submit()
GPU Queue
    ↓ execute shader
Output Tensor (GPU buffer)
    ↓ download()
Output Array (CPU memory)
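The same flow for a single operator, using the Tensor and Operator interfaces above (the ReLU operator's class name and constructor are assumed for illustration):

const relu = new ReLUOperator(context);   // hypothetical concrete Operator
const x = new Tensor(context, [1, 1024]);
await x.upload(new Float32Array(1024));   // CPU → GPU

const y = await relu.forward([x]);        // encode, submit, run the shader
const result = await y.download();        // GPU → CPU

x.destroy();
y.destroy();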
Inference Pipeline
// 1. Initialize
const engine = new InferenceEngine();
await engine.initialize();
// 2. Load model
await engine.loadModel(modelDef);
// - Allocate weight tensors
// - Register operators
// 3. Prepare input
const input = engine.tensorFromArray(data, shape);
// 4. Execute
const output = await engine.infer(input);
// For each layer:
// - Get operator
// - Execute forward pass
// - Store output
// - Destroy intermediate tensors
// 5. Retrieve results
const predictions = await output.download();
// 6. Cleanup
engine.destroy();
WebGPU Integration
Pipeline Creation
// 1. Compile WGSL shader
const shaderModule = device.createShaderModule({
  code: wgslCode,
  label: 'operator-shader'
});

// 2. Create bind group layout
const bindGroupLayout = device.createBindGroupLayout({
  entries: [
    { binding: 0, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'storage' } },
    { binding: 1, visibility: GPUShaderStage.COMPUTE, buffer: { type: 'read-only-storage' } }
  ]
});

// 3. Create pipeline layout
const pipelineLayout = device.createPipelineLayout({
  bindGroupLayouts: [bindGroupLayout]
});

// 4. Create compute pipeline
const pipeline = device.createComputePipeline({
  layout: pipelineLayout,
  compute: { module: shaderModule, entryPoint: 'main' }
});
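Running the pipeline then requires binding the operator's buffers and encoding a compute pass, roughly as follows (inputBuffer, outputBuffer, and numElements are placeholders):

// 5. Bind buffers to the layout declared above
const bindGroup = device.createBindGroup({
  layout: bindGroupLayout,
  entries: [
    { binding: 0, resource: { buffer: outputBuffer } },
    { binding: 1, resource: { buffer: inputBuffer } }
  ]
});

// 6. Encode, dispatch, and submit a compute pass
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(Math.ceil(numElements / 256));
pass.end();
device.queue.submit([encoder.finish()]);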
Dispatch Strategy
Workgroup size optimized for GPU occupancy:
@group(0) @binding(0) var<storage, read_write> output: array<f32>;
@group(0) @binding(1) var<storage, read> input: array<f32>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
  let idx = global_id.x;
  // Guard against the extra threads in the final workgroup
  if (idx >= arrayLength(&input)) { return; }
  // Process element at idx (here: ReLU as the activation)
  output[idx] = max(input[idx], 0.0);
}
Dispatch size calculation:
const workgroupSize = 256;
const numWorkgroups = Math.ceil(numElements / workgroupSize);
pass.dispatchWorkgroups(numWorkgroups);
Optimization Architecture
Kernel Fusion
Combines multiple operations into a single GPU kernel:
Without fusion:
  Conv2d → Bias → ReLU
  6 memory operations (each op reads its input and writes its output)

With fusion:
  Conv2d+Bias+ReLU (single kernel)
  2 memory operations

Result: roughly 3x less memory traffic
Memory Layout Strategy
NCHW (Channel-first):
- Natural for convolution operations
- Better cache locality for channel-wise ops
- PyTorch-compatible
NHWC (Channel-last):
- Better memory coalescing on GPU
- TensorFlow-compatible
- Conversion utilities provided
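The practical difference is just where the channel stride sits in the flat index. Illustrative helpers (not necessarily the library's own):

// Flat offset of element (n, c, h, w) in a [N, C, H, W] tensor
function indexNCHW(n: number, c: number, h: number, w: number,
                   C: number, H: number, W: number): number {
  return ((n * C + c) * H + h) * W + w;   // a whole H×W plane per channel
}

// Same element when the data is stored channel-last ([N, H, W, C])
function indexNHWC(n: number, c: number, h: number, w: number,
                   C: number, H: number, W: number): number {
  return ((n * H + h) * W + w) * C + c;   // all channels of one pixel are contiguous
}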
Error Handling
Error Hierarchy
Error
├── WebGPUNotSupportedError
├── DeviceInitializationError
├── InvalidShapeError
└── BufferSizeError
Recovery Strategies
| Error Type | Recovery |
|---|---|
| WebGPUNotSupported | Fallback to CPU or show message |
| DeviceInitialization | Retry with lower power preference |
| InvalidShape | Validate shapes before operation |
| BufferSize | Check data length matches tensor size |
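The first two rows of this table might look like the following in practice (a sketch built on the standard WebGPU calls rather than the library's error classes):

async function initWithFallback(): Promise<GPUDevice | null> {
  // WebGPUNotSupported: bail out so the caller can fall back to CPU or show a message
  if (!('gpu' in navigator)) {
    console.warn('WebGPU is not supported in this browser');
    return null;
  }

  // DeviceInitialization: retry with a lower power preference before giving up
  let adapter = await navigator.gpu.requestAdapter({ powerPreference: 'high-performance' });
  if (!adapter) {
    adapter = await navigator.gpu.requestAdapter({ powerPreference: 'low-power' });
  }
  return adapter ? adapter.requestDevice() : null;
}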
Extension Points
Adding Custom Operators
class CustomOperator extends Operator {
  protected compileShader(): string {
    return `
      @compute @workgroup_size(256)
      fn main(@builtin(global_invocation_id) id: vec3<u32>) {
        // Custom computation
      }
    `;
  }

  computeOutputShape(input: TensorShape): TensorShape {
    // Return output shape
  }

  async forward(inputs: Tensor[], params: OperatorParams): Promise<Tensor> {
    // Implementation
  }
}
Custom Model Formats
Extend ModelLoader to support different formats:
class ONNXLoader extends ModelLoader {
  async loadFromONNX(path: string): Promise<ModelDefinition> {
    // Parse ONNX and convert to internal format
  }
}
Performance Considerations
Bottlenecks
- CPU-GPU Transfer: Minimize upload/download frequency
- Shader Compilation: Cache pipelines when possible
- Memory Allocation: Reuse buffers for variable-sized inputs
- Synchronization: Batch operations to reduce sync points
Best Practices
- Use fused operators where available
- Process batches of inputs together
- Pre-allocate tensors for repeated inference
- Use deferDestroy() for temporary buffers
- Call sync() only when necessary
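For example, the pre-allocation advice can look like this across repeated inferences (frames is a hypothetical source of Float32Array inputs):

const input = engine.tensorFromArray(new Float32Array(1 * 3 * 224 * 224), [1, 3, 224, 224]);

for (const frame of frames) {
  await input.upload(frame);                // overwrite the existing GPU buffer
  const output = await engine.infer(input);
  const scores = await output.download();   // use scores here
  output.destroy();                         // release the per-inference output
}

input.destroy();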