# AGENTS.md - AI Agent Workflow Configuration

## Project Overview

Mini-Inference Engine is a CUDA-based neural network inference engine focused on GEMM (General Matrix Multiply) optimization. It demonstrates progressive CUDA optimization techniques from naive matrix multiplication to highly optimized kernels achieving ~85% of cuBLAS performance.

## Tech Stack

| Component | Version/Standard | Purpose |
| --- | --- | --- |
| C++ | 17 | Host code implementation |
| CUDA | 11.0+ (Toolkit), 7.0+ (Compute Capability) | GPU kernels and device code |
| CMake | 3.18+ | Build system |
| cuBLAS | Bundled with CUDA | Performance baseline and comparison |
| Google Test | 1.14.0 (fetched) | Unit testing framework |

### Supported GPU Architectures

Target compute capabilities: 75, 80, 86, 89, 90 (Turing through Hopper)


## Project Philosophy: OpenSpec Spec-Driven Development

This project follows the OpenSpec Spec-Driven Development (SDD) framework. All code implementations must use the specification documents in /openspec/specs/ as the Single Source of Truth.

### OpenSpec Workflow

Use these slash commands for development:

| Command | Purpose | When to Use |
| --- | --- | --- |
| `/opsx:explore` | Think through ideas, investigate problems | Before proposing a change |
| `/opsx:propose` | Create change with proposal, design, tasks | Starting new work |
| `/opsx:apply` | Implement tasks from a change | Ready to code |
| `/opsx:verify` | Verify implementation matches specs | Before archiving |
| `/opsx:archive` | Archive completed change | Work finished |
| `/opsx:status` | Show status of changes and specs | Check progress |

### Specification Documents (`/openspec/specs/`)

| Directory | Contents | When to Update |
| --- | --- | --- |
| `product/` | Product feature definitions and acceptance criteria | Adding/changing features |
| `architecture/` | Technical design documents (RFCs) | Major architecture changes |
| `api/` | API interface definitions | Changing public interfaces |
| `data/` | Data schemas, model definitions | Changing data structures |
| `testing/` | BDD test specifications | Adding test requirements |

### Key Specification Files

- `product/gemm-optimization-requirements.md` - Core requirements R1-R9
- `product/implementation-plan.md` - Development roadmap
- `architecture/0001-core-architecture.md` - System architecture design
- `architecture/0002-memory-pool.md` - Memory pool design
- `architecture/0003-quantization.md` - INT8 quantization
- `architecture/0004-stream-manager.md` - CUDA stream management
- `architecture/0005-auto-tuner.md` - Auto-tuning system
- `architecture/0006-logger-config-profiler.md` - Infrastructure components
- `architecture/0007-half-precision-gemm.md` - FP16 support
- `architecture/0008-batch-gemm.md` - Batched operations

## AI Agent Workflow Instructions

When you (the AI) are asked to develop a new feature, modify existing functionality, or fix a bug, follow the OpenSpec workflow:

### Quick Start

1. Explore first (optional): Use `/opsx:explore` to think through the problem
2. Propose a change: Use `/opsx:propose <name>` to create a change proposal
3. Implement: Use `/opsx:apply <name>` to work through tasks
4. Verify: Use `/opsx:verify <name>` to check implementation
5. Archive: Use `/opsx:archive <name>` to finalize

### Detailed Workflow

#### Step 1: Review Specifications

- First, read the relevant documents in the `/openspec/specs/` directory
- If the user's request conflicts with existing specs, immediately stop coding and point out the conflict
- Use `/opsx:explore` to investigate and clarify requirements

#### Step 2: Propose Change

- Use `/opsx:propose <name>` to create a change proposal
- This creates artifacts in `openspec/changes/<name>/`:
  - `proposal.md` - What & why
  - `specs/` - Delta specs (ADDED/MODIFIED/REMOVED requirements)
  - `design.md` - Technical approach
  - `tasks.md` - Implementation checklist
- Wait for user confirmation before proceeding

#### Step 3: Implement

- Use `/opsx:apply <name>` to implement tasks
- Comply 100% with the definitions in the specs
- Do not add features not defined in the specs (no gold-plating)
- Mark tasks complete as you go

#### Step 4: Verify & Archive

- Use `/opsx:verify <name>` to check that the implementation matches the specs
- Use `/opsx:archive <name>` to sync specs and archive the change

## Directory Structure

```text
mini-inference-engine/
├── openspec/              # OpenSpec framework
│   ├── config.yaml        # OpenSpec configuration
│   ├── specs/             # Source of truth
│   │   ├── product/       # Requirements
│   │   ├── architecture/  # Technical designs (RFCs)
│   │   ├── api/           # API contracts
│   │   ├── data/          # Data schemas
│   │   └── testing/       # Test specifications
│   ├── changes/           # Active change proposals
│   └── archive/           # Completed changes
├── include/               # Header files (14 .h + 3 .cuh)
│   ├── common.h           # Core data structures, error handling, utilities
│   ├── kernels.cuh        # GEMM kernel declarations and launch wrappers
│   ├── tensor.h           # N-dimensional tensor class
│   ├── inference_engine.h # Neural network inference engine
│   ├── memory_pool.h      # GPU memory pool with caching
│   ├── stream_manager.h   # CUDA stream management
│   ├── config.h           # Configuration file parsing
│   ├── logger.h           # Logging infrastructure
│   ├── profiler.h         # Performance profiling
│   ├── autotuner.h        # Automatic kernel selection
│   ├── batch_gemm.h       # Batched GEMM operations
│   ├── quantization.h     # INT8 quantization support
│   ├── half_gemm.cuh      # FP16 GEMM kernels
│   └── vectorized_gemm.cuh   # Vectorized load optimizations
├── src/                   # Implementation files (11 total)
│   ├── naive_matmul.cu    # Level 1: Baseline implementation
│   ├── tiled_gemm.cu      # Level 2: Shared memory tiling
│   ├── coalesced_gemm.cu  # Level 3: Memory coalescing
│   ├── double_buffer_gemm.cu  # Level 4: Latency hiding
│   ├── optimized_gemm.cu  # Level 5: Register blocking
│   ├── fused_gemm.cu      # Level 6: Operator fusion (MatMul+Bias+ReLU)
│   ├── vectorized_gemm.cu # Level 7: Vectorized loads
│   ├── half_gemm.cu       # FP16 precision kernels
│   ├── tensor.cu          # Tensor operations implementation
│   ├── benchmark.cu       # Performance benchmarking utilities
│   └── inference_engine.cpp  # Engine implementation
├── tests/                 # Unit tests (11 files, 207 test cases)
│   ├── test_gemm.cu       # All GEMM kernel correctness tests
│   ├── test_fusion.cu     # Fusion kernel tests
│   ├── test_advanced.cu   # Advanced features tests
│   ├── test_tensor.cpp    # Tensor operations
│   ├── test_inference.cpp # InferenceEngine tests
│   ├── test_memory_pool.cpp  # MemoryPool tests
│   ├── test_stream_manager.cpp  # StreamManager tests
│   ├── test_config.cpp    # Config parsing tests
│   ├── test_logger.cpp    # Logger tests
│   ├── test_quantization.cpp  # Quantization tests
│   └── test_batch_gemm.cpp  # Batched GEMM tests
├── benchmarks/            # Performance benchmarks (3 files)
│   ├── benchmark.cpp      # Main benchmark runner
│   ├── detailed_benchmark.cu  # Detailed kernel analysis
│   └── mnist_demo.cpp     # MNIST inference demo
├── config/                # Runtime configuration examples
│   ├── default.ini        # Default settings
│   ├── debug.ini          # Debug settings (verbose logging)
│   └── high_performance.ini  # Production-optimized settings
├── scripts/               # Utility scripts
│   └── export_mnist_weights.py  # PyTorch weight export tool
├── docs/                  # User documentation
│   ├── en/                # English documentation (7 files)
│   ├── zh/                # Chinese documentation (7 files)
│   └── releases/          # Release notes
├── .claude/               # Claude Code configuration
│   ├── commands/opsx/     # OpenSpec slash commands
│   ├── skills/openspec/   # OpenSpec skill
│   └── settings.json      # Claude Code settings
├── CMakeLists.txt         # Build configuration
├── CMakePresets.json      # Build presets
├── AGENTS.md              # This file - AI workflow instructions
└── CLAUDE.md              # Claude Code project instructions
```

## Architecture Overview

The project follows a 4-layer architecture:

```text
┌─────────────────────────────────────────────────────────────────────┐
│                       Application Layer                              │
│   Benchmark  │  MNIST Demo  │  Tests  │  Your Application          │
├─────────────────────────────────────────────────────────────────────┤
│                         Engine Layer                                 │
│   InferenceEngine  │  Tensor  │  AutoTuner  │  Profiler            │
├─────────────────────────────────────────────────────────────────────┤
│                         Kernel Layer                                 │
│   Naive │ Tiled │ Coalesced │ DoubleBuffer │ Optimized │ Fused    │
│   Vectorized │ Half-Precision │ Batched │ cuBLAS wrapper          │
├─────────────────────────────────────────────────────────────────────┤
│                     Infrastructure Layer                             │
│   MemoryPool  │  StreamManager  │  Logger  │  Config               │
└─────────────────────────────────────────────────────────────────────┘
```

### GEMM Optimization Path

| Level | Technique | vs cuBLAS | Key Optimization |
| --- | --- | --- | --- |
| 1 | Naive | ~10% | Baseline: each thread computes one element |
| 2 | Tiled | ~20% | Shared memory 32×32 tiles |
| 3 | Coalesced | ~30% | Warp-level memory coalescing |
| 4 | Double Buffer | ~40% | Prefetching to hide latency |
| 5 | Register Blocked | ~65% | Register tiling for compute density |
| 6 | Fused | ~75% | MatMul+Bias+ReLU in single kernel |
| 7 | Vectorized | ~85% | `float4` vectorized loads |

## Build Commands

### Prerequisites

- CUDA Toolkit 11.0 or higher
- CMake 3.18 or higher
- C++17 compatible compiler (GCC 9+, MSVC 2019+)
- NVIDIA GPU with compute capability 7.0+

### Build Presets

```bash
# Debug build with tests (recommended for development)
cmake --preset default
cmake --build --preset default

# Release build without tests (for benchmarking)
cmake --preset release
cmake --build --preset release

# CI build (Release + tests)
cmake --preset ci
cmake --build --preset ci
```

### Manual Build (without presets)

```bash
# Configure
cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON

# Build
cmake --build build --parallel $(nproc)

# Install (optional)
cmake --install build --prefix /usr/local
```

### Build Options

| Option | Default | Description |
| --- | --- | --- |
| `BUILD_TESTS` | ON | Build test suite with Google Test |
| `ENABLE_FAST_MATH` | ON (Release) | Enable `--use_fast_math` for CUDA |
| `CMAKE_CUDA_ARCHITECTURES` | 75;80;86;89;90 | Target GPU architectures |
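As a rough sketch, these options could be wired up in CMake as follows. The option names and defaults come from the table above, but the exact logic in the project's real `CMakeLists.txt` may differ:

```cmake
# Illustrative only -- not copied from this repository's CMakeLists.txt.
option(BUILD_TESTS "Build test suite with Google Test" ON)
option(ENABLE_FAST_MATH "Enable --use_fast_math for CUDA" ON)

# Honor a user-supplied -DCMAKE_CUDA_ARCHITECTURES, otherwise use the defaults.
if(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
  set(CMAKE_CUDA_ARCHITECTURES 75 80 86 89 90)
endif()

if(ENABLE_FAST_MATH)
  # Apply the flag to CUDA sources only.
  add_compile_options($<$<COMPILE_LANGUAGE:CUDA>:--use_fast_math>)
endif()
```

Each option can be overridden at configure time, e.g. `cmake -B build -DBUILD_TESTS=OFF -DCMAKE_CUDA_ARCHITECTURES=86`.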

## Test Commands

```bash
# Run all tests (requires NVIDIA GPU)
ctest --preset default

# Run with verbose output on failure
ctest --preset default --output-on-failure

# Alternative: run the test binary directly
./build/tests
```

Note: Tests require an NVIDIA GPU with CUDA support and cannot run on standard CI runners without GPU access.


## Code Style Guidelines

### Formatting

- Follow `.clang-format` configuration (Google-based style)
- 4-space indentation, 100 column limit
- Use `clang-format --style=file -i <file>` to format files
```bash
# Format a single file
clang-format --style=file -i src/my_file.cu

# Format all source files
find src include tests -name "*.cpp" -o -name "*.cu" -o -name "*.h" -o -name "*.cuh" | \
  xargs clang-format --style=file -i
```

### Naming Conventions

```cpp
// C++ naming
class ClassName;                  // PascalCase
void function_name();             // snake_case
int variable_name;                // snake_case
const int CONSTANT_NAME = 0;      // UPPER_SNAKE_CASE
int member_variable_;             // snake_case with trailing underscore

// CUDA specifics
__global__ void my_kernel();      // snake_case
template<int BLOCK_SIZE>          // Template parameters: UPPER_SNAKE_CASE
__global__ void tiled_kernel();

// Inside kernels:
__shared__ float s_data[256];     // Prefix: s_ for shared memory
float r_sum = 0.0f;               // Prefix: r_ for registers
```

### Code Organization Rules

1. Explicit source list in CMake: Do NOT use `GLOB_RECURSE` for source files. Add new files to the explicit list in `CMakeLists.txt`.

2. Header-only templates: CUDA kernel templates should be in `.cuh` headers when used across multiple translation units.

3. Error handling: Use `CUDA_CHECK()` and `CUBLAS_CHECK()` macros for all CUDA API calls.

4. RAII resources: Use `DeviceMemory` or `PooledMemory` for automatic GPU memory management.


## Testing Strategy

### Test Categories

  1. Correctness Tests - Verify numerical accuracy against CPU reference
  2. Equivalence Tests - All GEMM variants produce identical results
  3. Property Tests - Invariants hold across random inputs
  4. Integration Tests - End-to-end inference scenarios

### Writing Tests

Tests use Google Test framework. Example pattern:

```cpp
#include <gtest/gtest.h>
#include "common.h"
#include "kernels.cuh"

using namespace mini_inference;

class MyTest : public ::testing::Test {
protected:
    void SetUp() override {
        CUDA_CHECK(cudaSetDevice(0));
    }
};

TEST_F(MyTest, TestName) {
    // Arrange
    DeviceMemory d_A(size);

    // Act
    launch_kernel(...);
    CUDA_CHECK(cudaDeviceSynchronize());

    // Assert
    EXPECT_LT(max_error, 1e-5f);
}
```

## Commit Style

Follow Conventional Commits:

| Type | Description |
| --- | --- |
| `feat:` | New features |
| `fix:` | Bug fixes |
| `docs:` | Documentation changes |
| `perf:` | Performance improvements |
| `refactor:` | Code refactoring |
| `test:` | Test changes |
| `chore:` | Build/tool changes |

Include requirement IDs from specs when applicable:

```text
feat(gemm): implement tiling optimization

- Implement 32x32 tile blocking (R2.1)
- Add shared memory loading (R2.2, R2.3)
- Handle boundary conditions (R2.4)

Closes #42
```

## Security Considerations

- GPU Memory: Sensitive data in GPU memory is not automatically cleared. Call `zero()` on buffers handling sensitive information.
- File I/O: Weight files use a custom binary format with magic bytes for validation. Always validate file headers before loading.
- CUDA Errors: All CUDA API errors throw exceptions that must be caught at appropriate boundaries.

## Deployment

### GitHub Pages (Documentation)

Documentation is automatically deployed to GitHub Pages via Jekyll when changes are pushed to `docs/`, `index.md`, `_config.yml`, or `Gemfile`.

### CI/CD

- `.github/workflows/ci.yml` - Build verification on push/PR
- `.github/workflows/pages.yml` - Documentation deployment

CI runs:

  1. Build tests (Debug and Release)
  2. Code format verification with clang-format
  3. Documentation structure validation

Note: GPU tests are disabled in CI because standard runners lack NVIDIA GPUs. Tests must be run locally or on self-hosted GPU runners.


## Quick Reference

### Essential Commands

```bash
# Full development cycle
cmake --preset default && cmake --build --preset default && ctest --preset default

# Format all code
find src include tests -name "*.cpp" -o -name "*.cu" -o -name "*.h" -o -name "*.cuh" | \
  xargs clang-format --style=file -i

# Run benchmark
./build-release/benchmark

# Run MNIST demo
python scripts/export_mnist_weights.py -o mnist_weights.bin
./build-release/mnist_demo mnist_weights.bin
```

### Key Files for Common Tasks

| Task | File(s) |
| --- | --- |
| Add new kernel | `include/kernels.cuh`, `src/`, add to `CMakeLists.txt` |
| Update build | `CMakeLists.txt`, `CMakePresets.json` |
| Update docs | `docs/en/`, `docs/zh/`, `index.md` |
| Update specs | `openspec/specs/product/`, `openspec/specs/architecture/` |
| Update config schema | `include/config.h`, `config/*.ini` |

## Why This Matters

### OpenSpec Benefits

- Structured changes: Every change has a proposal, design, and tasks
- Spec versioning: Delta specs track what changed and why
- Verification: Implementation is checked against specs before archiving
- Traceability: The archive preserves history with timestamps

### Preventing AI Hallucinations

AI tends to "freestyle" without proper context. Requiring it to read `/openspec/specs/` first anchors its reasoning to the project's actual requirements and design.

### Constraining Modification Paths

Using `/opsx:propose` before coding keeps documentation and code synchronized: every code change is preceded by a corresponding spec change.



MIT License | A learning project for the CUDA community