# AGENTS.md - AI Agent Workflow Configuration

## Project Overview

Mini-Inference Engine is a CUDA-based neural network inference engine focused on GEMM (General Matrix Multiply) optimization. It demonstrates progressive CUDA optimization techniques from naive matrix multiplication to highly optimized kernels achieving ~85% of cuBLAS performance.

## Tech Stack

| Component | Version/Standard | Purpose |
| --- | --- | --- |
| C++ | 17 | Host code implementation |
| CUDA | 11.0+ (Toolkit), 7.0+ (Compute Capability) | GPU kernels and device code |
| CMake | 3.18+ | Build system |
| cuBLAS | Bundled with CUDA | Performance baseline and comparison |
| Google Test | 1.14.0 (fetched) | Unit testing framework |

### Supported GPU Architectures

Target compute capabilities: 75, 80, 86, 89, 90 (Turing through Hopper)


## Project Philosophy: OpenSpec Spec-Driven Development

This project follows the OpenSpec Spec-Driven Development (SDD) framework. All code implementations must use the specification documents in /openspec/specs/ as the Single Source of Truth.

### OpenSpec Workflow

Use these slash commands for development:

| Command | Purpose | When to Use |
| --- | --- | --- |
| `/opsx:explore` | Think through ideas, investigate problems | Before proposing a change |
| `/opsx:propose` | Create change with proposal, design, tasks | Starting new work |
| `/opsx:apply` | Implement tasks from a change | Ready to code |
| `/opsx:verify` | Verify implementation matches specs | Before archiving |
| `/opsx:archive` | Archive completed change | Work finished |
| `/opsx:status` | Show status of changes and specs | Check progress |

### Specification Documents (`/openspec/specs/`)

| Directory | Contents | When to Update |
| --- | --- | --- |
| `product/` | Product feature definitions and acceptance criteria | Adding/changing features |
| `architecture/` | Technical design documents (RFCs) | Major architecture changes |
| `api/` | API interface definitions | Changing public interfaces |
| `data/` | Data schemas, model definitions | Changing data structures |
| `testing/` | BDD test specifications | Adding test requirements |

### Key Specification Files

- `product/gemm-optimization-requirements.md` - Core requirements R1-R9
- `product/implementation-plan.md` - Development roadmap
- `architecture/0001-core-architecture.md` - System architecture design
- `architecture/0002-memory-pool.md` - Memory pool design
- `architecture/0003-quantization.md` - INT8 quantization
- `architecture/0004-stream-manager.md` - CUDA stream management
- `architecture/0005-auto-tuner.md` - Auto-tuning system
- `architecture/0006-logger-config-profiler.md` - Infrastructure components
- `architecture/0007-half-precision-gemm.md` - FP16 support
- `architecture/0008-batch-gemm.md` - Batched operations

## AI Agent Workflow Instructions

When you (the AI) are asked to develop a new feature, modify existing functionality, or fix a bug, follow the OpenSpec workflow:

### Quick Start

1. Explore first (optional): Use `/opsx:explore` to think through the problem
2. Propose a change: Use `/opsx:propose <name>` to create a change proposal
3. Implement: Use `/opsx:apply <name>` to work through tasks
4. Verify: Use `/opsx:verify <name>` to check implementation
5. Archive: Use `/opsx:archive <name>` to finalize

### Detailed Workflow

#### Step 1: Review Specifications

- First, read the relevant documents in the `/openspec/specs/` directory
- If the user's request conflicts with existing specs, immediately stop coding and point out the conflict
- Use `/opsx:explore` to investigate and clarify requirements

#### Step 2: Propose Change

- Use `/opsx:propose <name>` to create a change proposal
- This creates artifacts in `openspec/changes/<name>/`:
  - `proposal.md` - What & why
  - `specs/` - Delta specs (ADDED/MODIFIED/REMOVED requirements)
  - `design.md` - Technical approach
  - `tasks.md` - Implementation checklist
- Wait for user confirmation before proceeding

#### Step 3: Implement

- Use `/opsx:apply <name>` to implement tasks
- Comply 100% with the definitions in the specs
- Do not add features not defined in the specs (no gold-plating)
- Mark tasks complete as you go

#### Step 4: Verify & Archive

- Use `/opsx:verify <name>` to check that the implementation matches the specs
- Use `/opsx:archive <name>` to sync specs and archive the change

## Directory Structure

```text
mini-inference-engine/
├── openspec/              # OpenSpec framework
│   ├── config.yaml        # OpenSpec configuration
│   ├── specs/             # Source of truth
│   │   ├── product/       # Requirements
│   │   ├── architecture/  # Technical designs (RFCs)
│   │   ├── api/           # API contracts
│   │   ├── data/          # Data schemas
│   │   └── testing/       # Test specifications
│   ├── changes/           # Active change proposals
│   └── archive/           # Completed changes
├── include/               # Header files (14 .h + 3 .cuh)
│   ├── common.h           # Core data structures, error handling, utilities
│   ├── kernels.cuh        # GEMM kernel declarations and launch wrappers
│   ├── tensor.h           # N-dimensional tensor class
│   ├── inference_engine.h # Neural network inference engine
│   ├── memory_pool.h      # GPU memory pool with caching
│   ├── stream_manager.h   # CUDA stream management
│   ├── config.h           # Configuration file parsing
│   ├── logger.h           # Logging infrastructure
│   ├── profiler.h         # Performance profiling
│   ├── autotuner.h        # Automatic kernel selection
│   ├── batch_gemm.h       # Batched GEMM operations
│   ├── quantization.h     # INT8 quantization support
│   ├── half_gemm.cuh      # FP16 GEMM kernels
│   └── vectorized_gemm.cuh   # Vectorized load optimizations
├── src/                   # Implementation files (11 total)
│   ├── naive_matmul.cu    # Level 1: Baseline implementation
│   ├── tiled_gemm.cu      # Level 2: Shared memory tiling
│   ├── coalesced_gemm.cu  # Level 3: Memory coalescing
│   ├── double_buffer_gemm.cu  # Level 4: Latency hiding
│   ├── optimized_gemm.cu  # Level 5: Register blocking
│   ├── fused_gemm.cu      # Level 6: Operator fusion (MatMul+Bias+ReLU)
│   ├── vectorized_gemm.cu # Level 7: Vectorized loads
│   ├── half_gemm.cu       # FP16 precision kernels
│   ├── tensor.cu          # Tensor operations implementation
│   ├── benchmark.cu       # Performance benchmarking utilities
│   └── inference_engine.cpp  # Engine implementation
├── tests/                 # Unit tests (11 files, 207 test cases)
│   ├── test_gemm.cu       # All GEMM kernel correctness tests
│   ├── test_fusion.cu     # Fusion kernel tests
│   ├── test_advanced.cu   # Advanced features tests
│   ├── test_tensor.cpp    # Tensor operations
│   ├── test_inference.cpp # InferenceEngine tests
│   ├── test_memory_pool.cpp  # MemoryPool tests
│   ├── test_stream_manager.cpp  # StreamManager tests
│   ├── test_config.cpp    # Config parsing tests
│   ├── test_logger.cpp    # Logger tests
│   ├── test_quantization.cpp  # Quantization tests
│   └── test_batch_gemm.cpp  # Batched GEMM tests
├── benchmarks/            # Performance benchmarks (3 files)
│   ├── benchmark.cpp      # Main benchmark runner
│   ├── detailed_benchmark.cu  # Detailed kernel analysis
│   └── mnist_demo.cpp     # MNIST inference demo
├── config/                # Runtime configuration examples
│   ├── default.ini        # Default settings
│   ├── debug.ini          # Debug settings (verbose logging)
│   └── high_performance.ini  # Production-optimized settings
├── scripts/               # Utility scripts
│   └── export_mnist_weights.py  # PyTorch weight export tool
├── docs/                  # User documentation
│   ├── en/                # English documentation (7 files)
│   ├── zh/                # Chinese documentation (7 files)
│   └── releases/          # Release notes
├── .claude/               # Claude Code configuration
│   ├── commands/opsx/     # OpenSpec slash commands
│   ├── skills/openspec/   # OpenSpec skill
│   └── settings.json      # Claude Code settings
├── CMakeLists.txt         # Build configuration
├── CMakePresets.json      # Build presets
├── AGENTS.md              # This file - AI workflow instructions
└── CLAUDE.md              # Claude Code project instructions
```

## Architecture Overview

The project follows a 4-layer architecture:

```text
┌─────────────────────────────────────────────────────────────────────┐
│                       Application Layer                              │
│   Benchmark  │  MNIST Demo  │  Tests  │  Your Application          │
├─────────────────────────────────────────────────────────────────────┤
│                         Engine Layer                                 │
│   InferenceEngine  │  Tensor  │  AutoTuner  │  Profiler            │
├─────────────────────────────────────────────────────────────────────┤
│                         Kernel Layer                                 │
│   Naive │ Tiled │ Coalesced │ DoubleBuffer │ Optimized │ Fused    │
│   Vectorized │ Half-Precision │ Batched │ cuBLAS wrapper          │
├─────────────────────────────────────────────────────────────────────┤
│                     Infrastructure Layer                             │
│   MemoryPool  │  StreamManager  │  Logger  │  Config               │
└─────────────────────────────────────────────────────────────────────┘
```

### GEMM Optimization Path

| Level | Technique | vs cuBLAS | Key Optimization |
| --- | --- | --- | --- |
| 1 | Naive | ~10% | Baseline: each thread computes one element |
| 2 | Tiled | ~20% | Shared memory 32×32 tiles |
| 3 | Coalesced | ~30% | Warp-level memory coalescing |
| 4 | Double Buffer | ~40% | Prefetching to hide latency |
| 5 | Register Blocked | ~65% | Register tiling for compute density |
| 6 | Fused | ~75% | MatMul+Bias+ReLU in single kernel |
| 7 | Vectorized | ~85% | `float4` vectorized loads |

## Build Commands

### Prerequisites

- CUDA Toolkit 11.0 or higher
- CMake 3.18 or higher
- C++17 compatible compiler (GCC 9+, MSVC 2019+)
- NVIDIA GPU with compute capability 7.0+

### Build Presets

```bash
# Debug build with tests (recommended for development)
cmake --preset default
cmake --build --preset default

# Release build without tests (for benchmarking)
cmake --preset release
cmake --build --preset release

# CI build (Release + tests)
cmake --preset ci
cmake --build --preset ci
```

### Manual Build (without presets)

```bash
# Configure
cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON

# Build
cmake --build build --parallel $(nproc)

# Install (optional)
cmake --install build --prefix /usr/local
```

### Build Options

| Option | Default | Description |
| --- | --- | --- |
| `BUILD_TESTS` | ON | Build test suite with Google Test |
| `ENABLE_FAST_MATH` | ON (Release) | Enable `--use_fast_math` for CUDA |
| `CMAKE_CUDA_ARCHITECTURES` | 75;80;86;89;90 | Target GPU architectures |
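As a rough sketch, these options could be wired up in CMake as follows. The option names and defaults come from the table above, but the exact logic in the project's real `CMakeLists.txt` may differ:

```cmake
# Illustrative only -- not copied from this repository's CMakeLists.txt.
option(BUILD_TESTS "Build test suite with Google Test" ON)
option(ENABLE_FAST_MATH "Enable --use_fast_math for CUDA" ON)

# Honor a user-supplied -DCMAKE_CUDA_ARCHITECTURES, otherwise use the defaults.
if(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
  set(CMAKE_CUDA_ARCHITECTURES 75 80 86 89 90)
endif()

if(ENABLE_FAST_MATH)
  # Apply the flag to CUDA sources only.
  add_compile_options($<$<COMPILE_LANGUAGE:CUDA>:--use_fast_math>)
endif()
```

Each option can be overridden at configure time, e.g. `cmake -B build -DBUILD_TESTS=OFF -DCMAKE_CUDA_ARCHITECTURES=86`.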

## Test Commands

```bash
# Run all tests (requires NVIDIA GPU)
ctest --preset default

# Run with verbose output on failure
ctest --preset default --output-on-failure

# Alternative: run the test binary directly
./build/tests
```

Note: Tests require an NVIDIA GPU with CUDA support and cannot run on standard CI runners without GPU access.


## Code Style Guidelines

### Formatting

- Follow `.clang-format` configuration (Google-based style)
- 4-space indentation, 100 column limit
- Use `clang-format --style=file -i <file>` to format files
```bash
# Format a single file
clang-format --style=file -i src/my_file.cu

# Format all source files
find src include tests -name "*.cpp" -o -name "*.cu" -o -name "*.h" -o -name "*.cuh" | \
  xargs clang-format --style=file -i
```

### Naming Conventions

```cpp
// C++ naming
class ClassName;                  // PascalCase
void function_name();             // snake_case
int variable_name;                // snake_case
const int CONSTANT_NAME = 0;      // UPPER_SNAKE_CASE
int member_variable_;             // snake_case with trailing underscore

// CUDA specifics
__global__ void my_kernel();      // snake_case
template<int BLOCK_SIZE>          // Template parameters: UPPER_SNAKE_CASE
__global__ void tiled_kernel();

// Inside kernels:
__shared__ float s_data[256];     // Prefix: s_ for shared memory
float r_sum = 0.0f;               // Prefix: r_ for registers
```

### Code Organization Rules

1. Explicit source list in CMake: Do NOT use `GLOB_RECURSE` for source files. Add new files to the explicit list in `CMakeLists.txt`.

2. Header-only templates: CUDA kernel templates should be in `.cuh` headers when used across multiple translation units.

3. Error handling: Use `CUDA_CHECK()` and `CUBLAS_CHECK()` macros for all CUDA API calls.

4. RAII resources: Use `DeviceMemory` or `PooledMemory` for automatic GPU memory management.


## Testing Strategy

### Test Categories

  1. Correctness Tests - Verify numerical accuracy against CPU reference
  2. Equivalence Tests - All GEMM variants produce identical results
  3. Property Tests - Invariants hold across random inputs
  4. Integration Tests - End-to-end inference scenarios

### Writing Tests

Tests use Google Test framework. Example pattern:

```cpp
#include <gtest/gtest.h>
#include "common.h"
#include "kernels.cuh"

using namespace mini_inference;

class MyTest : public ::testing::Test {
protected:
    void SetUp() override {
        CUDA_CHECK(cudaSetDevice(0));
    }
};

TEST_F(MyTest, TestName) {
    // Arrange
    DeviceMemory d_A(size);

    // Act
    launch_kernel(...);
    CUDA_CHECK(cudaDeviceSynchronize());

    // Assert
    EXPECT_LT(max_error, 1e-5f);
}
```

## Commit Style

Follow Conventional Commits:

| Type | Description |
| --- | --- |
| `feat:` | New features |
| `fix:` | Bug fixes |
| `docs:` | Documentation changes |
| `perf:` | Performance improvements |
| `refactor:` | Code refactoring |
| `test:` | Test changes |
| `chore:` | Build/tool changes |

Include requirement IDs from specs when applicable:

```text
feat(gemm): implement tiling optimization

- Implement 32x32 tile blocking (R2.1)
- Add shared memory loading (R2.2, R2.3)
- Handle boundary conditions (R2.4)

Closes #42
```

## Security Considerations

- GPU Memory: Sensitive data in GPU memory is not automatically cleared. Call `zero()` on buffers handling sensitive information.
- File I/O: Weight files use a custom binary format with magic bytes for validation. Always validate file headers before loading.
- CUDA Errors: All CUDA API errors throw exceptions that must be caught at appropriate boundaries.

## Deployment

### GitHub Pages (Documentation)

Documentation is automatically deployed to GitHub Pages via Jekyll when changes are pushed to `docs/`, `index.md`, `_config.yml`, or `Gemfile`.

### CI/CD

- `.github/workflows/ci.yml` - Build verification on push/PR
- `.github/workflows/pages.yml` - Documentation deployment

CI runs:

  1. Build tests (Debug and Release)
  2. Code format verification with clang-format
  3. Documentation structure validation

Note: GPU tests are disabled in CI because standard runners lack NVIDIA GPUs. Tests must be run locally or on self-hosted GPU runners.


## Quick Reference

### Essential Commands

```bash
# Full development cycle
cmake --preset default && cmake --build --preset default && ctest --preset default

# Format all code
find src include tests -name "*.cpp" -o -name "*.cu" -o -name "*.h" -o -name "*.cuh" | \
  xargs clang-format --style=file -i

# Run benchmark
./build-release/benchmark

# Run MNIST demo
python scripts/export_mnist_weights.py -o mnist_weights.bin
./build-release/mnist_demo mnist_weights.bin
```

### Key Files for Common Tasks

| Task | File(s) |
| --- | --- |
| Add new kernel | `include/kernels.cuh`, `src/`, add to `CMakeLists.txt` |
| Update build | `CMakeLists.txt`, `CMakePresets.json` |
| Update docs | `docs/en/`, `docs/zh/`, `index.md` |
| Update specs | `openspec/specs/product/`, `openspec/specs/architecture/` |
| Update config schema | `include/config.h`, `config/*.ini` |

## Why This Matters

### OpenSpec Benefits

- Structured changes: Every change has a proposal, design, and tasks
- Spec versioning: Delta specs track what changed and why
- Verification: Implementation is checked against specs before archiving
- Traceability: The archive preserves history with timestamps

### Preventing AI Hallucinations

AI tends to "freestyle" without proper context. Requiring it to read `/openspec/specs/` first anchors its reasoning to the project's actual requirements and design.

### Constraining Modification Paths

Using `/opsx:propose` before coding keeps documentation and code synchronized: every code change is preceded by a corresponding spec change.



MIT License | A learning project for the CUDA community