# AGENTS.md - AI Agent Workflow Configuration

## Project Overview
Mini-Inference Engine is a CUDA-based neural network inference engine focused on GEMM (General Matrix Multiply) optimization. It demonstrates progressive CUDA optimization techniques from naive matrix multiplication to highly optimized kernels achieving ~85% of cuBLAS performance.
## Tech Stack
| Component | Version/Standard | Purpose |
|---|---|---|
| C++ | 17 | Host code implementation |
| CUDA | 11.0+ (Toolkit), 7.0+ (Compute Capability) | GPU kernels and device code |
| CMake | 3.18+ | Build system |
| cuBLAS | Bundled with CUDA | Performance baseline and comparison |
| Google Test | 1.14.0 (fetched) | Unit testing framework |
### Supported GPU Architectures
Target compute capabilities: 75, 80, 86, 89, 90 (Turing through Hopper)
## Project Philosophy: OpenSpec Spec-Driven Development
This project follows the OpenSpec Spec-Driven Development (SDD) framework. All code implementations must use the specification documents in /openspec/specs/ as the Single Source of Truth.
### OpenSpec Workflow
Use these slash commands for development:
| Command | Purpose | When to Use |
|---|---|---|
| `/opsx:explore` | Think through ideas, investigate problems | Before proposing a change |
| `/opsx:propose` | Create change with proposal, design, tasks | Starting new work |
| `/opsx:apply` | Implement tasks from a change | Ready to code |
| `/opsx:verify` | Verify implementation matches specs | Before archiving |
| `/opsx:archive` | Archive completed change | Work finished |
| `/opsx:status` | Show status of changes and specs | Check progress |
### Specification Documents (`/openspec/specs/`)
| Directory | Contents | When to Update |
|---|---|---|
| `product/` | Product feature definitions and acceptance criteria | Adding/changing features |
| `architecture/` | Technical design documents (RFCs) | Major architecture changes |
| `api/` | API interface definitions | Changing public interfaces |
| `data/` | Data schemas, model definitions | Changing data structures |
| `testing/` | BDD test specifications | Adding test requirements |
### Key Specification Files
- `product/gemm-optimization-requirements.md` - Core requirements R1-R9
- `product/implementation-plan.md` - Development roadmap
- `architecture/0001-core-architecture.md` - System architecture design
- `architecture/0002-memory-pool.md` - Memory pool design
- `architecture/0003-quantization.md` - INT8 quantization
- `architecture/0004-stream-manager.md` - CUDA stream management
- `architecture/0005-auto-tuner.md` - Auto-tuning system
- `architecture/0006-logger-config-profiler.md` - Infrastructure components
- `architecture/0007-half-precision-gemm.md` - FP16 support
- `architecture/0008-batch-gemm.md` - Batched operations
## AI Agent Workflow Instructions
When you (the AI) are asked to develop a new feature, modify existing functionality, or fix a bug, follow the OpenSpec workflow:
### Quick Start
- Explore first (optional): Use `/opsx:explore` to think through the problem
- Propose a change: Use `/opsx:propose <name>` to create a change proposal
- Implement: Use `/opsx:apply <name>` to work through tasks
- Verify: Use `/opsx:verify <name>` to check implementation
- Archive: Use `/opsx:archive <name>` to finalize
### Detailed Workflow
#### Step 1: Review Specifications
- First, read the relevant documents in the `/openspec/specs/` directory
- If the user's request conflicts with existing specs, immediately stop coding and point out the conflict
- Use `/opsx:explore` to investigate and clarify requirements
#### Step 2: Propose Change
- Use `/opsx:propose <name>` to create a change proposal
- This creates artifacts in `openspec/changes/<name>/`:
  - `proposal.md` - What & why
  - `specs/` - Delta specs (ADDED/MODIFIED/REMOVED requirements)
  - `design.md` - Technical approach
  - `tasks.md` - Implementation checklist
- Wait for user confirmation before proceeding
#### Step 3: Implement
- Use `/opsx:apply <name>` to implement tasks
- Comply 100% with the definitions in the specs
- Do not add features not defined in the specs (No Gold-Plating)
- Mark tasks complete as you go
#### Step 4: Verify & Archive
- Use `/opsx:verify <name>` to check that the implementation matches the specs
- Use `/opsx:archive <name>` to sync specs and archive the change
## Directory Structure

```
mini-inference-engine/
├── openspec/ # OpenSpec framework
│ ├── config.yaml # OpenSpec configuration
│ ├── specs/ # Source of truth
│ │ ├── product/ # Requirements
│ │ ├── architecture/ # Technical designs (RFCs)
│ │ ├── api/ # API contracts
│ │ ├── data/ # Data schemas
│ │ └── testing/ # Test specifications
│ ├── changes/ # Active change proposals
│ └── archive/ # Completed changes
├── include/ # Header files (14 .h + 3 .cuh)
│ ├── common.h # Core data structures, error handling, utilities
│ ├── kernels.cuh # GEMM kernel declarations and launch wrappers
│ ├── tensor.h # N-dimensional tensor class
│ ├── inference_engine.h # Neural network inference engine
│ ├── memory_pool.h # GPU memory pool with caching
│ ├── stream_manager.h # CUDA stream management
│ ├── config.h # Configuration file parsing
│ ├── logger.h # Logging infrastructure
│ ├── profiler.h # Performance profiling
│ ├── autotuner.h # Automatic kernel selection
│ ├── batch_gemm.h # Batched GEMM operations
│ ├── quantization.h # INT8 quantization support
│ ├── half_gemm.cuh # FP16 GEMM kernels
│ └── vectorized_gemm.cuh # Vectorized load optimizations
├── src/ # Implementation files (11 total)
│ ├── naive_matmul.cu # Level 1: Baseline implementation
│ ├── tiled_gemm.cu # Level 2: Shared memory tiling
│ ├── coalesced_gemm.cu # Level 3: Memory coalescing
│ ├── double_buffer_gemm.cu # Level 4: Latency hiding
│ ├── optimized_gemm.cu # Level 5: Register blocking
│ ├── fused_gemm.cu # Level 6: Operator fusion (MatMul+Bias+ReLU)
│ ├── vectorized_gemm.cu # Level 7: Vectorized loads
│ ├── half_gemm.cu # FP16 precision kernels
│ ├── tensor.cu # Tensor operations implementation
│ ├── benchmark.cu # Performance benchmarking utilities
│ └── inference_engine.cpp # Engine implementation
├── tests/ # Unit tests (11 files, 207 test cases)
│ ├── test_gemm.cu # All GEMM kernel correctness tests
│ ├── test_fusion.cu # Fusion kernel tests
│ ├── test_advanced.cu # Advanced features tests
│ ├── test_tensor.cpp # Tensor operations
│ ├── test_inference.cpp # InferenceEngine tests
│ ├── test_memory_pool.cpp # MemoryPool tests
│ ├── test_stream_manager.cpp # StreamManager tests
│ ├── test_config.cpp # Config parsing tests
│ ├── test_logger.cpp # Logger tests
│ ├── test_quantization.cpp # Quantization tests
│ └── test_batch_gemm.cpp # Batched GEMM tests
├── benchmarks/ # Performance benchmarks (3 files)
│ ├── benchmark.cpp # Main benchmark runner
│ ├── detailed_benchmark.cu # Detailed kernel analysis
│ └── mnist_demo.cpp # MNIST inference demo
├── config/ # Runtime configuration examples
│ ├── default.ini # Default settings
│ ├── debug.ini # Debug settings (verbose logging)
│ └── high_performance.ini # Production-optimized settings
├── scripts/ # Utility scripts
│ └── export_mnist_weights.py # PyTorch weight export tool
├── docs/ # User documentation
│ ├── en/ # English documentation (7 files)
│ ├── zh/ # Chinese documentation (7 files)
│ └── releases/ # Release notes
├── .claude/ # Claude Code configuration
│ ├── commands/opsx/ # OpenSpec slash commands
│ ├── skills/openspec/ # OpenSpec skill
│ └── settings.json # Claude Code settings
├── CMakeLists.txt # Build configuration
├── CMakePresets.json # Build presets
├── AGENTS.md # This file - AI workflow instructions
└── CLAUDE.md # Claude Code project instructions
```

## Architecture Overview
The project follows a 4-layer architecture:

```
┌─────────────────────────────────────────────────────────────────────┐
│ Application Layer │
│ Benchmark │ MNIST Demo │ Tests │ Your Application │
├─────────────────────────────────────────────────────────────────────┤
│ Engine Layer │
│ InferenceEngine │ Tensor │ AutoTuner │ Profiler │
├─────────────────────────────────────────────────────────────────────┤
│ Kernel Layer │
│ Naive │ Tiled │ Coalesced │ DoubleBuffer │ Optimized │ Fused │
│ Vectorized │ Half-Precision │ Batched │ cuBLAS wrapper │
├─────────────────────────────────────────────────────────────────────┤
│ Infrastructure Layer │
│ MemoryPool │ StreamManager │ Logger │ Config │
└─────────────────────────────────────────────────────────────────────┘
```

### GEMM Optimization Path
| Level | Technique | vs cuBLAS | Key Optimization |
|---|---|---|---|
| 1 | Naive | ~10% | Baseline: each thread computes one element |
| 2 | Tiled | ~20% | Shared memory 32×32 tiles |
| 3 | Coalesced | ~30% | Warp-level memory coalescing |
| 4 | Double Buffer | ~40% | Prefetching to hide latency |
| 5 | Register Blocked | ~65% | Register tiling for compute density |
| 6 | Fused | ~75% | MatMul+Bias+ReLU in single kernel |
| 7 | Vectorized | ~85% | float4 vectorized loads |
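To make the first two levels concrete, here is a minimal sketch of the Level 1 and Level 2 ideas. The kernel names, signatures, and tile handling below are illustrative assumptions, not the project's actual code; the real implementations live in `src/naive_matmul.cu` and `src/tiled_gemm.cu` and add further refinements.

```cuda
// Level 1 idea (illustrative): one thread per C element, all operands read from global memory.
__global__ void naive_gemm_sketch(const float* A, const float* B, float* C,
                                  int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            acc += A[row * K + k] * B[k * N + col];  // K global loads of A and B per thread
        }
        C[row * N + col] = acc;
    }
}

// Level 2 idea (illustrative): stage TILE x TILE blocks of A and B in shared memory,
// so each element is fetched from global memory once per tile instead of once per thread.
template <int TILE>
__global__ void tiled_gemm_sketch(const float* A, const float* B, float* C,
                                  int M, int N, int K) {
    __shared__ float s_a[TILE][TILE];
    __shared__ float s_b[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        s_a[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        s_b[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k) {
            acc += s_a[threadIdx.y][k] * s_b[k][threadIdx.x];  // operands now come from shared memory
        }
        __syncthreads();
    }
    if (row < M && col < N) {
        C[row * N + col] = acc;
    }
}
```

Levels 3-7 keep this tiled skeleton and successively add coalesced access patterns, double-buffered prefetching, register blocking, a fused bias+ReLU epilogue, and `float4` vectorized loads.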
## Build Commands
### Prerequisites
- CUDA Toolkit 11.0 or higher
- CMake 3.18 or higher
- C++17 compatible compiler (GCC 9+, MSVC 2019+)
- NVIDIA GPU with compute capability 7.0+
### Build Presets
```bash
# Debug build with tests (recommended for development)
cmake --preset default
cmake --build --preset default

# Release build without tests (for benchmarking)
cmake --preset release
cmake --build --preset release

# CI build (Release + tests)
cmake --preset ci
cmake --build --preset ci
```
### Manual Build (without presets)
```bash
# Configure
cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=ON

# Build
cmake --build build --parallel $(nproc)

# Install (optional)
cmake --install build --prefix /usr/local
```
### Build Options
| Option | Default | Description |
|---|---|---|
| `BUILD_TESTS` | ON | Build test suite with Google Test |
| `ENABLE_FAST_MATH` | ON (Release) | Enable `--use_fast_math` for CUDA |
| `CMAKE_CUDA_ARCHITECTURES` | 75;80;86;89;90 | Target GPU architectures |
## Test Commands
```bash
# Run all tests (requires NVIDIA GPU)
ctest --preset default

# Run with verbose output on failure
ctest --preset default --output-on-failure

# Alternative: run test binary directly
./build/tests
```
Note: Tests require an NVIDIA GPU with CUDA support and cannot run on standard CI runners without GPU access.
## Code Style Guidelines
### Formatting
- Follow the `.clang-format` configuration (Google-based style)
- 4-space indentation, 100-column limit
- Use `clang-format --style=file -i <file>` to format files
```bash
# Format a single file
clang-format --style=file -i src/my_file.cu

# Format all source files
find src include tests -name "*.cpp" -o -name "*.cu" -o -name "*.h" -o -name "*.cuh" | \
    xargs clang-format --style=file -i
```
### Naming Conventions
```cpp
// C++ naming
class ClassName;                // PascalCase
void function_name();           // snake_case
int variable_name;              // snake_case
const int CONSTANT_NAME = 256;  // UPPER_SNAKE_CASE
int member_variable_;           // snake_case with trailing underscore

// CUDA specifics
__global__ void my_kernel();    // snake_case
template<int BLOCK_SIZE>        // Template parameters: UPPER_SNAKE_CASE
__shared__ float s_data[256];   // Prefix: s_ for shared memory
float r_sum = 0.0f;             // Prefix: r_ for registers
```
### Code Organization Rules
- Explicit source list in CMake: Do NOT use `GLOB_RECURSE` for source files. Add new files to the explicit list in `CMakeLists.txt`.
- Header-only templates: CUDA kernel templates should be in `.cuh` headers when used across multiple translation units.
- Error handling: Use the `CUDA_CHECK()` and `CUBLAS_CHECK()` macros for all CUDA API calls (see the sketch after this list).
- RAII resources: Use `DeviceMemory` or `PooledMemory` for automatic GPU memory management.
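A minimal sketch of how the last two rules combine in practice. The exact definitions of `CUDA_CHECK()` and `DeviceMemory` live in `include/common.h`; the constructor and accessor used below are assumptions about their shape, not the project's confirmed API.

```cpp
#include <cuda_runtime.h>
#include <vector>
#include "common.h"  // CUDA_CHECK, DeviceMemory (assumed to be declared here)

using namespace mini_inference;

void upload_input(const std::vector<float>& host_data) {
    // RAII: DeviceMemory is assumed to allocate in its constructor and free in its
    // destructor, so the buffer is released even if a CUDA_CHECK below throws.
    DeviceMemory d_input(host_data.size() * sizeof(float));

    // Every raw CUDA API call goes through CUDA_CHECK so failures surface immediately.
    CUDA_CHECK(cudaMemcpy(d_input.get(),  // .get() accessor name is an assumption
                          host_data.data(),
                          host_data.size() * sizeof(float),
                          cudaMemcpyHostToDevice));
}
```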
## Testing Strategy
### Test Categories
- Correctness Tests - Verify numerical accuracy against CPU reference
- Equivalence Tests - All GEMM variants produce identical results
- Property Tests - Invariants hold across random inputs
- Integration Tests - End-to-end inference scenarios
### Writing Tests
Tests use Google Test framework. Example pattern:
```cpp
#include <gtest/gtest.h>
#include "common.h"
#include "kernels.cuh"

using namespace mini_inference;

class MyTest : public ::testing::Test {
protected:
    void SetUp() override {
        CUDA_CHECK(cudaSetDevice(0));
    }
};

TEST_F(MyTest, TestName) {
    // Arrange
    DeviceMemory d_A(size);

    // Act
    launch_kernel(...);
    CUDA_CHECK(cudaDeviceSynchronize());

    // Assert
    EXPECT_LT(max_error, 1e-5f);
}
```
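In the pattern above, `size`, `launch_kernel(...)`, and `max_error` are placeholders. For the correctness tests, `max_error` would typically come from comparing kernel output against a CPU reference; the helpers below are a hedged sketch of that comparison (names and structure are assumptions, the real checks live in `tests/test_gemm.cu`).

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative CPU reference for C = A * B (row-major) and a max-abs-error helper.
// The project's actual tests may structure this differently.
void cpu_gemm_reference(const std::vector<float>& A, const std::vector<float>& B,
                        std::vector<float>& C, int M, int N, int K) {
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}

float max_abs_error(const std::vector<float>& expected, const std::vector<float>& actual) {
    float max_err = 0.0f;
    for (size_t i = 0; i < expected.size(); ++i) {
        max_err = std::max(max_err, std::fabs(expected[i] - actual[i]));
    }
    return max_err;
}
```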
## Commit Style
Follow Conventional Commits:
| Type | Description |
|---|---|
| `feat:` | New features |
| `fix:` | Bug fixes |
| `docs:` | Documentation changes |
| `perf:` | Performance improvements |
| `refactor:` | Code refactoring |
| `test:` | Test changes |
| `chore:` | Build/tool changes |
Include requirement IDs from specs when applicable:
```text
feat(gemm): implement tiling optimization

- Implement 32x32 tile blocking (R2.1)
- Add shared memory loading (R2.2, R2.3)
- Handle boundary conditions (R2.4)

Closes #42
```
## Security Considerations
- GPU Memory: Sensitive data in GPU memory is not automatically cleared. Call `zero()` on buffers handling sensitive information (sketched below).
- File I/O: Weight files use a custom binary format with magic bytes for validation. Always validate file headers before loading.
- CUDA Errors: All CUDA API errors throw exceptions that must be caught at appropriate boundaries.
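A minimal sketch combining the first and last points: clear a sensitive buffer before its memory is released and catch CUDA exceptions at an application boundary. The `zero()` member, the buffer type, and the exception base class are assumptions about the project's API; only the general pattern is implied by the notes above.

```cpp
#include <cstddef>
#include <exception>
#include <iostream>
#include "common.h"  // DeviceMemory, CUDA_CHECK (assumed)

using namespace mini_inference;

void run_with_sensitive_weights(std::size_t weight_bytes) {
    try {
        DeviceMemory d_weights(weight_bytes);  // hypothetical buffer holding sensitive weights
        // ... load weights, run inference ...
        d_weights.zero();                      // clear sensitive data before the memory is reused or freed
    } catch (const std::exception& e) {        // CUDA errors are assumed to derive from std::exception
        std::cerr << "CUDA error during inference: " << e.what() << '\n';
        // handle or rethrow at this application boundary
    }
}
```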
## Deployment
### GitHub Pages (Documentation)
Documentation is automatically deployed to GitHub Pages via Jekyll when changes are pushed to docs/, index.md, _config.yml, or Gemfile.
### CI/CD
- `.github/workflows/ci.yml` - Build verification on push/PR
- `.github/workflows/pages.yml` - Documentation deployment
CI runs:
- Build tests (Debug and Release)
- Code format verification with clang-format
- Documentation structure validation
Note: GPU tests are disabled in CI because standard runners lack NVIDIA GPUs. Tests must be run locally or on self-hosted GPU runners.
## Quick Reference
### Essential Commands
```bash
# Full development cycle
cmake --preset default && cmake --build --preset default && ctest --preset default

# Format all code
find src include tests -name "*.cpp" -o -name "*.cu" -o -name "*.h" -o -name "*.cuh" | \
    xargs clang-format --style=file -i

# Run benchmark
./build-release/benchmark

# Run MNIST demo
python scripts/export_mnist_weights.py -o mnist_weights.bin
./build-release/mnist_demo mnist_weights.bin
```
### Key Files for Common Tasks
| Task | File(s) |
|---|---|
| Add new kernel | include/kernels.cuh, src/, add to CMakeLists.txt |
| Update build | CMakeLists.txt, CMakePresets.json |
| Update docs | docs/en/, docs/zh/, index.md |
| Update specs | openspec/specs/product/, openspec/specs/architecture/ |
| Update config schema | include/config.h, config/*.ini |
## Why This Matters
### OpenSpec Benefits
- Structured changes: Every change has proposal, design, tasks
- Spec versioning: Delta specs track what changed and why
- Verification: Check implementation matches specs before archiving
- Traceability: Archive preserves history with timestamps
### Preventing AI Hallucinations
AI tends to "freestyle" without proper context. Requiring it to read `/openspec/specs/` first anchors its reasoning in the project's actual requirements.
### Constraining Modification Paths
Requiring `/opsx:propose` before coding keeps the specification documents and the code in sync.