# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/), and this project adheres to [Semantic Versioning](https://semver.org/).

## [Unreleased]

### Added
- Stream Manager for multi-stream concurrent execution
- Batched GEMM operations with stream parallelism
- INT8 quantization support with per-channel scales
- Quantization calibrator for dynamic quantization
- Comprehensive documentation suite
  - Quick Start Guide
  - Architecture documentation
  - Detailed GEMM optimization guide
  - Performance tuning guide
  - API reference
  - Contributing guide
- Configuration file examples (`default`, `high_performance`, `debug`)
## [1.0.0] - 2024-12-31

### Added
- Complete GEMM optimization implementation
  - Level 1: Naive MatMul
  - Level 2: Tiled GEMM with shared memory
  - Level 3: Memory coalescing optimization
  - Level 4: Double buffering for latency hiding
  - Level 5: Register blocking with vectorized loads
  - Level 6: Kernel fusion (MatMul + Bias + ReLU)
  - Level 7: Vectorized GEMM with `float4` loads
- Half-precision (FP16) GEMM with mixed-precision accumulation
- Auto-tuner for automatic kernel selection
- Performance profiler with roofline analysis
- N-dimensional Tensor class with GPU storage
- GPU memory pool with caching
- Thread-safe logging system
- Configuration management system
- InferenceEngine for neural network inference
- MNIST demo application
- Comprehensive benchmark suite
- Test suite built on Google Test
### Infrastructure

- CMake build system with CUDA support
- Optional test building (`BUILD_TESTS` flag)
- Support for CUDA architectures 75, 80, 86, 89, and 90
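The architecture list above maps to compute capabilities 7.5 through 9.0. A minimal sketch of how such a build might be configured; `CMAKE_CUDA_ARCHITECTURES` is the standard CMake mechanism (CMake 3.18+), but the exact variable and target names used in this project's `CMakeLists.txt` are assumptions:

```cmake
# Sketch: build for the supported architectures (CC 7.5–9.0).
cmake_minimum_required(VERSION 3.18)
project(gemm_opt LANGUAGES CXX CUDA)

# Generate device code for all listed architectures.
set(CMAKE_CUDA_ARCHITECTURES 75 80 86 89 90)

# Optional test building, mirroring the BUILD_TESTS flag above.
option(BUILD_TESTS "Build the Google Test suite" ON)
if(BUILD_TESTS)
  enable_testing()
  add_subdirectory(tests)  # hypothetical test directory name
endif()
```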
## [0.1.0] - 2024-12-30

### Added
- Initial project structure
- Basic GEMM kernels (Naive, Tiled)
- Core data structures (`MatrixDesc`, `GemmConfig`)
- CUDA error handling utilities
- `DeviceMemory` RAII wrapper
- Basic benchmark framework
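The `DeviceMemory` wrapper listed above ties a `cudaMalloc` allocation to object lifetime so it is released automatically. A minimal sketch of the RAII pattern it names; the real class's interface is an assumption, not shown in this changelog:

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <stdexcept>

// Sketch: RAII ownership of a device allocation.
class DeviceMemory {
public:
    explicit DeviceMemory(std::size_t bytes) : size_(bytes) {
        if (cudaMalloc(&ptr_, bytes) != cudaSuccess)
            throw std::runtime_error("cudaMalloc failed");
    }
    ~DeviceMemory() { cudaFree(ptr_); }  // cudaFree(nullptr) is a no-op

    // Non-copyable: two owners of one device pointer would double-free.
    DeviceMemory(const DeviceMemory&) = delete;
    DeviceMemory& operator=(const DeviceMemory&) = delete;

    // Movable: transfers ownership and leaves the source empty.
    DeviceMemory(DeviceMemory&& o) noexcept : ptr_(o.ptr_), size_(o.size_) {
        o.ptr_ = nullptr;
        o.size_ = 0;
    }

    void* get() const { return ptr_; }
    std::size_t size() const { return size_; }

private:
    void* ptr_ = nullptr;
    std::size_t size_ = 0;
};
```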
## Version History Summary
| Version | Date | Highlights |
|---|---|---|
| 1.0.0 | 2024-12-31 | Complete implementation with all optimizations |
| 0.1.0 | 2024-12-30 | Initial release with basic functionality |
## Migration Guide

### From 0.1.0 to 1.0.0

No breaking changes. New features are additive.

#### API Changes

None.

#### Deprecated Features

None.
## Known Issues

- Performance on small matrices (dimensions below 256) may be suboptimal
- FP16 GEMM requires a GPU with Compute Capability 7.0 or higher
- The memory pool may hold on to cached allocations longer than necessary
## Roadmap

### Planned for 1.1.0

- Tensor Core support (WMMA API)
- Multi-GPU support
- ONNX model loading

### Planned for 1.2.0

- INT4 quantization
- Sparse matrix support
- Custom activation functions

### Long-term

- Transformer layer support
- Convolution operations
- Graph optimization