Changelog
All notable changes to this project are documented here.
The format follows Keep a Changelog and the project aims to follow Semantic Versioning.
[Unreleased]
Changed
- Consolidated repository governance around openspec/specs/, updated agent instructions, and simplified documentation roles.
- Reworked README, GitHub Pages content, and supporting docs into clearer repository-entry and learning surfaces.
- Began pruning redundant release-history and engineering guidance artifacts in favor of fewer authoritative files.
[2.1.0] - 2026-04-16
Added
- Tensor Core WMMA SGEMM kernel with guarded FP32 fallback for unsupported dimensions
- Benchmark enhancements, including roofline data export and configurable warmup/benchmark iterations
- Google Test coverage for standard kernels, Tensor Core fast path, fallback behavior, and edge cases
- Bilingual documentation and a GitHub Pages documentation site
Changed
- Consolidated source code into src/kernels/, src/utils/, and tests/
- Adopted CMake as the primary build system while retaining the Makefile for quick local runs
- Expanded supported CUDA architecture targets to cover GPU generations from Volta through Hopper
Fixed
- Tensor Core path memory management issues
- Double-buffer synchronization issues
- Grid dimension handling for non-square matrices
[2.0.0] - 2026-03-13
Added
- Bank-conflict-free and double-buffer SGEMM kernels
- CUDA Events-based benchmark infrastructure
- Nsight-oriented profiling support
Changed
- Migrated from an earlier single-file layout to the current modular structure
- Standardized on CUDA 11.0+ and C++17
Removed
- Legacy single-file benchmark script
- SM 6.x support
[1.0.0] - 2025-02-13
Added
- Initial naive and tiled SGEMM kernels
- Basic cuBLAS correctness verification
- First benchmark CLI