Changelog

All notable changes to this project are documented here.

The format follows Keep a Changelog, and the project aims to follow Semantic Versioning.

[Unreleased]

Changed

Consolidated repository governance around openspec/specs/, updated agent instructions, and simplified documentation roles.
Reworked the README, GitHub Pages content, and supporting docs into clearer repository entry points and learning material.
Began pruning redundant release-history and engineering-guidance artifacts in favor of fewer authoritative files.

[2.1.0] - 2026-04-16

Added

Tensor Core WMMA SGEMM kernel with guarded FP32 fallback for unsupported dimensions
Benchmark enhancements, including roofline data export and configurable warmup/benchmark iterations
Google Test coverage for standard kernels, Tensor Core fast path, fallback behavior, and edge cases
Bilingual documentation and a GitHub Pages documentation site

Changed

Consolidated source code into src/kernels/, src/utils/, and tests/
Adopted CMake as the primary build system while retaining the Makefile for quick local runs
Expanded supported CUDA architecture targets to cover GPU generations from Volta through Hopper

Fixed

Tensor Core path memory management issues
Double-buffer synchronization issues
Grid dimension handling for non-square matrices

[2.0.0] - 2026-03-13

Added

Bank-conflict-free and double-buffer SGEMM kernels
CUDA Events-based benchmark infrastructure
Nsight-oriented profiling support

Changed

Migrated from an earlier single-file layout to the current modular structure
Standardized on CUDA 11.0+ and C++17

Removed

Legacy single-file benchmark script
SM 6.x support

[1.0.0] - 2025-02-13

Added

Initial naive and tiled SGEMM kernels
Basic cuBLAS correctness verification
First benchmark CLI