Third Round Optimization (2026-03-10)

Code quality, performance, and documentation improvements.

Changes

Performance

  • Sobel kernel: Moved per-thread local Sobel weight arrays to __constant__ memory — eliminates redundant local memory allocation across all threads
  • Gaussian blur: Replaced #define TILE_SIZE / #define MAX_KERNEL_RADIUS with static constexpr — type-safe, no namespace pollution
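
The Gaussian blur change can be illustrated with a minimal sketch. It assumes placeholder values (16 and 15) and a hypothetical namespace and helper functions; none of these come from the project. The point is only that typed, scoped constants can take part in compile-time checks, which a #define cannot do safely.

```cpp
#include <cassert>
#include <cstddef>

namespace gaussian_sketch {  // hypothetical namespace, for illustration only

// Typed, scoped constants. Unlike #define, they respect namespaces and
// have a real type. The values are placeholders, not the project's settings.
static constexpr int TILE_SIZE = 16;
static constexpr int MAX_KERNEL_RADIUS = 15;

// Width of a tile once a halo of `radius` pixels is added on both sides.
constexpr int paddedTileWidth(int radius) {
    return TILE_SIZE + 2 * radius;
}

// Compile-time check: the largest padded float tile must fit in an assumed
// 48 KiB budget (the budget itself is an assumption of this sketch).
static_assert(sizeof(float) *
                  static_cast<std::size_t>(paddedTileWidth(MAX_KERNEL_RADIUS)) *
                  static_cast<std::size_t>(paddedTileWidth(MAX_KERNEL_RADIUS)) <=
                  48u * 1024u,
              "padded tile exceeds the assumed memory budget");

// Number of taps in a 1D kernel of the given radius.
inline int kernelWidth(int radius) {
    assert(radius >= 0 && radius <= MAX_KERNEL_RADIUS);
    return 2 * radius + 1;
}

}  // namespace gaussian_sketch
```

With macros, the same names would be visible in every file that includes the header and would carry no type information for the static_assert to rely on.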

Bug Fixes

  • Pipeline findInputOutputNodes(): Fixed bug where manually-set input nodes (via setInput()) with dependencies were cleared on each execute() call — now preserves them
  • Pipeline execute(): Merged two redundant validation loops (null operator check + input validity) into a single coherent pass
  • DAGScheduler destructor: Added null checks before cudaStreamDestroy / cudaEventDestroy to prevent undefined behavior if creation failed
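
The destructor fix follows a common pattern: release only the handles that were actually created. The sketch below is not the DAGScheduler code. FakeStream, destroyStream, and SchedulerSketch are hypothetical stand-ins (for cudaStream_t, cudaStreamDestroy, and the scheduler) so that the example compiles without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for an opaque runtime handle such as cudaStream_t (hypothetical).
struct FakeStream {};
using StreamHandle = FakeStream*;

// Counts destroy calls so the behaviour can be observed without a GPU.
inline int& destroyCallCount() {
    static int count = 0;
    return count;
}

// Stand-in for a C-style destroy function. It expects a handle that was
// successfully created, so callers must filter out null slots.
inline void destroyStream(StreamHandle s) {
    assert(s != nullptr);
    ++destroyCallCount();
    delete s;
}

class SchedulerSketch {
public:
    // `failAt` simulates a creation failure at that index, leaving a null slot.
    SchedulerSketch(std::size_t numStreams, std::size_t failAt)
        : streams_(numStreams, nullptr) {
        for (std::size_t i = 0; i < numStreams; ++i) {
            if (i != failAt) {
                streams_[i] = new FakeStream{};
            }
        }
    }

    ~SchedulerSketch() {
        for (StreamHandle s : streams_) {
            if (s != nullptr) {  // skip slots whose creation failed
                destroyStream(s);
            }
        }
    }

    // Non-copyable: each handle must be destroyed exactly once.
    SchedulerSketch(const SchedulerSketch&) = delete;
    SchedulerSketch& operator=(const SchedulerSketch&) = delete;

private:
    std::vector<StreamHandle> streams_;
};
```

Initializing every slot to null up front is what makes the guard reliable: a partially constructed set of handles can then be torn down without tracking how far creation got.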

Code Quality

  • MemoryManager: Simplified redundant tracking — replaced separate pinnedSizes_/pinnedFlags_/deviceSizes_ maps with unified MemoryBlock-based activePinnedAllocs_/activeDeviceAllocs_ maps
  • CMake: Added MSVC-compatible compile options via $<CXX_COMPILER_ID:MSVC> generator expressions (/O2, /W4)
  • CMake: Added testPresets to CMakePresets.json; ctest --preset default now works as documented
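
The MemoryManager simplification can be sketched as a single map of per-allocation records. Only the names MemoryBlock and activePinnedAllocs_ come from the entry above; the field layout, the class name MemoryManagerSketch, and its methods are assumptions, and std::malloc stands in for a pinned-memory allocator so the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>

// One record per live allocation, replacing parallel size and flag maps.
// The fields are assumed for this sketch.
struct MemoryBlock {
    void* ptr;
    std::size_t size;
    bool pinned;
};

class MemoryManagerSketch {
public:
    MemoryManagerSketch() = default;
    MemoryManagerSketch(const MemoryManagerSketch&) = delete;
    MemoryManagerSketch& operator=(const MemoryManagerSketch&) = delete;

    // Release anything the caller forgot to free.
    ~MemoryManagerSketch() {
        for (auto& entry : activePinnedAllocs_) {
            std::free(entry.second.ptr);
        }
    }

    // std::malloc is a stand-in for a pinned host allocation.
    void* allocatePinned(std::size_t bytes) {
        void* p = std::malloc(bytes);
        if (p == nullptr) {
            return nullptr;
        }
        activePinnedAllocs_[p] = MemoryBlock{p, bytes, true};
        return p;
    }

    // Returns false for unknown pointers, so a double free is rejected
    // instead of reaching the allocator.
    bool freePinned(void* p) {
        auto it = activePinnedAllocs_.find(p);
        if (it == activePinnedAllocs_.end()) {
            return false;
        }
        std::free(it->second.ptr);
        activePinnedAllocs_.erase(it);
        return true;
    }

    // Total bytes currently tracked, computed from the single map.
    std::size_t pinnedBytesInUse() const {
        std::size_t total = 0;
        for (const auto& entry : activePinnedAllocs_) {
            total += entry.second.size;
        }
        return total;
    }

private:
    std::unordered_map<void*, MemoryBlock> activePinnedAllocs_;
};
```

Keeping size and flags in one record means an allocation is either fully tracked or not tracked at all, which removes the risk of the separate maps drifting out of sync.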

CI

  • ci.yml: Added a ctest step after the build, marked continue-on-error because the tests require a GPU

Documentation

  • README.md: Fixed wrong include path (pipeline/pipeline.h → pipeline.h) and outdated API in usage example
  • README.md: Added CI/Docs badges, operator table, GPU architecture table, expanded project structure with per-file descriptions, architecture diagram, engineering quality section
  • index.md: Fixed incorrect build paths in quick start section (build/release/ → build-release/)
  • .gitignore: Added .cache/ directory