Design Document

1. Design Principles

  • Measurement-driven: Performance optimization must be based on benchmarks/data.
  • Low complexity first: Prefer removing inefficiencies through parameter tuning, data-structure choices, and eliminating copies along the data path before reaching for more complex techniques.
  • Reproducible engineering: Build/test/release processes can be consistently reproduced in CI and locally.

2. Data Model and IO

2.1 FastqRecord and FastqBatch

  • Goal: Reduce string copies and frequent allocations.
  • Implementation:
  • fq::io::FastqRecord uses std::string_view to hold views of its fields.
  • fq::io::FastqBatch maintains a contiguous buffer_ and a records_ metadata array.
  • The Reader parses input into a FastqBatch; higher-level processing code only reads and writes FastqRecord views.
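
The layout described above can be sketched as follows. This is a minimal illustration, not the project's actual classes: the field names (id, sequence, quality), the append() and record() methods, and the choice to store offsets in the metadata array are all assumptions made for the example. Offsets are used here so that growth of the buffer cannot invalidate previously stored positions; views are created only on request.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Non-owning views of one record's fields. They are valid only while the
// owning batch is alive and has not been appended to since.
struct FastqRecord {
    std::string_view id;
    std::string_view sequence;
    std::string_view quality;
};

// Hypothetical batch: one contiguous byte buffer plus per-record metadata.
class FastqBatch {
public:
    // Copies the three fields into the shared buffer (one growing allocation
    // instead of three strings per record) and records where each one lives.
    // The arguments must not point into this batch's own buffer.
    void append(std::string_view id, std::string_view seq, std::string_view qual) {
        Meta m{};
        m.idOff = push(id);
        m.idLen = id.size();
        m.seqOff = push(seq);
        m.seqLen = seq.size();
        m.qualOff = push(qual);
        m.qualLen = qual.size();
        records_.push_back(m);
    }

    std::size_t size() const { return records_.size(); }

    // Builds views into buffer_ on demand; no bytes are copied here.
    FastqRecord record(std::size_t i) const {
        const Meta& m = records_[i];
        const std::string_view all(buffer_);
        return FastqRecord{all.substr(m.idOff, m.idLen),
                           all.substr(m.seqOff, m.seqLen),
                           all.substr(m.qualOff, m.qualLen)};
    }

private:
    struct Meta {
        std::size_t idOff, idLen, seqOff, seqLen, qualOff, qualLen;
    };

    // Appends bytes to the buffer and returns the offset where they start.
    std::size_t push(std::string_view s) {
        const std::size_t off = buffer_.size();
        buffer_.append(s);
        return off;
    }

    std::string buffer_;
    std::vector<Meta> records_;
};
```

A caller would fill a batch with append() and then iterate with record(i), treating each returned FastqRecord as a short-lived view rather than storing it across further appends.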

2.2 Reader/Writer

  • Reader: Batch reading and parsing, key parameters:
  • readChunkBytes / zlibBufferBytes / maxBufferBytes
  • Writer: Batch writing, key parameters:
  • zlibBufferBytes / outputBufferBytes

3. Processing Pipeline

3.1 Serial and Parallel Implementation

  • SequentialProcessingPipeline::run() selects an implementation based on threadCount:
  • processSequential(): Single thread (for debugging, small files)
  • processWithTBB(): tbb::parallel_pipeline
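
The dispatch in run() might look roughly like the sketch below. The threshold is an assumption: the document only says the choice depends on threadCount, so the rule "0 or 1 threads means sequential" is illustrative, as are the PipelineMode enum and the callable parameters standing in for processSequential() and processWithTBB().

```cpp
#include <cstddef>
#include <functional>

enum class PipelineMode { Sequential, Tbb };

// Assumed rule: 0 or 1 threads selects the single-threaded path,
// anything larger selects the TBB path.
constexpr PipelineMode selectMode(std::size_t threadCount) {
    return threadCount <= 1 ? PipelineMode::Sequential : PipelineMode::Tbb;
}

// Calls exactly one of the two implementations, passed in as callables so
// the sketch stays independent of the real pipeline class.
inline void run(std::size_t threadCount,
                const std::function<void()>& processSequential,
                const std::function<void()>& processWithTBB) {
    if (selectMode(threadCount) == PipelineMode::Sequential) {
        processSequential();
    } else {
        processWithTBB();
    }
}
```

Keeping the decision in one small function makes the sequential path easy to force during debugging, for example by passing a thread count of 1.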

3.2 Parallel Pipeline Pattern (TBB)

  • source (serial_in_order): Read next FastqBatch.
  • processing (parallel): Apply predicates/mutators to batch and collect statistics.
  • sink (serial_in_order): Write in order and merge statistics.

3.3 Tuning Entry Points (Performance Priority)

  • Primary:
  • batchSize / batchCapacityBytes / maxInFlightBatches / threadCount
  • Secondary (requires evidence):
  • Specialized algorithm optimization, reducing critical path copies
  • Not default targets:
  • High-coupling optimizations like SIMD/prefetching/PGO
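
One reason the primary parameters are tuned together is that they jointly bound memory use. The sketch below works through that arithmetic under a stated assumption: each in-flight batch holds at most batchCapacityBytes of buffer. The TuningOptions struct and peakBatchBytes function are hypothetical, and the numbers in the usage note are placeholders, not recommended values.

```cpp
#include <cstddef>

// Hypothetical grouping of the primary tuning parameters.
struct TuningOptions {
    std::size_t batchSize;           // records per batch
    std::size_t batchCapacityBytes;  // buffer bytes per batch
    std::size_t maxInFlightBatches;  // batches alive in the pipeline at once
    std::size_t threadCount;         // worker threads
};

// Upper bound on batch buffer memory, assuming every in-flight batch is
// filled to capacity. Other allocations (zlib buffers, output buffers) are
// not counted.
constexpr std::size_t peakBatchBytes(const TuningOptions& o) {
    return o.maxInFlightBatches * o.batchCapacityBytes;
}
```

For example, 8 in-flight batches of 4 MiB each give an upper bound of 32 MiB of batch buffers, regardless of threadCount. Raising maxInFlightBatches beyond what the threads can consume therefore costs memory without adding throughput.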

4. Logging

  • Uses spdlog's global configuration; no additional logging framework is needed.
  • fq::logging::init() and setLevel() are exposed as the unified entry points; the CLI calls them after parsing --log-level.

5. Testing and Quality

5.1 Coverage

  • scripts/core/test --coverage: Generate lcov and HTML.
  • CI: Upload coverage artifacts.
  • Threshold gating: Reserved as a future option, to avoid encouraging tests written only to satisfy a coverage threshold.

5.2 Static Analysis

  • Format checking is mandatory.
  • clang-tidy/cppcheck: Report-only at first; findings are then reduced incrementally.

6. Release

  • scripts/tools/package-release: Locally generate dist/*.tar.gz.
  • .github/workflows/release.yml: Tag-triggered, performs packaging and uploads to GitHub Release.