
Statistics API

Namespace: fq::statistic


StatisticCalculatorInterface

Abstract interface for statistical calculation tasks. Create instances via the factory function.

Factory Creation

```cpp
// Configure the statistics task.
fq::statistic::StatisticOptions options;
options.inputFastqPath = "input.fastq.gz";
options.outputStatPath = "output.stat.txt";
options.signatureReportPath = "output.signatures.tsv";  // optional sidecar
options.threadCount = 4;

// Create a calculator through the factory and run it.
auto calculator = fq::statistic::createStatisticCalculator(options);
calculator->run();
```

Interface Definition

```cpp
class StatisticCalculatorInterface {
public:
    virtual ~StatisticCalculatorInterface() = default;
    virtual void run() = 0;
};
```

StatisticOptions

Statistics task configuration.

| Field | Type | Description |
| --- | --- | --- |
| inputFastqPath | std::string | Input FASTQ file path |
| outputStatPath | std::string | Output statistics file path |
| signatureReportPath | std::string | Optional signature sidecar path |
| batchSize | uint32_t | Number of records processed per batch |
| signatureKmerSize | size_t | Head-kmer size used by the sidecar |
| maxReportedSignatures | size_t | Maximum number of signature rows to emit |
| duplicateEstimateSampleModulo | size_t | Sampling modulo used for duplicate estimation |
| threadCount | uint32_t | Number of threads |
| executionBackend | ExecutionBackend | Currently supports OneTbb |
| memoryResourcePolicy | MemoryResourcePolicy | Currently supports ObjectPool |
| allocationTelemetryEnabled | bool | Emits memory telemetry headers in the text report |
| readChunkBytes | size_t | Reader chunk size |
| zlibBufferBytes | size_t | zlib internal buffer size |
| batchCapacityBytes | size_t | Per-batch buffer capacity |
| memoryLimitBytes | size_t | Memory budget used to resolve in-flight batches |
| maxInFlightBatches | size_t | Explicit in-flight batch cap (0 means auto) |
| qualityEncoding | int | Quality encoding offset (typically 33) |
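
The table says that memoryLimitBytes is used to resolve the number of in-flight batches and that maxInFlightBatches set to 0 means auto. The sketch below shows one way such a resolution could work. The rule itself (budget divided by per-batch capacity, with a floor of one) and the helper name `resolveInFlightBatches` are assumptions for illustration, not the library's documented behavior.

```cpp
#include <algorithm>
#include <cstddef>

namespace options_sketch {

// Hypothetical helper, not part of fq::statistic.
// Assumed rule: an explicit cap wins; otherwise the count is derived from
// the memory budget divided by the per-batch capacity, never below one.
inline std::size_t resolveInFlightBatches(std::size_t memoryLimitBytes,
                                          std::size_t batchCapacityBytes,
                                          std::size_t maxInFlightBatches) {
    if (maxInFlightBatches != 0) {
        return maxInFlightBatches;  // explicit cap (0 means auto)
    }
    if (batchCapacityBytes == 0) {
        return 1;  // avoid division by zero; fall back to a single batch
    }
    return std::max<std::size_t>(1, memoryLimitBytes / batchCapacityBytes);
}

}  // namespace options_sketch
```

Under this assumed rule, a 64 MiB budget with 16 MiB batches would allow four batches in flight, while any non-zero maxInFlightBatches would override the computed value.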

StatisticInterface

Low-level abstract interface for statistical calculation. Implement it to plug in custom statistics logic.

```cpp
class StatisticInterface {
public:
    using Batch = fq::io::FastqBatch;
    using Result = FqStatisticResult;

    virtual ~StatisticInterface() = default;
    virtual auto calculateStats(const Batch& batch) -> Result = 0;
};
```
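
As a rough illustration of a custom extension, the sketch below implements the interface for a simple length statistic. Because the real headers are not shown here, `FastqRecord`, `FastqBatch`, and `FqStatisticResult` are simplified stand-ins with assumed shapes, and `LengthStatistic` is a hypothetical class; only the interface signature is taken from the definition above.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

namespace custom_sketch {

// Stand-ins for fq::io::FastqBatch and FqStatisticResult (assumed shapes).
struct FastqRecord {
    std::string sequence;
    std::string quality;
};
struct FastqBatch {
    std::vector<FastqRecord> records;
};
struct FqStatisticResult {
    std::uint64_t readCount = 0;
    std::uint64_t totalBases = 0;
    std::uint32_t maxReadLength = 0;
};

// Same shape as the interface shown above, bound to the stand-in types.
class StatisticInterface {
public:
    using Batch = FastqBatch;
    using Result = FqStatisticResult;

    virtual ~StatisticInterface() = default;
    virtual auto calculateStats(const Batch& batch) -> Result = 0;
};

// Hypothetical custom statistic: counts reads and bases, tracks the longest read.
class LengthStatistic final : public StatisticInterface {
public:
    auto calculateStats(const Batch& batch) -> Result override {
        Result result;
        for (const auto& record : batch.records) {
            const auto length = static_cast<std::uint32_t>(record.sequence.size());
            ++result.readCount;
            result.totalBases += length;
            result.maxReadLength = std::max(result.maxReadLength, length);
        }
        return result;
    }
};

}  // namespace custom_sketch
```

The implementation keeps no state between calls, so one instance could be invoked on many batches concurrently, which fits a parallel processing stage.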

FqStatisticResult

The public header only forward-declares FqStatisticResult; the table below describes the current result semantics used by workers and aggregators rather than a stable ABI layout.

| Field | Type | Description |
| --- | --- | --- |
| readCount | uint64_t | Total read count |
| totalBases | uint64_t | Total base count |
| maxReadLength | uint32_t | Maximum read length |
| duplicateSampledReads | uint64_t | Number of sampled duplicate reads |
| posQualityDist | vector<uint64_t> | Flattened position-quality distribution |
| posBaseDist | vector<uint64_t> | Flattened position-base distribution |
| headKmerCounts | map<string, uint64_t> | Bounded head-kmer signature counts |

FqStatisticResult supports operator+= for merging the results of multiple batches.
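
To make the merge semantics concrete, here is a minimal sketch of what such an operator+= could look like on a stand-in struct that mirrors the fields listed above. The merge rules are assumptions: counters are summed, the maximum read length takes the larger value, the flattened distributions are added element-wise (assuming both sides use the same row width), and k-mer counts are summed per key. The real type may also enforce the bound on headKmerCounts, which this sketch does not.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

namespace merge_sketch {

// Stand-in mirroring the documented fields; not the real definition.
struct FqStatisticResult {
    std::uint64_t readCount = 0;
    std::uint64_t totalBases = 0;
    std::uint32_t maxReadLength = 0;
    std::uint64_t duplicateSampledReads = 0;
    std::vector<std::uint64_t> posQualityDist;
    std::vector<std::uint64_t> posBaseDist;
    std::map<std::string, std::uint64_t> headKmerCounts;

    FqStatisticResult& operator+=(const FqStatisticResult& other) {
        readCount += other.readCount;
        totalBases += other.totalBases;
        duplicateSampledReads += other.duplicateSampledReads;
        maxReadLength = std::max(maxReadLength, other.maxReadLength);
        addInto(posQualityDist, other.posQualityDist);
        addInto(posBaseDist, other.posBaseDist);
        for (const auto& entry : other.headKmerCounts) {
            headKmerCounts[entry.first] += entry.second;
        }
        return *this;
    }

private:
    // Element-wise sum; grows the destination when the source is longer.
    static void addInto(std::vector<std::uint64_t>& dst,
                        const std::vector<std::uint64_t>& src) {
        if (dst.size() < src.size()) {
            dst.resize(src.size(), 0);
        }
        for (std::size_t i = 0; i < src.size(); ++i) {
            dst[i] += src[i];
        }
    }
};

}  // namespace merge_sketch
```

Because every rule here is associative and commutative, per-batch results could be merged in any order and still produce the same totals.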


Parallel Processing Architecture

Statistical analysis uses a TBB parallel pipeline:

Input file → FastqReader → [Source] → [Processing] → [Aggregation] → Output file
                          Serial read  Parallel calc    Serial merge
  1. Source (serial_in_order): reads FastqBatch objects one at a time, in file order
  2. Processing (parallel): calculates statistics for each batch concurrently
  3. Aggregation (serial_in_order): merges the per-batch results
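
The three stages can be sketched with the standard library alone. The example below is not the TBB implementation: it uses std::async as a stand-in for the pipeline filters, and `Batch`, `Result`, `processBatch`, and `runPipeline` are hypothetical names. It shows the same shape, with a serial source bounded by an in-flight cap, a concurrent processing step, and a serial in-order merge.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <future>
#include <string>
#include <vector>

namespace pipeline_sketch {

// Stand-in for FastqBatch: one sequence string per read.
using Batch = std::vector<std::string>;

struct Result {
    std::uint64_t readCount = 0;
    std::uint64_t totalBases = 0;
    Result& operator+=(const Result& other) {
        readCount += other.readCount;
        totalBases += other.totalBases;
        return *this;
    }
};

// Processing stage: depends only on its own batch, so batches can run concurrently.
inline Result processBatch(const Batch& batch) {
    Result result;
    for (const auto& sequence : batch) {
        ++result.readCount;
        result.totalBases += sequence.size();
    }
    return result;
}

// Source -> Processing -> Aggregation, with std::async standing in for TBB filters.
inline Result runPipeline(const std::vector<Batch>& input, std::size_t maxInFlight) {
    if (maxInFlight == 0) {
        maxInFlight = 1;
    }
    Result total;
    std::deque<std::future<Result>> inFlight;
    std::size_t next = 0;
    while (next < input.size() || !inFlight.empty()) {
        // Source (serial): hand out batches until the in-flight cap is reached.
        while (next < input.size() && inFlight.size() < maxInFlight) {
            inFlight.push_back(std::async(std::launch::async, [&input, next] {
                return processBatch(input[next]);
            }));
            ++next;
        }
        // Aggregation (serial, in order): merge the oldest pending result first.
        total += inFlight.front().get();
        inFlight.pop_front();
    }
    return total;
}

}  // namespace pipeline_sketch
```

The in-flight cap plays the role of the token limit in a real pipeline: it bounds how many batches exist at once, which is what keeps memory use predictable.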

Output Format

The statistics task writes a text report by default and can optionally emit a TSV sidecar. Together, the outputs contain:

  • Total reads, max read length, and total bases
  • Base composition (A/T/C/G/N ratio)
  • GC content
  • Position quality distribution (Q20/Q30 percentage)
  • Per-position base counts, average quality, and estimated error rate
  • Duplicate estimate (sampling-based)
  • Bounded head-kmer signatures (enabled via --signature-report)
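
Several of the metrics above follow from standard definitions, which the sketch below illustrates. The Phred relation P = 10^(-Q/10) is standard; how the tool itself aggregates these values (for example, whether N bases count toward the GC denominator) is not stated here, so the helper names and those choices are assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <string>

namespace metrics_sketch {

// Fraction of G and C bases among all bases (0 for an empty sequence).
// Assumption: every base, including N, counts toward the denominator.
inline double gcContent(const std::string& sequence) {
    if (sequence.empty()) {
        return 0.0;
    }
    std::size_t gc = 0;
    for (char base : sequence) {
        if (base == 'G' || base == 'C' || base == 'g' || base == 'c') {
            ++gc;
        }
    }
    return static_cast<double>(gc) / static_cast<double>(sequence.size());
}

// Phred score of one quality character for a given encoding offset (33 for Phred+33).
inline int phredScore(char qualityChar, int qualityEncoding = 33) {
    return static_cast<int>(qualityChar) - qualityEncoding;
}

// Standard Phred relation: P(error) = 10^(-Q/10).
inline double errorProbability(int phred) {
    return std::pow(10.0, -static_cast<double>(phred) / 10.0);
}

// Percentage of quality characters at or above a threshold (20 for Q20, 30 for Q30).
inline double percentAtLeast(const std::string& quality, int threshold,
                             int qualityEncoding = 33) {
    if (quality.empty()) {
        return 0.0;
    }
    std::size_t hits = 0;
    for (char q : quality) {
        if (phredScore(q, qualityEncoding) >= threshold) {
            ++hits;
        }
    }
    return 100.0 * static_cast<double>(hits) / static_cast<double>(quality.size());
}

}  // namespace metrics_sketch
```

With Phred+33 encoding, the character `I` corresponds to Q40 and `5` to Q20, so a Q20 base carries an estimated error probability of 0.01.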

When --signature-report is enabled, the sidecar uses this TSV shape:

```text
metric	key	count
summary	total_reads	<count>
summary	duplicate_estimate	<count>
head_kmer	<kmer>	<count>
```
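
For downstream tooling, the sidecar can be read back with a few lines of code. The parser below is a minimal sketch based only on the three-column shape shown above. `SignatureReport` and `parseSignatureReport` are hypothetical names, and silently skipping malformed or unknown rows is a design choice of this example rather than documented behavior.

```cpp
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

namespace sidecar_sketch {

struct SignatureReport {
    std::map<std::string, std::uint64_t> summary;    // e.g. total_reads
    std::map<std::string, std::uint64_t> headKmers;  // k-mer -> count
};

// Parses metric<TAB>key<TAB>count rows; skips the header and malformed lines.
inline SignatureReport parseSignatureReport(std::istream& in) {
    SignatureReport report;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string metric;
        std::string key;
        std::string count;
        if (!std::getline(fields, metric, '\t') ||
            !std::getline(fields, key, '\t') ||
            !std::getline(fields, count)) {
            continue;  // fewer than three columns
        }
        if (metric == "metric") {
            continue;  // header row
        }
        std::uint64_t value = 0;
        try {
            value = std::stoull(count);
        } catch (const std::exception&) {
            continue;  // count column is not a number
        }
        if (metric == "summary") {
            report.summary[key] = value;
        } else if (metric == "head_kmer") {
            report.headKmers[key] = value;
        }
    }
    return report;
}

}  // namespace sidecar_sketch
```

Passing a std::ifstream opened on the sidecar path works the same way as the string stream used for testing, since the function only depends on std::istream.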

MIT License © LessUp