Core Algorithms

fq-compressor adopts a "Best of Both Worlds" hybrid strategy: each FASTQ data stream is compressed with a specialized algorithm optimized for its characteristics.

1. Sequence: Assembly-based Compression (ABC)

Unlike traditional compressors that treat reads as independent strings, fq-compressor treats them as fragments of an underlying genome.

How It Works

Raw Reads ──► Minimizer Bucketing ──► Read Reordering ──► Local Consensus ──► Delta Encoding

Step 1: Minimizer Bucketing

Reads are grouped into buckets based on shared minimizers (signature k-mers). Reads sharing a minimizer are likely from the same genomic region.

Step 2: Read Reordering

Inside each bucket, reads are reordered to form an approximate Hamiltonian path, minimizing the edit distance between neighboring reads.

Step 3: Local Consensus Generation

A local consensus sequence is generated for each bucket, representing the "average" read in that group.

Step 4: Delta Encoding

Only the edits (substitutions, insertions, deletions) relative to the consensus are stored. Since reads in the same bucket are highly similar, deltas are very small.

Read Length Classification

The ABC strategy is most effective for short reads. fq-compressor automatically classifies reads:

Type	Condition	Strategy
Short	max ≤ 511 bp, median < 1 KB	Full ABC with global reordering
Medium	max > 511 bp or 1 KB ≤ median < 10 KB	Zstd, reordering disabled
Long	median ≥ 10 KB or max ≥ 10 KB	Zstd, reordering disabled

References

Spring: Chandak, S., et al. (2019). "SPRING: a next-generation FASTQ compressor". Bioinformatics.
Mincom: Liu, Y., et al. "Mincom: hash-based minimizer bucketing for genome sequence compression".
HARC: Chandak, S., et al. (2018). "HARC: Haplotype-aware Reordering for Compression".

2. Quality Scores: Statistical Context Mixing (SCM)

Quality scores are noisy and don't "assemble" well — they require a fundamentally different approach.

How It Works

Quality Values ──► Context Modeling ──► Prediction ──► Arithmetic Coding

Context Modeling

The compressor predicts the next quality value based on a context composed of:

Previous 2 quality values
Current sequence base (A/C/G/T)
Position within the read

Adaptive Arithmetic Coding

Predictions are fed into an adaptive arithmetic coder that approaches the entropy limit of the quality stream.

Why Not Compress Quality with ABC?

Quality scores are inherently noisy and don't exhibit the same biological redundancy as DNA sequences. Context-based statistical modeling is more effective because:

Quality scores are position-dependent
They correlate with the underlying base call
Neighboring quality values are strongly correlated

References

fqzcomp5: Bonfield, J. K. "Compression of FASTQ and SAM format sequencing data".

3. Identifiers: Tokenization + Delta

FASTQ read identifiers follow structured patterns (e.g., @INSTRUMENT:RUN:FLOWCELL:LANE:TILE:X:Y).

How It Works

Tokenization — Split identifiers into typed tokens (instrument, run ID, coordinates, etc.)
Delta Encoding — For numeric fields, store only the difference from the previous read
Backend Compression — Tokenized streams are compressed with a general-purpose compressor

This approach exploits the highly regular structure of Illumina-style identifiers, where consecutive reads differ only in coordinate fields.

Core Algorithms

Core Algorithms

1. Sequence: Assembly-based Compression (ABC)

How It Works

Step 1: Minimizer Bucketing

Step 2: Read Reordering

Step 3: Local Consensus Generation

Step 4: Delta Encoding

Read Length Classification

References

2. Quality Scores: Statistical Context Mixing (SCM)

How It Works

Context Modeling

Adaptive Arithmetic Coding

Why Not Compress Quality with ABC?

References

3. Identifiers: Tokenization + Delta

How It Works

results matching ""

No results matching ""