
Core Algorithms

fq-compressor uses a hybrid compression strategy optimized for different data streams.

Overview

Stream    Algorithm    Description
Sequence  ABC          Assembly-based Compression
Quality   SCM          Statistical Context Mixing
ID        Delta+Token  Tokenization and delta encoding

ABC: Assembly-based Compression

Principle

Treats reads as genome fragments rather than independent strings. Exploits biological redundancy by:

  1. Grouping similar reads
  2. Generating consensus sequences
  3. Encoding differences (deltas)

Pipeline

Input Reads
  ↓
1. Minimizer Indexing (k=21, w=11)
  ↓
2. Bucketing (by shared minimizers)
  ↓
3. Reordering (TSP approximation)
  ↓
4. Consensus Generation
  ↓
5. Delta Encoding
  ↓
Compressed Output

1. Minimizer Indexing

A minimizer is the lexicographically smallest k-mer in a window of w consecutive k-mers.

for i in range(len(kmers) - w + 1):    # slide over windows of w consecutive k-mers
    minimizer = min(kmers[i:i + w])    # lexicographically smallest k-mer in the window
    index[minimizer].add(read_id)

Parameters:

  • k = 21: k-mer size
  • w = 11: window size

Benefits:

  • Reduces memory vs storing all k-mers
  • Maintains sensitivity for similar reads
  • Biologically relevant (minimizers cluster)
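
A minimal, self-contained sketch of the indexing step in Python (the helper names and the defaultdict-based index are illustrative, not fq-compressor's actual API):

from collections import defaultdict

def kmers(seq, k=21):
    # All overlapping k-mers of a read.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_minimizer_index(reads, k=21, w=11):
    # Map each minimizer to the set of read IDs that contain it.
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        ks = kmers(seq, k)
        for i in range(max(len(ks) - w + 1, 1)):
            window = ks[i:i + w]
            if window:
                index[min(window)].add(read_id)   # lexicographically smallest k-mer
    return index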

2. Bucketing

Reads sharing minimizers are grouped. Target bucket size: ~10,000 reads.

Why:

  • Same genomic region → shared minimizers
  • Local consensus more accurate than global
  • Enables parallel per-bucket processing
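
A minimal bucketing sketch, assuming each read is assigned to the bucket of its smallest minimizer; balancing buckets around the ~10,000-read target is omitted here:

from collections import defaultdict

def bucket_reads(reads, minimizers_of):
    # minimizers_of(seq) is assumed to return the set of minimizers of a read,
    # e.g. derived from the index built in the previous step.
    buckets = defaultdict(list)
    for read_id, seq in enumerate(reads):
        mins = set(minimizers_of(seq))
        key = min(mins) if mins else None   # None collects reads with no minimizer
        buckets[key].append(read_id)
    return buckets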

3. Reordering (TSP)

Greedy approximation to minimize edit distance between consecutive reads:

def reorder(bucket):
    # Greedy nearest-neighbour heuristic: start from an arbitrary read, then
    # repeatedly append the remaining read closest (by edit distance) to the last one.
    ordered = [bucket.pop()]   # arbitrary starting read
    while bucket:
        last = ordered[-1]
        next_read = min(bucket, key=lambda r: edit_distance(last, r))
        bucket.remove(next_read)
        ordered.append(next_read)
    return ordered

Complexity: O(n²) edit-distance comparisons per bucket (n = reads in the bucket)
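
edit_distance is not defined in this section; a plain dynamic-programming Levenshtein distance such as the sketch below would work, though a real implementation would likely use a banded or vectorized variant:

def edit_distance(a, b):
    # Classic O(len(a) * len(b)) Levenshtein distance using a rolling row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]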

4. Consensus Generation

Position-wise majority vote from aligned reads:

from collections import Counter

def generate_consensus(ordered_reads):
    alignment = multi_align(ordered_reads)        # multiple sequence alignment of the bucket
    consensus = []
    for position in alignment.columns:            # walk the alignment column by column
        counts = Counter(position.bases)
        consensus.append(counts.most_common(1)[0][0])   # majority base at this column
    return ''.join(consensus)
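
Since multi_align is not shown here, the same majority vote can be sketched for equal-length, already-comparable reads without an explicit alignment step (a simplification, not the actual pipeline):

from collections import Counter

def simple_consensus(reads):
    # Assumes all reads have the same length; column i holds the i-th base of every read.
    return ''.join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

simple_consensus(["ACGT", "ACGA", "ACGT"])   # -> "ACGT"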

5. Delta Encoding

Encode reads as edits from consensus:

enum class EditType { Match, Subst, Insert, Delete };
 
struct Edit {
    EditType type;
    uint32_t position;
    char base;  // For Subst/Insert
};

Example:

Consensus: ACGTACGTACGTACGT
Read:      ACGTAAGTACGTACGT

Edits: [6]: Subst C → A

Ratio: Typically 3-5× for genomic sequences
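
A substitution-only sketch of delta encoding against the consensus, using 1-based positions to match the example above (insertions and deletions are omitted):

def delta_encode(consensus, read):
    # Assumes read and consensus are already aligned and of equal length.
    return [(pos, base)
            for pos, (ref, base) in enumerate(zip(consensus, read), start=1)
            if ref != base]

delta_encode("ACGTACGTACGTACGT", "ACGTAAGTACGTACGT")   # -> [(6, 'A')]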


SCM: Statistical Context Mixing

Principle

High-order context modeling with arithmetic coding for quality scores.

Context Modeling

Predicts next quality value based on:

  • Previous 2 QVs
  • Current sequence base
  • Position in read

struct Context {
    uint8_t qv_minus_1;       // previous quality value
    uint8_t qv_minus_2;       // quality value two positions back
    char current_base;        // base at the current position
    uint8_t position_bucket;  // coarse bucket of the position within the read
};
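
A sketch of how such a context could be packed into a single integer key for a counts table (the bit layout and bucket count are assumptions for illustration):

def context_key(qv_prev1, qv_prev2, base, pos, read_len, num_pos_buckets=8):
    # Pack (QV[i-1], QV[i-2], current base, coarse position) into one lookup key.
    base_code = {"A": 0, "C": 1, "G": 2, "T": 3}.get(base, 4)   # 4 covers N / other
    pos_bucket = min(pos * num_pos_buckets // max(read_len, 1), num_pos_buckets - 1)
    return ((qv_prev1 & 0x3F) << 14) | ((qv_prev2 & 0x3F) << 8) | (base_code << 5) | pos_bucket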

Model Mixing

Combines predictions from multiple context orders:

# Order-0: P(Q_i)
# Order-1: P(Q_i | Q_{i-1})
# Order-2: P(Q_i | Q_{i-1}, Q_{i-2})
# Order-s: P(Q_i | sequence_context)
 
final = sum(weight[i] * model[i].predict(context[i])
            for i in range(num_models))
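
A minimal sketch of the mixing step with simple normalized weights (the actual mixer and its weight-update rule may differ, e.g. logistic mixing):

def mix(models, contexts, num_symbols=64):
    # Each model exposes predict(context) -> list of probabilities over quality values,
    # plus a scalar weight reflecting how well it has predicted so far.
    preds = [m.predict(c) for m, c in zip(models, contexts)]
    weights = [m.weight for m in models]
    total = sum(weights)
    return [sum(w * p[s] for w, p in zip(weights, preds)) / total
            for s in range(num_symbols)]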

Arithmetic Coding

Converts predictions to compressed bits:

Q values: [33, 34, 35, 36, 37]
Counts:   [100, 200, 400, 200, 100]

P(35) = 400/1000 = 0.4
Bits(35) = -log2(0.4) ≈ 1.32 bits
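
The same numbers can be reproduced directly from the symbol counts; the helper below computes the ideal code length that an arithmetic coder approaches (an illustrative helper, not the coder itself):

import math

def ideal_bits(counts, symbol):
    # -log2(P(symbol)): bits an ideal arithmetic coder spends on this symbol.
    return -math.log2(counts[symbol] / sum(counts.values()))

counts = {33: 100, 34: 200, 35: 400, 36: 200, 37: 100}
round(ideal_bits(counts, 35), 2)   # -> 1.32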

Quality Binning (Optional)

Lossy compression via binning:

Method      Bins  Bits/QV  Impact
Lossless    94    ~5       None
Illumina 8  8     3        Minimal
4-bin       4     2        Moderate
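
A sketch of a 4-bin lossy mapping (the boundaries and representative values here are illustrative assumptions, not fq-compressor's exact bins):

def bin_quality_4(qv):
    # Map a Phred quality value to one of four representative values.
    if qv < 10:
        return 6
    if qv < 25:
        return 15
    if qv < 35:
        return 30
    return 40

[bin_quality_4(q) for q in (2, 12, 28, 38)]   # -> [6, 15, 30, 40]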

ID Compression

Tokenization

Split read IDs on pattern transitions:

@E150035817L1C001R0010000000/1
→ ['E', '150035817', 'L', '1', 'C', '001', 'R', '001', '0000000', '/', '1']
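
A tokenizer along these lines can be sketched with a regular expression over letter/digit/punctuation transitions; note that a purely regex-based split keeps the trailing digit run together, whereas the example above further splits '0010000000' into two fixed-width fields using knowledge of the ID format (the regex and the '@' handling are assumptions):

import re

def tokenize_id(read_id):
    # Split on transitions between letter runs, digit runs, and punctuation.
    return re.findall(r"[A-Za-z]+|\d+|[^A-Za-z\d]", read_id.lstrip("@"))

tokenize_id("@E150035817L1C001R0010000000/1")
# -> ['E', '150035817', 'L', '1', 'C', '001', 'R', '0010000000', '/', '1']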

Delta Encoding

For monotonic numeric tokens:

Original: [001, 002, 003, 004, 005]
Delta:    [001, 001, 001, 001, 001]
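
A minimal sketch of delta-encoding a monotonic numeric token stream (zero-padding/width bookkeeping is omitted):

def delta_encode_tokens(values):
    # First value verbatim, then differences from the previous value.
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

delta_encode_tokens([1, 2, 3, 4, 5])   # -> [1, 1, 1, 1, 1]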

Dictionary

Frequently occurring string tokens are stored in a dictionary, so each occurrence can be replaced by a short dictionary index.
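
A minimal sketch of building such a dictionary (the frequency threshold is an illustrative assumption):

from collections import Counter

def build_token_dictionary(token_stream, min_count=16):
    # Tokens seen at least min_count times get a short integer code.
    freq = Counter(token_stream)
    common = [tok for tok, n in freq.most_common() if n >= min_count]
    return {tok: code for code, tok in enumerate(common)}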


Performance Summary

Algorithm  Speed     Ratio  Memory  Parallel
ABC        ~10 MB/s  3-5×   Medium  Bucket-level
SCM        ~15 MB/s  2-3×   Low     Yes
ID         ~50 MB/s  5-10×  Low     Yes

Combined: ~11 MB/s compression, ~4× typical ratio