FQC File Format

The .fqc format is fq-compressor's custom archive format, designed for high compression, random access, and parallel processing.

Format Layout

┌─────────────────────────────────────────────┐
│  File Header (64 bytes)                      │
│  ├─ Magic Number (8 bytes): "FQC\x00\x01"   │
│  ├─ Version                                  │
│  ├─ Flags (PE, reordered, etc.)              │
│  └─ Global metadata                          │
├─────────────────────────────────────────────┤
│  Block 1 (e.g., 10 MB uncompressed)          │
│  ├─ Block Header (32 bytes)                  │
│  │   ├─ Block size / CRC32                   │
│  │   └─ Read count in block                  │
│  ├─ ID Stream (Delta + Tokenized)            │
│  ├─ Sequence Stream (ABC Encoded)            │
│  └─ Quality Stream (SCM Encoded)             │
├─────────────────────────────────────────────┤
│  Block 2                                     │
├─────────────────────────────────────────────┤
│  ...                                         │
├─────────────────────────────────────────────┤
│  Reorder Map (Optional)                      │
│  ├─ ZigZag varint encoded                    │
│  └─ Maps compressed order → original order   │
├─────────────────────────────────────────────┤
│  Block Index                                 │
│  ├─ Block ID → file offset mapping           │
│  └─ Enables O(1) random access               │
├─────────────────────────────────────────────┤
│  File Footer (32 bytes)                      │
│  ├─ Total read count                         │
│  ├─ Index offset                             │
│  └─ File checksum                            │
└─────────────────────────────────────────────┘
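
The layout above implies that a reader starts at the tail of the file: it reads the fixed-size footer, then follows the index offset to the Block Index. The sketch below illustrates that first step. Only the 32-byte footer size and the three field names come from the layout; the field widths, order, endianness, and padding in `FOOTER_STRUCT` are assumptions made for illustration, and `read_footer` is a hypothetical helper, not part of fq-compressor.

```python
import io
import struct

FOOTER_SIZE = 32  # from the layout above

# Hypothetical field layout: three little-endian u64 values plus 8 padding
# bytes. The real widths, order, and endianness may differ.
FOOTER_STRUCT = struct.Struct("<QQQ8x")  # read count, index offset, checksum


def read_footer(f):
    """Return (total_reads, index_offset, checksum) from the last 32 bytes."""
    f.seek(-FOOTER_SIZE, io.SEEK_END)
    return FOOTER_STRUCT.unpack(f.read(FOOTER_SIZE))


# Build a toy in-memory "file": 64-byte header, some payload, index, footer.
toy = io.BytesIO()
toy.write(b"\x00" * 64)
toy.write(b"payload")
index_offset = toy.tell()
toy.write(b"index")
toy.write(FOOTER_STRUCT.pack(30000, index_offset, 0xDEADBEEF))

total_reads, found_offset, checksum = read_footer(toy)
```

Placing the index and footer at the end lets a writer stream blocks out sequentially and record their offsets only once they are known.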

Design Rationale

Columnar Stream Separation

Within each block, data is physically separated into three independent streams:

Stream	Content	Compression
ID Stream	Read identifiers	Tokenization + Delta
Sequence Stream	DNA bases (A/C/G/T/N)	ABC or Zstd
Quality Stream	Phred quality scores	SCM arithmetic coding

Because the streams are stored separately, each one can be compressed with an algorithm suited to its data type instead of a single general-purpose codec.
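
As a rough illustration of the separation itself, the sketch below splits FASTQ text into the three columns, compresses each on its own, and rebuilds the original text. Here `zlib` is only a stand-in for the real per-stream codecs (tokenization + delta, ABC, SCM), and `split_streams` / `join_streams` are hypothetical helper names. It also assumes four-line records with a bare `+` separator line.

```python
import zlib

fastq = (
    "@read1\nACGTACGT\n+\nIIIIHHHH\n"
    "@read2\nACGTTCGT\n+\nIIIHHHGG\n"
)


def split_streams(text):
    """Split FASTQ text into separate ID, sequence, and quality columns."""
    lines = text.splitlines()
    return lines[0::4], lines[1::4], lines[3::4]


def join_streams(ids, seqs, quals):
    """Rebuild FASTQ text from the three columns."""
    return "".join(f"{i}\n{s}\n+\n{q}\n" for i, s, q in zip(ids, seqs, quals))


ids, seqs, quals = split_streams(fastq)

# zlib stands in for the real per-stream codecs; each column is
# compressed independently of the other two.
compressed = {
    name: zlib.compress("\n".join(column).encode())
    for name, column in (("id", ids), ("seq", seqs), ("qual", quals))
}

restored = join_streams(
    *(zlib.decompress(compressed[name]).decode().split("\n")
      for name in ("id", "seq", "qual"))
)
```

Swapping the `zlib.compress` call for a stream-specific encoder is the only change needed per column, which is the point of keeping the columns physically apart.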

Block Independence

Each block is fully self-contained:

- Compression models are reset at block boundaries
- Blocks can be decompressed in parallel without dependencies
- A corrupted block doesn't affect other blocks
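
A minimal sketch of these properties follows, with `zlib` again standing in for the real block codec. Each block gets a fresh compressor and its own CRC32, so blocks can be decoded concurrently and a damaged block is reported without affecting its neighbours. Computing the CRC over the uncompressed payload is an assumption; the format only states that the block header carries a CRC32. `pack_block` and `unpack_block` are hypothetical names.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def pack_block(payload):
    """Compress one block with a fresh compressor and record its CRC32."""
    return zlib.crc32(payload), zlib.compress(payload)


def unpack_block(block):
    """Decompress one block; return None if it fails its integrity check."""
    crc, data = block
    try:
        payload = zlib.decompress(data)
    except zlib.error:
        return None
    return payload if zlib.crc32(payload) == crc else None


payloads = [b"ACGT" * 100, b"TTGA" * 100, b"GGCC" * 100]
blocks = [pack_block(p) for p in payloads]

# Simulate corruption of the middle block by storing a wrong CRC.
crc, data = blocks[1]
blocks[1] = (crc ^ 1, data)

# No block depends on another, so they can be decoded concurrently.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(unpack_block, blocks))
```

In this toy run the first and third blocks decode normally while the middle one comes back as `None`.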

Random Access

The block index at the end of the file maps logical read ranges to file offsets:

Read 0-9999     → Block 0 at offset 64
Read 10000-19999 → Block 1 at offset 1048640
Read 20000-29999 → Block 2 at offset 2097216
...

To access reads 15000-16000, fq-compressor:

1. Looks up Block 1 in the index in O(1) time
2. Seeks to the block offset
3. Decompresses only Block 1
4. Extracts the requested reads
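
The lookup can be sketched as follows, using the offsets from the example index above. The sketch assumes every block holds the same number of reads, which is what makes the block ID a single integer division; if read counts varied per block, a binary search over the first-read column would be needed instead, costing O(log n). `locate` is a hypothetical helper name.

```python
READS_PER_BLOCK = 10_000  # matches the example index; assumed fixed

# Block ID -> file offset, taken from the example index.
block_offsets = [64, 1_048_640, 2_097_216]


def locate(first_read, last_read):
    """Return (block_id, offset, skip, count) for an inclusive read range.

    skip is the number of reads to discard at the start of the block and
    count is the number of reads to keep after that.
    """
    spans = []
    for block_id in range(first_read // READS_PER_BLOCK,
                          last_read // READS_PER_BLOCK + 1):
        block_start = block_id * READS_PER_BLOCK
        lo = max(first_read, block_start)
        hi = min(last_read, block_start + READS_PER_BLOCK - 1)
        spans.append(
            (block_id, block_offsets[block_id], lo - block_start, hi - lo + 1)
        )
    return spans


spans = locate(15_000, 16_000)
```

A range that crosses a block boundary simply yields one entry per block touched, so only those blocks need to be read and decompressed.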

Reorder Map

When ABC reordering is enabled, the original read order is preserved in a Reorder Map:

- Encoded with ZigZag varint for compact storage
- Allows restoring the original FASTQ order after decompression
- Can be skipped if order preservation is not needed
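
The encoding can be illustrated with a small sketch. ZigZag maps signed integers to unsigned ones so that values near zero, positive or negative, fit in a single varint byte. Encoding the deltas between consecutive map entries, rather than the entries themselves, is an assumption here; it is the usual reason for pairing ZigZag with varints, but the actual on-disk scheme may differ. All function names are hypothetical.

```python
def zigzag(n):
    """Map signed to unsigned: 0, -1, 1, -2, 2 ... -> 0, 1, 2, 3, 4 ..."""
    return n * 2 if n >= 0 else -n * 2 - 1


def unzigzag(u):
    """Inverse of zigzag."""
    return (u >> 1) ^ -(u & 1)


def encode_varint(u, out):
    """Append u as a base-128 varint, least significant 7 bits first."""
    while u >= 0x80:
        out.append((u & 0x7F) | 0x80)
        u >>= 7
    out.append(u)


def encode_map(reorder):
    """Encode the map as ZigZag varints of deltas between entries (assumed)."""
    out = bytearray()
    prev = 0
    for value in reorder:
        encode_varint(zigzag(value - prev), out)
        prev = value
    return bytes(out)


def decode_map(data):
    """Decode bytes produced by encode_map back into the list of positions."""
    reorder, prev, u, shift = [], 0, 0, 0
    for byte in data:
        u |= (byte & 0x7F) << shift
        if byte & 0x80:          # continuation bit set: more bytes follow
            shift += 7
        else:                    # last byte of this varint
            prev += unzigzag(u)
            reorder.append(prev)
            u, shift = 0, 0
    return reorder


# reorder[i] is the original position of the i-th read in compressed order.
reorder = [2, 0, 3, 1]
decoded_reads = ["r2", "r0", "r3", "r1"]  # as they come out of the blocks

encoded = encode_map(reorder)
restored = [None] * len(decoded_reads)
for compressed_pos, original_pos in enumerate(decode_map(encoded)):
    restored[original_pos] = decoded_reads[compressed_pos]
```

If the original order is not needed, the decoder can skip this step and emit reads in compressed order.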

Why Not Use an Existing Format?

Format	Limitation
BAM/CRAM	Designed for aligned data; CRAM's best compression depends on a reference genome
BGZF	Block-level compression only; no domain-specific modeling
gzip/xz	No random access; no columnar separation

The .fqc format combines block-level random access and per-stream compression with domain-specific modeling for raw, unaligned FASTQ data, without requiring a reference genome.
