enTaxonomic Profiling

Taxonomic Profiling

Complete guide to taxonomic classification and diversity analysis using MICOS-2024.


Table of Contents


Overview

The taxonomic profiling module identifies and quantifies the microbial species present in your metagenomic samples. It integrates quality control, sequence classification, and diversity analysis into a comprehensive pipeline.

Key Features

  • High-speed classification: Kraken2 processes ~10M reads/minute
  • Accurate abundance estimation: Bracken improves species-level quantification
  • Interactive visualization: Krona charts for exploring taxonomic composition
  • Diversity metrics: Alpha and beta diversity via QIIME2
  • Statistical analysis: Differential abundance testing

Methodology

Classification Algorithm (Kraken2)

Kraken2 uses k-mer based classification:

  1. Database Construction: Reference genomes are broken into k-mers (default k=35)
  2. Minimizer Indexing: Reduces memory usage by storing minimizers only
  3. Classification: Input reads matched against database using LCA (Lowest Common Ancestor) algorithm
  4. Confidence Scoring: Classification confidence based on k-mer coverage
Read: ATCGATCGATCGATCGATCGATCG...

K-mers: [ATCGATCGATCGATC], [TCGATCGATCGATCG], ...

Database lookup → LCA assignment → Confidence score

Abundance Correction (Bracken)

Bracken corrects for:

  • Genome size bias: Larger genomes contribute more reads
  • Database coverage: Incomplete reference databases
  • Strain variation: Multiple strains of same species

Taxonomic Ranks

RankCodeExample
DomainDBacteria
PhylumPProteobacteria
ClassCGammaproteobacteria
OrderOEnterobacterales
FamilyFEnterobacteriaceae
GenusGEscherichia
SpeciesSEscherichia coli

Workflow

Step-by-Step Process

  1. Quality Control: Remove host DNA and low-quality reads
  2. Classification: Assign reads to taxa using Kraken2
  3. Correction: Estimate true abundances with Bracken
  4. Visualization: Generate Krona interactive charts
  5. Diversity Analysis: Calculate alpha/beta diversity metrics

Input Requirements

Data Format

FormatExtensionPaired/Single
FASTQ (gzip).fastq.gzPaired or Single
FASTQ.fastqPaired or Single

Naming Convention

Paired-end:
  Sample1_R1.fastq.gz
  Sample1_R2.fastq.gz

Single-end:
  Sample1.fastq.gz

Quality Requirements

MetricMinimumRecommended
Read length≥ 50 bp≥ 100 bp
Quality score (Q30)> 70%> 85%
Depth per sample1M reads10M+ reads

Database Requirements

DatabaseSizeDownload
Kraken2 Standard~70 GBAWS Index
Kraken2 PlusPF~100 GBIncludes fungi & protozoa
MiniKraken2~8 GBFor testing

Running the Analysis

Option 1: Complete Pipeline

python -m micos.cli full-run \
  --input-dir data/raw_input \
  --results-dir results \
  --threads 16 \
  --kneaddata-db /path/to/kneaddata_db \
  --kraken2-db /path/to/kraken2_db

Option 2: Taxonomic Profiling Only

python -m micos.cli run taxonomic-profiling \
  --input-dir data/cleaned_reads \
  --output-dir results/taxonomy \
  --threads 16 \
  --kraken2-db /path/to/kraken2_db \
  --confidence 0.1

Option 3: Manual Steps

# Step 1: Run Kraken2
kraken2 --db /path/to/kraken2_db \
  --paired sample_R1.fastq sample_R2.fastq \
  --output sample.kraken \
  --report sample.report \
  --confidence 0.1 \
  --threads 16
 
# Step 2: Run Bracken
bracken -d /path/to/kraken2_db \
  -i sample.report \
  -o sample.bracken \
  -r 150 \
  -l S
 
# Step 3: Convert to BIOM
kraken-biom *.report --fmt hdf5 -o feature-table.biom
 
# Step 4: Generate Krona
ktImportTaxonomy -o sample.krona.html sample.report

Parameter Configuration

Kraken2 Parameters

taxonomic_profiling:
  kraken2:
    # Classification confidence (0.0 - 1.0)
    # Higher = more stringent, fewer classified reads
    confidence: 0.1
    
    # Minimum base quality for k-mer matching
    min_base_quality: 20
    
    # Minimum hit groups required
    min_hit_groups: 2
    
    # Use memory-mapped database (faster, more RAM)
    memory_mapping: true
    
    # Include scientific names in output
    use_names: true

Confidence Threshold Selection

ValueSensitivityPrecisionUse Case
0.0Very HighLowerExploratory analysis
0.1HighGoodDefault, balanced
0.3ModerateHighConservative analysis
0.5LowVery HighHigh confidence only

Bracken Parameters

taxonomic_profiling:
  bracken:
    enabled: true
    
    # Read length (matches your data)
    read_length: 150
    
    # Taxonomic level for abundance
    # D, P, C, O, F, G, S
    level: "S"
    
    # Threshold for Bracken re-estimation
    threshold: 10

Output Files

Directory Structure

results/taxonomic_profiling/
├── raw/
│   ├── sample1.kraken      # Raw classification output
│   └── sample1.report      # Taxonomic report
├── bracken/
│   └── sample1.bracken     # Corrected abundances
├── krona/
│   └── sample1.krona.html  # Interactive visualization
└── biom/
    └── feature-table.biom  # QIIME2-compatible table

Kraken2 Report Format

ColumnDescriptionExample
1Percentage of reads56.32
2Number of reads (clade)563200
3Number of reads (direct)10000
4Rank codeS
5NCBI TaxID562
6Scientific nameEscherichia coli

Rank codes: D=Domain, P=Phylum, C=Class, O=Order, F=Family, G=Genus, S=Species

Bracken Output Format

ColumnDescription
nameTaxonomic name
taxonomy_idNCBI TaxID
taxonomy_lvlTaxonomic level
kraken_assigned_readsRaw Kraken2 counts
added_readsEstimated additional reads
new_est_readsTotal estimated reads
fraction_totalProportion of total

BIOM Format

The BIOM file contains:

  • Observation IDs: Taxa (e.g., k__Bacteria; p__Proteobacteria; …)
  • Sample IDs: Your sample names
  • Data: Count matrix (samples × taxa)

Result Interpretation

Quality Metrics

Classification Rate

# Calculate classification percentage
# Look for: >70% classified at species level
 
grep -E "^\s+[0-9]+\s+[0-9]+\s+[0-9]+\s+[URS]" sample.report | \
  awk '{sum+=$2} END {print "Classified reads:", sum}'
Classification RateInterpretationAction
< 50%High unclassifiedCheck database coverage, consider lower confidence
50-70%ModerateAcceptable for most analyses
70-90%GoodStandard performance
> 90%ExcellentHigh-quality data and good database coverage

Taxonomic Composition

# Top genera
grep "^G" sample.report | sort -k1,1nr | head -10
 
# Top species
grep "^S" sample.report | sort -k1,1nr | head -10

Krona Visualization

The Krona HTML allows:

  • Exploring taxonomic hierarchy: Click to zoom in/out
  • Comparing samples: Multiple reports in one chart
  • Exporting: SVG/PNG export for publication

Diversity Metrics

Alpha Diversity (Within Sample)

MetricMeasuresRangeNotes
ShannonDiversity (richness + evenness)0 - ~7Higher = more diverse
Chao1Richness estimator0+Sensitive to rare taxa
SimpsonEvenness0-1Lower = more even
Observed FeaturesRaw richness0+Number of taxa

Typical values for human gut:

  • Shannon: 2.5-4.5
  • Observed species: 50-200

Beta Diversity (Between Samples)

MetricTypeUse Case
Bray-CurtisAbundance-basedCommunity composition
JaccardPresence/absenceSpecies overlap
UniFracPhylogeneticEvolutionary distance
Weighted UniFracAbundance + phylogenyCombined view

Advanced Topics

Custom Database Creation

# Build custom Kraken2 database
kraken2-build --download-taxonomy --db custom_db
kraken2-build --add-to-library custom_genome.fa --db custom_db
kraken2-build --build --threads 16 --db custom_db

Strain-Level Analysis

# Use higher-confidence threshold for strain analysis
taxonomic_profiling:
  kraken2:
    confidence: 0.3
    use_names: true

Time-Series Analysis

# Run with metadata for longitudinal analysis
python -m micos.cli run diversity-analysis \
  --input-biom feature-table.biom \
  --metadata metadata.tsv \
  --output-dir results/diversity

Troubleshooting

Issue: Low Classification Rate

Symptoms: <50% reads classified

Solutions:

# 1. Lower confidence threshold
taxonomic_profiling:
  kraken2:
    confidence: 0.05
 
# 2. Use larger database
# Switch from MiniKraken to Standard
 
# 3. Check data quality
# Run FastQC on input reads

Issue: Database Loading Slow

Symptoms: Kraken2 takes long time to start

Solutions:

# Use memory mapping
taxonomic_profiling:
  kraken2:
    memory_mapping: true
 
# Or use SSD for database storage

Issue: Empty Krona Charts

Symptoms: Krona HTML shows no data

Solutions:

# Check Kraken2 report format
head -5 sample.report
 
# Regenerate Krona
ktImportTaxonomy sample.report -o sample.krona.html

See Also