Functional Profiling
Complete guide to functional annotation and pathway analysis using HUMAnN within MICOS-2024.
Table of Contents
- Overview
- Methodology
- Input Requirements
- Running the Analysis
- Parameter Configuration
- Output Files
- Result Interpretation
- Downstream Analysis
- Advanced Topics
- Troubleshooting
Overview
Functional profiling characterizes the metabolic potential of microbial communities by quantifying gene families and metabolic pathways. Unlike taxonomic profiling which answers “who is there?”, functional profiling answers “what can they do?”
Key Features
- Gene family quantification: UniRef90 protein clusters
- Pathway analysis: MetaCyc metabolic pathways
- Species stratification: Attribute functions to specific taxa
- Multi-sample integration: Compare functional profiles across samples
Methodology
HUMAnN Analysis Pipeline
Input Reads
↓
[MetaPhlAn] → Taxonomic profile
↓
[Mapping] → Split reads by species
↓
[ChocoPhlAn] → Align to pangenomes (nucleotide)
↓
[UniRef90] → Align unmapped to proteins
↓
[Pathway Reconstruction] → MinPath + gap filling
↓
Gene Families + Pathway Abundance + CoverageAlgorithm Overview
-
Taxonomic Profiling (MetaPhlAn)
- Quickly identify species present
- Guide nucleotide search strategy
-
Nucleotide-level Search (ChocoPhlAn)
- Map reads to species-specific gene families
- High sensitivity for known species
-
Protein-level Search (UniRef90)
- Diamond alignment for unmapped reads
- Detect novel functions
-
Pathway Reconstruction (MinPath)
- Identify complete metabolic pathways
- Conservative: pathway must have all required reactions
Functional Ontologies
| Level | Database | Description |
|---|---|---|
| Gene Families | UniRef90 | Groups of protein sequences (>90% identical) |
| Pathways | MetaCyc | Curated metabolic pathways |
| Modules | KEGG | Functional units of metabolism |
Input Requirements
Input Data
| Source | Format | Description |
|---|---|---|
| KneadData output | FASTQ | Quality-controlled, host-removed reads |
| Any cleaned reads | FASTQ/FASTA | Adapter-trimmed, quality-filtered |
File Structure
input_dir/
├── sample1_paired_1.fastq.gz
├── sample1_paired_2.fastq.gz
├── sample1_unmatched_1.fastq.gz
├── sample1_unmatched_2.fastq.gz
├── sample2_paired_1.fastq.gz
└── ...Note: HUMAnN automatically detects paired and unpaired reads.
Database Requirements
| Database | Size | Description |
|---|---|---|
| ChocoPhlAn | ~10 GB | Nucleotide pangenome database |
| UniRef90 | ~20 GB | Protein families (>90% identity) |
| UniRef50 | ~5 GB | Reduced protein families (>50% identity) |
| MetaCyc | Included | Metabolic pathway definitions |
Storage tip: Use UniRef50 for faster processing, UniRef90 for comprehensive results.
Running the Analysis
Option 1: MICOS CLI
# Functional annotation only
python -m micos.cli run functional-annotation \
--input-dir results/quality_control/kneaddata \
--output-dir results/functional_annotation \
--threads 16
# As part of full pipeline
python -m micos.cli full-run \
--input-dir data/raw_input \
--results-dir results \
--threads 16 \
--kneaddata-db /db/kneaddata \
--kraken2-db /db/kraken2Option 2: Direct HUMAnN
# Basic run
humann --input sample.fastq \
--output output_dir/ \
--nucleotide-database /db/chocophlan \
--protein-database /db/uniref90 \
--threads 16
# Separate steps for control
# Step 1: Gene family quantification
humann --input sample.fastq \
--output output_dir/ \
--bypass-translated-search
# Step 2: Regroup to other ontologies
humann_regroup_table \
-i sample_genefamilies.tsv \
-g uniref90_ko \
-o sample_ko.tsv
# Step 3: Pathway annotation
humann --input sample.fastq \
--output output_dir/ \
--resume \
--gap-fill on \
--minpath onParameter Configuration
HUMAnN Configuration
functional_annotation:
enabled: true
humann:
enabled: true
threads: 16
# Database paths
nucleotide_database: "${paths.databases}/humann/chocophlan"
protein_database: "${paths.databases}/humann/uniref90"
# Search options
search_mode: "diamond" # diamond or usearch
# Sensitivity vs speed trade-off
diamond_options: "--mid-sensitive"
# Pathway options
pathway_coverage: true
gap_fill: true # Fill pathway gaps
minpath: true # Use MinPath for pathway selection
# Output options
remove_temp: trueDatabase Selection Guide
| Database | Time | Sensitivity | Use Case |
|---|---|---|---|
| ChocoPhlAn only | Fast | Species-dependent | Known species abundant |
| + UniRef50 | Moderate | Good | Balance speed/coverage |
| + UniRef90 | Slow | Best | Maximum sensitivity |
Performance Tuning
functional_annotation:
humann:
# Fast mode (for large datasets)
threads: 32
diamond_options: "--fast"
# Comprehensive mode (for publication)
threads: 16
diamond_options: "--sensitive"
gap_fill: true
pathway_coverage: trueOutput Files
Directory Structure
results/functional_annotation/
├── sample_genefamilies.tsv # Gene family abundances
├── sample_genefamilies-cpm.tsv # Normalized to CPM
├── sample_pathabundance.tsv # Pathway abundances
├── sample_pathcoverage.tsv # Pathway coverage
├── sample_pathabundance-cpm.tsv # Normalized pathway
└── sample.log # Run logGene Families File
| Column | Description | Example |
|---|---|---|
| # Gene Family | UniRef90 cluster | UniRef90_A0A0A0MQD6 |
| Sample1 | Abundance (RPK) | 45.23 |
| Sample2 | Abundance (RPK) | 12.56 |
Special entries:
UNMAPPED: Reads not matching any geneUNGROUPED: Genes not in any UniRef cluster
Pathway Abundance File
| Column | Description | Example |
|---|---|---|
| # Pathway | MetaCyc pathway | GLYCOLYSIS |
| Sample1 | Abundance | 145.2 |
Stratified format: PATHWAY|Species shows species contribution
Example:
GLYCOLYSIS|g__Bacteroides.s__Bacteroides_thetaiotaomicron 45.2
GLYCOLYSIS|g__Faecalibacterium.s__Faecalibacterium_prausnitzii 32.1
GLYCOLYSIS|unclassified 12.5Pathway Coverage File
Coverage indicates pathway completeness (0-1 scale):
| Coverage | Interpretation |
|---|---|
| 1.0 | Complete pathway present |
| 0.5-0.9 | Partial pathway |
| < 0.5 | Fragmented pathway |
Result Interpretation
Quality Metrics
Mapping Rate
# Check mapping statistics in log file
grep "total aligned" sample.log| Mapping Rate | Interpretation | Action |
|---|---|---|
| < 20% | Poor | Check data quality and database |
| 20-50% | Moderate | Acceptable for novel communities |
| 50-70% | Good | Standard performance |
| > 70% | Excellent | High-quality reference genomes |
Unmapped Reads
High unmapped rate may indicate:
- Novel taxa not in reference
- Viral sequences
- Host contamination (check QC)
- Low-quality sequencing
Functional Richness
# Count unique gene families
cut -f1 sample_genefamilies.tsv | tail -n +2 | wc -l
# Count pathways
cut -f1 sample_pathabundance.tsv | tail -n +2 | wc -lTypical ranges for human gut:
- Gene families: 100,000-500,000
- Pathways: 200-400
Top Functions
# Top gene families (after normalization)
sort -k2,2nr sample_genefamilies-cpm.tsv | head -20
# Top pathways
sort -k2,2nr sample_pathabundance-cpm.tsv | head -20Downstream Analysis
Normalization
# CPM (Copies Per Million) normalization
humann_renorm_table \
-i sample_genefamilies.tsv \
-o sample_genefamilies-cpm.tsv \
--units cpm
# Relative abundance (fraction of total)
humann_renorm_table \
-i sample_genefamilies.tsv \
-o sample_genefamilies-relab.tsv \
--units relabStratification Analysis
# Split stratified table
humann_split_stratified_table \
-i sample_genefamilies-cpm.tsv \
-o stratified/
# Outputs:
# stratified/sample_genefamilies-stratified.tsv
# stratified/sample_genefamilies-unstratified.tsvRegrouping to Other Ontologies
# To KEGG Orthology (KO)
humann_regroup_table \
-i sample_genefamilies.tsv \
-g uniref90_ko \
-o sample_ko.tsv
# To GO terms
humann_regroup_table \
-i sample_genefamilies.tsv \
-g uniref90_go \
-o sample_go.tsv
# To EC numbers
humann_regroup_table \
-i sample_genefamilies.tsv \
-g uniref90_ec \
-o sample_ec.tsvMulti-Sample Analysis
# Join multiple samples
humann_join_tables \
-i results/functional_annotation/ \
-o merged_genefamilies.tsv \
--file_name genefamilies
# Normalize merged table
humann_renorm_table \
-i merged_genefamilies.tsv \
-o merged_genefamilies-cpm.tsv \
--units cpmAssociation Testing
# Test association with metadata
humann_associate \
-i merged_genefamilies-cpm.tsv \
-m metadata.tsv \
-f group \
-o association_results.tsvDifferential Abundance
# Using Python/Pandas
import pandas as pd
from scipy import stats
# Load normalized data
df = pd.read_csv('merged_genefamilies-cpm.tsv',
sep='\t', index_col=0)
# Load metadata
meta = pd.read_csv('metadata.tsv', sep='\t', index_col=0)
# Perform t-test for each gene family
results = []
for gene in df.index:
group_a = df.loc[gene, meta['group'] == 'Control']
group_b = df.loc[gene, meta['group'] == 'Treatment']
t_stat, p_val = stats.ttest_ind(group_a, group_b)
results.append({'gene': gene, 'p_value': p_val})
results_df = pd.DataFrame(results)Advanced Topics
Custom Gene Families
# Create custom database
humann_databases --download chocophlan full /custom/path
# Add custom sequences
cat custom_genes.fa >> /custom/path/chocophlan/*.ffnFocused Pathway Analysis
# Analyze specific pathways only
humann --input sample.fastq \
--output output/ \
--pathways metacyc \
--gap-fill on \
--pathway minMetatranscriptome Integration
# Use HUMAnN with transcriptomic data
humann --input sample_rna.fastq \
--output rnaseq_output/ \
--bypass-prescreen # Don't skip protein searchTroubleshooting
Issue: Running Too Slow
Solutions:
# Use faster settings
functional_annotation:
humann:
diamond_options: "--fast"
threads: 32
# Or use UniRef50 instead of UniRef90
protein_database: "/db/uniref50"Issue: Memory Errors
Solutions:
# Reduce threads
humann --threads 8 ...
# Use memory-efficient mode
humann --memory-use minimum ...
# Process samples sequentially instead of parallelIssue: Most Reads Unmapped
Diagnostic steps:
# 1. Check input quality
fastqc sample.fastq
# 2. Verify database installation
ls -lh /db/humann/chocophlan/
ls -lh /db/humann/uniref90/
# 3. Check species coverage in MetaPhlAn output
# If unknown species dominate, need broader databaseIssue: Empty Pathway Results
Solutions:
functional_annotation:
humann:
gap_fill: true # Enable gap filling
minpath: false # Disable strict MinPathSee Also
- Taxonomic Profiling - Species classification
- Diversity Analysis - Community structure
- Configuration Guide - Parameter details
- HUMAnN Documentation - Official HUMAnN docs