enConfiguration

Configuration Guide

Complete reference for configuring MICOS-2024 analysis parameters.


Configuration Overview

MICOS-2024 uses a multi-layer configuration system based on YAML format. The system supports:

  • Modular configuration: Each analysis module has independent settings
  • Variable substitution: Use ${variable} syntax for reusable values
  • Configuration inheritance: Default → Project → Command-line overrides
  • Automatic validation: Check configuration before analysis starts

Configuration Hierarchy

1. Default values (built into code)

2. Configuration files (config/analysis.yaml)

3. Environment variables (MICOS_* variables)

4. Command-line arguments (highest priority)

Configuration Files

File Structure

config/
├── analysis.yaml          # Main analysis parameters
├── databases.yaml         # Database paths
├── samples.tsv           # Sample metadata
└── config.conf           # Cromwell workflow settings

Quick Setup

# Copy template files
cp config/analysis.yaml.template config/analysis.yaml
cp config/databases.yaml.template config/databases.yaml
cp config/samples.tsv.template config/samples.tsv
 
# Edit configurations
nano config/analysis.yaml
nano config/databases.yaml
nano config/samples.tsv

Project Configuration

Basic Project Information

project:
  name: "Gut_Microbiome_Study_2024"
  description: "Analysis of human gut microbiome in treatment vs control"
  version: "1.0.0"
  author: "Research Team"
  email: "team@example.com"

Path Configuration

paths:
  input_dir: "data/raw_input"           # Raw data directory
  output_dir: "results"                  # Results directory
  temp_dir: "/tmp/micos"                 # Temporary files (use fast storage)
  log_dir: "logs"                        # Log files directory
 
  # Database paths (can also use databases.yaml)
  databases:
    kraken2: "/data/databases/kraken2/standard"
    kneaddata: "/data/databases/kneaddata/human_genome"
    humann: "/data/databases/humann"

Path Variables

Use variable substitution for cleaner configuration:

paths:
  base_dir: "/project/micos_analysis"
  input_dir: "${paths.base_dir}/data"
  output_dir: "${paths.base_dir}/results"
  temp_dir: "${paths.base_dir}/tmp"

Resource Configuration

Compute Resources

resources:
  max_threads: 16              # Maximum parallel threads
  max_memory: "32GB"           # Maximum memory allocation
  max_time: "24h"              # Maximum runtime per task
 
  # Per-module thread allocation
  thread_allocation:
    quality_control: 8
    taxonomic_profiling: 16
    functional_annotation: 8
    diversity_analysis: 4

Resource Guidelines

Dataset SizeThreadsMemoryTemp Storage
Small (< 10 samples)816 GB50 GB
Medium (10-50 samples)1632 GB200 GB
Large (50-200 samples)3264 GB1 TB
Very Large (> 200 samples)64+128 GB2 TB+

Storage Optimization

resources:
  # Use SSD for temporary files
  temp_dir: "/ssd/tmp"
 
  # Enable compression
  compression:
    enabled: true
    level: 6              # gzip compression level (1-9)
 
  # Cleanup settings
  cleanup:
    remove_intermediate: true
    keep_logs: true

Module-Specific Parameters

Quality Control Module

quality_control:
  enabled: true
 
  fastqc:
    enabled: true
    threads: 4
    memory: "2GB"
 
  kneaddata:
    enabled: true
    threads: 8
 
    # Quality filtering
    min_quality: 20           # Minimum base quality score
    min_length: 50            # Minimum read length after trimming
 
    # Trimmomatic options
    trimmomatic_options: "SLIDINGWINDOW:4:20 MINLEN:50"
 
    # Host removal
    reference_db: "${paths.databases.kneaddata}"
 
    # Additional options
    remove_intermediate: true
    bypass_trf: false         # Skip tandem repeat filtering
    threads: 8

Taxonomic Profiling Module

taxonomic_profiling:
  enabled: true
 
  kraken2:
    enabled: true
    threads: 16
 
    # Classification parameters
    confidence: 0.1           # Confidence threshold (0-1)
                            # Lower = more classified reads, potentially more false positives
                            # Higher = fewer classified reads, higher precision
    min_base_quality: 20      # Minimum base quality for classification
    min_hit_groups: 2         # Minimum number of hit groups
 
    # Memory options
    memory_mapping: true      # Use memory-mapped I/O (faster, more memory)
 
    # Output options
    use_names: true           # Include taxonomic names in output
    report_zeros: false       # Include taxa with zero counts
 
  kraken_biom:
    enabled: true
    format: "hdf5"            # hdf5 or json
 
  krona:
    enabled: true
    max_depth: 7              # Maximum taxonomic depth for visualization

Kraken2 Confidence Parameter Guide

ConfidenceUse CaseExpected Classification Rate
0.0Maximum sensitivity80-95%
0.1Balanced (default)60-80%
0.3Higher precision40-60%
0.5Conservative20-40%

Diversity Analysis Module

diversity_analysis:
  enabled: true
 
  qiime2:
    enabled: true
 
    # Feature table filtering
    feature_filtering:
      min_frequency: 10       # Minimum count per feature
      min_samples: 3          # Minimum samples feature must appear in
 
    # Rarefaction depth (auto-detect if not specified)
    # sampling_depth: 10000
 
    # Alpha diversity metrics
    alpha_metrics:
      - "shannon"            # Shannon diversity index
      - "chao1"              # Chao1 richness estimator
      - "simpson"            # Simpson index
      - "observed_features"   # Number of observed ASVs/OTUs
      - "pielou_e"           # Pielou's evenness
 
    # Beta diversity metrics
    beta_metrics:
      - "braycurtis"         # Bray-Curtis dissimilarity
      - "jaccard"            # Jaccard distance
      - "unweighted_unifrac" # Unweighted UniFrac
      - "weighted_unifrac"   # Weighted UniFrac

Functional Annotation Module

functional_annotation:
  enabled: true
 
  humann:
    enabled: true
    threads: 8
 
    # Database paths
    nucleotide_database: "${paths.databases.humann}/chocophlan"
    protein_database: "${paths.databases.humann}/uniref90"
 
    # Search options
    search_mode: "diamond"    # diamond or bowtie2
 
    # Annotation databases
    pathway_database: "metacyc"   # metacyc or reactome
 
    # Output normalization
    normalization: "cpm"      # cpm (copies per million) or relab

Configuration Validation

Automatic Validation

MICOS-2024 validates configuration before running:

# Validate configuration
python -m micos.cli validate-config --config config/analysis.yaml
 
# Dry run to test configuration
python -m micos.cli full-run \
  --config config/analysis.yaml \
  --dry-run

Validation Checks

CheckDescriptionFailure Action
YAML syntaxValid YAML formatError + exit
Required fieldsAll mandatory fields presentError + exit
Path existenceInput/output directories existWarning/Error
Parameter rangesValues within valid rangesWarning
Database integrityDatabase files are validError + exit
Resource limitsMemory/threads within system limitsWarning

Best Practices

1. Start with Templates

# Always start from templates
cp config/*.template config/
# Then customize

2. Use Relative Paths

# Good - portable across systems
paths:
  input_dir: "data/raw"
  output_dir: "results"
 
# Less portable
paths:
  input_dir: "/home/user/specific/path/data"

3. Version Control Configuration

# Track configuration templates
git add config/*.template
 
# Don't track actual configs (may contain paths specific to your system)
echo "config/*.yaml" >> .gitignore

4. Document Customizations

project:
  name: "Study_2024"
 
analysis:
  # Increased confidence due to high-quality data
  kraken2:
    confidence: 0.15

5. Test with Subset

# test_config.yaml - analyze only first 3 samples
samples:
  subset: 3

Examples

Example 1: Quick Test Configuration

project:
  name: "Quick_Test"
 
paths:
  input_dir: "test_data"
  output_dir: "test_results"
 
resources:
  max_threads: 4
  max_memory: "8GB"
 
taxonomic_profiling:
  kraken2:
    confidence: 0.1
 
# Use MiniKraken for speed
databases:
  kraken2: "/db/minikraken2_v2_8GB"

Example 2: Production Pipeline

project:
  name: "Clinical_Study_Phase2"
  version: "2.1.0"
  description: "Large-scale clinical metagenomics analysis"
 
paths:
  base_dir: "/projects/clinical_study"
  input_dir: "${paths.base_dir}/raw_data"
  output_dir: "${paths.base_dir}/results"
  temp_dir: "/ssd/tmp"
 
resources:
  max_threads: 32
  max_memory: "128GB"
 
quality_control:
  kneaddata:
    min_quality: 25
    min_length: 75
    threads: 16
 
taxonomic_profiling:
  kraken2:
    confidence: 0.1
    threads: 32
    memory_mapping: true
 
diversity_analysis:
  qiime2:
    sampling_depth: 50000
    alpha_metrics: ["shannon", "chao1", "observed_features"]
    beta_metrics: ["braycurtis", "weighted_unifrac"]
 
differential_abundance:
  methods:
    deseq2:
      enabled: true
      alpha: 0.01
      fold_change_threshold: 4.0

See Also