WMS_STRAIN#
This module is part of the metaFun pipeline and provides strain-level analysis of whole metagenome sequencing (WMS) data using InStrain.
Overview#
The WMS_STRAIN module performs strain-level analysis of metagenomic samples to characterize microdiversity within microbial populations. It utilizes InStrain to profile single nucleotide variants (SNVs), calculate nucleotide diversity metrics, and compare strain populations across samples. The module generates comprehensive metrics including pN/pS ratios for selection pressure analysis, nucleotide diversity (π) values, and strain sharing patterns between samples.
Module Execution#
# Basic usage with phyloseq object from WMS_TAXONOMY
(metafun) metafun -module WMS_STRAIN -i results/metagenome/RAWREAD_QC/read_filtered \
--phyloseq_object results/metagenome/WMS_TAXONOMY/phyloseq/phyloseq_object_sylph.RDS
# With custom prevalence filtering
(metafun) metafun -module WMS_STRAIN -i results/metagenome/RAWREAD_QC/read_filtered \
--phyloseq_object results/metagenome/WMS_TAXONOMY/phyloseq/phyloseq_object_sylph.RDS \
--prevalence_threshold 10 --min_abundance 0.001
# With external metadata file
(metafun) metafun -module WMS_STRAIN -i results/metagenome/RAWREAD_QC/read_filtered \
--phyloseq_object results/metagenome/WMS_TAXONOMY/phyloseq/phyloseq_object_sylph.RDS \
-m metadata.csv -s 1
# Allocate specific CPU resources
(metafun) metafun -module WMS_STRAIN -i results/metagenome/RAWREAD_QC/read_filtered \
--phyloseq_object results/metagenome/WMS_TAXONOMY/phyloseq/phyloseq_object_sylph.RDS \
-p 24
Module Operation Sequence#
This module performs the following steps:
1. Prevalence filtering from phyloseq object:
   - Filters taxa based on prevalence threshold (default: 5% of samples)
   - Applies minimum abundance filter (default: 0.1%)
   - Matches prevalent taxa to GTDB genomes
   - Extracts sample metadata from phyloseq or external file
2. Genome preparation:
   - Fetches reference genomes from GTDB database
   - Concatenates genomes into combined reference FASTA
   - Generates scaffold-to-bin (STB) mapping file
   - Builds Bowtie2 index for read mapping
3. Gene annotation:
   - Runs Prodigal on individual genomes (parallelized)
   - Concatenates gene predictions (proteins, genes, GFF)
   - Runs eggNOG-mapper for functional annotation (COG categories)
4. Read mapping to reference genomes:
   - Maps quality-filtered reads to combined reference using Bowtie2
   - Generates sorted BAM files for each sample
   - Creates index files for efficient access
5. InStrain profiling on individual samples:
   - Profiles each sample to detect SNVs at strain-level resolution
   - Calculates nucleotide diversity (π) using Nei and Li (1979) method
   - Computes pN/pS ratios to assess selection pressure
   - Generates gene-level and genome-level metrics
6. InStrain comparison across samples:
   - Compares strain populations across all samples
   - Calculates popANI and conANI between sample pairs
   - Identifies shared strains (default: 99.999% popANI threshold)
   - Generates comparison matrices for downstream analysis
7. Data aggregation for visualization:
   - Combines profile results from all samples
   - Integrates with GTDB taxonomy and sample metadata
   - Prepares RDS files for INTERACTIVE_STRAIN module
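The prevalence and abundance filtering in the first step can be illustrated with a minimal Python sketch. This is not the module's actual code: the function name `prevalent_taxa`, the input layout, and the rule that a taxon counts as "present" in a sample when its relative abundance reaches `min_abundance` are all assumptions made for illustration, and the taxon and sample names are invented.

```python
def prevalent_taxa(abundance, prevalence_threshold=5.0, min_abundance=0.001):
    """Return taxa present in at least `prevalence_threshold` percent of samples.

    abundance: {taxon: {sample: relative abundance as a fraction}}.
    Assumption: a taxon is "present" in a sample when its abundance is
    >= min_abundance; the real module may define presence differently.
    """
    kept = []
    for taxon, per_sample in abundance.items():
        n_samples = len(per_sample)
        if n_samples == 0:
            continue  # no samples, prevalence is undefined
        n_present = sum(1 for value in per_sample.values() if value >= min_abundance)
        prevalence = 100.0 * n_present / n_samples
        if prevalence >= prevalence_threshold:
            kept.append(taxon)
    return sorted(kept)


# Invented toy data: two taxa across four samples.
example = {
    "taxon_A": {"S1": 0.12, "S2": 0.08, "S3": 0.0, "S4": 0.02},
    "taxon_B": {"S1": 0.0002, "S2": 0.0, "S3": 0.0, "S4": 0.0},
}
print(prevalent_taxa(example, prevalence_threshold=50))
```

In this toy case `taxon_A` passes the abundance cutoff in 3 of 4 samples (75%), while `taxon_B` never reaches it, so only `taxon_A` would go on to genome matching.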
Parameters#
${launchDir} is the directory from which you execute metaFun; it is also used as the base directory for output.
| Parameter | Description | Default Value | Note |
|---|---|---|---|
| | Input directory containing filtered reads | Required | Output from RAWREAD_QC workflow |
| | Phyloseq RDS file from WMS_TAXONOMY | Required | Path to phyloseq object for prevalence filtering |
| | Path to metadata file | Optional | CSV file; if not provided, extracts from phyloseq |
| | Column number for sample IDs in metadata | | Matches sample IDs in read filenames |
| | Minimum % of samples for prevalence | | Taxa must be present in this % of samples |
| | Minimum relative abundance threshold | | Filter taxa below 0.1% abundance |
| | Minimum coverage for InStrain | | Higher values = more confident calls |
| | Minimum SNP frequency threshold | | Filter low-frequency variants |
| | ANI threshold for read assignment | | 0.92=strain, 0.95=species, 0.99=clonal |
| | Number of CPUs to use | Auto-detected | Adjust based on your system capabilities |
| | Output directory | | Where results will be saved |
Inputs and Outputs#
Inputs#
Quality-controlled paired-end metagenomic reads (output from RAWREAD_QC workflow)
Phyloseq object from WMS_TAXONOMY (contains taxonomy and sample metadata)
Optional: External metadata file (CSV format) to replace phyloseq metadata
Outputs#
Prevalent taxa list with GTDB genome matches
InStrain profile results for each sample
Nucleotide diversity metrics per genome
pN/pS ratio tables (gene-level and genome-wide)
Strain comparison matrices (popANI, conANI)
Preprocessed RDS files for INTERACTIVE_STRAIN visualization
Output directory structure#
The output is organized in the following directory structure:
${launchDir}/results/metagenome/WMS_STRAIN/
├── 01_prevalent_taxa/ # Prevalence filtering results
│ ├── prevalent_taxa_taxonomy_ids.txt # Filtered taxa IDs
│ ├── prevalent_taxa_metadata.tsv # GTDB metadata for prevalent taxa
│ ├── prevalent_taxa_genome_paths.txt # Paths to GTDB genomes
│ ├── prevalence_summary.tsv # Prevalence statistics
│ └── sample_metadata.csv # Extracted/provided sample metadata
├── 02_genome_prep/ # Genome preparation
│ ├── all_genomes_combined.fa # Concatenated reference genomes
│ ├── prevalent_taxa.stb # Scaffold-to-bin mapping
│ └── bowtie2_index/ # Bowtie2 index files
├── 03_gene_annotation/ # Gene predictions and annotations
│ ├── genes.faa # Protein sequences
│ ├── genes.fna # Gene sequences
│ ├── genes.gff # Gene annotations (GFF format)
│ └── eggnog_results.emapper.annotations # eggNOG functional annotations
├── 04_bam_files/ # Read mapping results
│ ├── ${sample_id}.sorted.bam # Sorted BAM files
│ └── ${sample_id}.sorted.bam.bai # BAM index files
├── 05_instrain_profiles/ # InStrain profile results
│ ├── ${sample_id}_instrain_profile/ # Sample-specific profile directory
│ │ ├── output/ # Main output files
│ │ │ ├── ${sample_id}_genome_info.tsv # Genome-level metrics
│ │ │ ├── ${sample_id}_gene_info.tsv # Gene-level metrics
│ │ │ └── ${sample_id}_scaffold_info.tsv # Scaffold metrics
│ │ └── raw_data/
│ │ └── genes_SNP_count.csv.gz # SNP counts for pN/pS calculation
│ └── validation_summary.txt # Profile validation results
├── 06_instrain_compare/ # InStrain comparison results
│ └── instrainComparer_output/
│ └── output/
│ ├── comparisonsTable.tsv # Pairwise sample comparisons
│ └── genomeWide_compare.tsv # popANI/conANI metrics
└── 07_shiny_data/ # Preprocessed data for visualization
├── integrated_microbiome_data.rds # Combined R data object
├── pN_pS_gene_level.rds # Gene-level pN/pS data
├── pN_pS_genome_wide.rds # Genome-wide pN/pS data
└── eggnog_annotations_subset.rds # Filtered eggNOG annotations
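As a convenience, the per-sample InStrain files in the tree above can be located programmatically. The following is a small sketch that only encodes the directory layout shown here; the helper names `profile_outputs` and `missing_outputs` are hypothetical and not part of metaFun.

```python
from pathlib import Path


def profile_outputs(base_dir, sample_id):
    """Build the expected InStrain profile paths for one sample.

    base_dir is the WMS_STRAIN results directory, for example
    results/metagenome/WMS_STRAIN relative to the launch directory.
    """
    profile_dir = Path(base_dir) / "05_instrain_profiles" / f"{sample_id}_instrain_profile"
    output_dir = profile_dir / "output"
    return {
        "genome_info": output_dir / f"{sample_id}_genome_info.tsv",
        "gene_info": output_dir / f"{sample_id}_gene_info.tsv",
        "scaffold_info": output_dir / f"{sample_id}_scaffold_info.tsv",
        "snp_counts": profile_dir / "raw_data" / "genes_SNP_count.csv.gz",
    }


def missing_outputs(base_dir, sample_id):
    """Return the names of expected files that do not exist on disk."""
    paths = profile_outputs(base_dir, sample_id)
    return [name for name, path in paths.items() if not path.exists()]


paths = profile_outputs("results/metagenome/WMS_STRAIN", "S1")
print(paths["genome_info"].as_posix())
```

A non-empty result from `missing_outputs` would suggest that profiling did not complete for that sample, which is worth checking against `validation_summary.txt`.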
Key Metrics Explained#
Nucleotide Diversity (π)#
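Nucleotide diversity measures within-population genetic variation, following the Nei and Li (1979) method.
Formula (per position): π = 1 - Σ_b (f_b)², where f_b is the frequency of base b (A, C, G, or T) among the reads covering that position.
InStrain calculates π at every genomic position with sufficient coverage (≥5x by default) and averages the per-position values across each gene or genome. Because π is computed from base frequencies rather than raw counts, it is relatively robust to coverage differences between samples.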
High π: Indicates diverse strain population with many SNVs
Low π: Indicates homogeneous population or recent selective sweep
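To make the formula concrete, here is a minimal Python sketch of the per-position calculation and the averaging step. It is an illustration of the formula as stated above, not InStrain's implementation; the function names and the dictionary-of-counts input are assumptions, and the `min_cov=5` default simply mirrors the coverage cutoff mentioned in the text.

```python
def position_pi(base_counts):
    """Nucleotide diversity at one position: 1 - sum of squared base frequencies.

    base_counts: {base: read count}, e.g. {"A": 5, "C": 5}.
    """
    total = sum(base_counts.values())
    if total == 0:
        raise ValueError("position has no coverage")
    return 1.0 - sum((count / total) ** 2 for count in base_counts.values())


def mean_pi(positions, min_cov=5):
    """Average per-position pi over positions with coverage >= min_cov.

    Returns None when no position passes the coverage filter.
    """
    values = [position_pi(counts) for counts in positions
              if sum(counts.values()) >= min_cov]
    return sum(values) / len(values) if values else None


# A 50/50 biallelic site is maximally diverse for two alleles; a fixed site has pi = 0.
print(position_pi({"A": 5, "C": 5}))
print(position_pi({"A": 10}))
# The third position has only 2x coverage and is skipped.
print(mean_pi([{"A": 5, "C": 5}, {"A": 10}, {"A": 2}]))
```

The first call gives 0.5, the second 0.0, and the mean over the two sufficiently covered positions is 0.25.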
pN/pS Ratio#
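The ratio of the nonsynonymous polymorphism rate (pN, nonsynonymous SNVs per nonsynonymous site) to the synonymous polymorphism rate (pS, synonymous SNVs per synonymous site) within a population, used as an indicator of selection pressure: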
pN/pS < 1: Purifying (negative) selection - deleterious mutations removed
pN/pS ≈ 1: Neutral evolution - no selective pressure
pN/pS > 1: Positive selection - beneficial mutations favored
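The ratio and its interpretation can be sketched in a few lines of Python. This follows the textbook definition (SNV counts normalized by site counts) and is not taken from InStrain or metaFun; the function names and the `tolerance` band used to call a value "approximately 1" are illustrative assumptions, and real analyses would usually also require a minimum number of SNVs per gene.

```python
def pn_ps(n_snvs, s_snvs, n_sites, s_sites):
    """pN/pS = (nonsynonymous SNVs / nonsynonymous sites)
             / (synonymous SNVs / synonymous sites).

    Returns None when the ratio is undefined (no sites, or no synonymous SNVs).
    """
    if n_sites <= 0 or s_sites <= 0 or s_snvs == 0:
        return None
    return (n_snvs / n_sites) / (s_snvs / s_sites)


def classify(ratio, tolerance=0.1):
    """Map a pN/pS value onto the three regimes listed above.

    tolerance is a hypothetical width for the 'approximately 1' band.
    """
    if ratio is None:
        return "undefined"
    if ratio < 1 - tolerance:
        return "purifying"
    if ratio > 1 + tolerance:
        return "positive"
    return "neutral"


# Invented example gene: 3 nonsynonymous SNVs over 750 sites,
# 6 synonymous SNVs over 250 sites.
ratio = pn_ps(3, 6, 750, 250)
print(ratio, classify(ratio))
```

Here pN = 0.004 and pS = 0.024, so the ratio is about 0.17, which falls in the purifying-selection range.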
Population ANI (popANI) and Consensus ANI (conANI)#
InStrain calculates two ANI metrics for strain comparison:
popANI (population-level ANI): Considers both major and minor alleles, accounting for within-sample microdiversity. A popANI substitution is called only when no alleles are shared between samples.
conANI (consensus ANI): Compares only consensus (major) alleles between samples. A conANI substitution is called when samples have different major alleles.
The default threshold of 99.999% popANI is recommended for identifying shared strains. For more details, see the InStrain documentation.
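The difference between the two substitution rules can be seen in a toy Python sketch. It is a deliberate simplification: the data layout and function names are hypothetical, and a plain frequency cutoff (`min_freq`) stands in for the statistical test InStrain uses to decide whether a minor allele is real.

```python
def major_allele(counts):
    """Most frequent base at a position (first one wins on ties)."""
    return max(counts, key=counts.get)


def detected_alleles(counts, min_freq=0.05):
    """Bases whose frequency reaches min_freq (simplified allele detection)."""
    total = sum(counts.values())
    return {base for base, n in counts.items() if total and n / total >= min_freq}


def compare_samples(sample1, sample2, min_freq=0.05):
    """Compute toy conANI and popANI over positions covered in both samples.

    sample1/sample2: {position: {base: read count}}.
    """
    compared = con_subs = pop_subs = 0
    for pos in sample1.keys() & sample2.keys():
        c1, c2 = sample1[pos], sample2[pos]
        compared += 1
        # conANI: substitution when the major alleles differ.
        if major_allele(c1) != major_allele(c2):
            con_subs += 1
        # popANI: substitution only when the samples share no allele at all.
        if not (detected_alleles(c1, min_freq) & detected_alleles(c2, min_freq)):
            pop_subs += 1
    if compared == 0:
        return None
    return {"conANI": 1 - con_subs / compared, "popANI": 1 - pop_subs / compared}


s1 = {1: {"A": 10}, 2: {"A": 6, "G": 4}, 3: {"C": 10}, 4: {"T": 10}}
s2 = {1: {"A": 10}, 2: {"G": 9, "A": 1}, 3: {"T": 10}, 4: {"T": 10}}
result = compare_samples(s1, s2)
print(result)
```

Position 2 counts as a conANI substitution (major allele A versus G) but not as a popANI substitution, because both samples carry A and G. Position 3 shares no allele and counts for both. The result is conANI = 0.5 and popANI = 0.75, which shows why popANI is always at least as high as conANI.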
Nextflow Processes in WMS_STRAIN Module#
| Process | InputDir | OutputDir | Note |
|---|---|---|---|
| prevalence_filter_phyloseq | Phyloseq RDS | | Filters taxa by prevalence |
| concat_genomes | GTDB genome paths | | Concatenates reference genomes |
| generate_stb | GTDB genome paths | | Creates scaffold-to-bin mapping |
| bowtie2_build | Combined FASTA | | Builds Bowtie2 index |
| prodigal_per_genome | Individual genomes | Work directory | Predicts genes (parallelized) |
| concat_gene_predictions | Prodigal outputs | | Concatenates gene predictions |
| eggnog_mapper | Protein sequences | | Functional annotation |
| bowtie2_mapping | Filtered reads | | Maps reads to reference |
| instrain_profile | BAM files | | Profiles strain-level variation |
| validate_instrain_profiles | Profile directories | | Validates profiles |
| instrain_compare | Valid profiles | | Compares strains across samples |
| aggregate_for_shiny | All outputs | | Prepares data for visualization |
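For readers unfamiliar with the scaffold-to-bin (STB) file produced by `generate_stb`, the sketch below shows the general idea in Python: one tab-separated line per scaffold, with the scaffold name first and the genome it belongs to second. How the module actually names genomes is not documented here, so using the genome file name as the bin label is an assumption, as are the function name and the toy FASTA content.

```python
def stb_lines(genomes):
    """Build scaffold-to-bin lines from FASTA content.

    genomes: {genome_name: iterable of FASTA lines}.
    The scaffold name is the header text up to the first whitespace.
    """
    lines = []
    for genome_name, fasta_lines in genomes.items():
        for line in fasta_lines:
            if line.startswith(">"):
                scaffold = line[1:].split()[0]
                lines.append(f"{scaffold}\t{genome_name}")
    return lines


# Invented toy genomes; real input would be read from the GTDB genome files.
toy_genomes = {
    "genomeA.fna": [">contig_1 len=10", "ACGTACGTAC", ">contig_2", "GGCC"],
    "genomeB.fna": [">scaf1", "TTAA"],
}
for entry in stb_lines(toy_genomes):
    print(entry)
```

Because the genomes are concatenated into a single reference, scaffold names must be unique across genomes for this mapping to be unambiguous.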
Tools Used in WMS_STRAIN#
| Tool | Purpose | Version | Default parameters | Parameters that can be selected |
|---|---|---|---|---|
| Bowtie2 | Read mapping | 2.5+ | | |
| InStrain | Strain profiling | 1.8+ | | |
| Prodigal | Gene prediction | 2.6.3 | | N/A |
| eggNOG-mapper | Functional annotation | 2.1+ | | |
| samtools | BAM processing | 1.17+ | | N/A |
Usage Notes#
The WMS_STRAIN module requires a phyloseq object from WMS_TAXONOMY for prevalence filtering.
Prevalent taxa are automatically matched to GTDB genomes for reference-based analysis.
Metadata can be extracted from the phyloseq object or provided separately via the -m parameter.
InStrain requires significant computational resources. CPU allocation is auto-detected but can be adjusted with -p.
Higher coverage samples provide more reliable strain-level metrics. Samples with average coverage < 5x may have limited strain resolution.
The module automatically validates InStrain profiles and skips comparison if fewer than 2 valid profiles exist.
For optimal results, use an appropriate --prevalence_threshold to focus on abundant, prevalent taxa.
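The coverage and validation notes above can be applied when post-processing results. The Python sketch below filters a genome-level table by coverage and mirrors the "at least 2 valid profiles" rule. The column names `genome` and `coverage` are assumed to match the InStrain genome_info.tsv header and should be checked against your own files; the helper names and the toy table are invented.

```python
import csv
import io


def confident_genomes(genome_info_tsv, min_coverage=5.0):
    """Return genomes whose mean coverage is at least min_coverage.

    genome_info_tsv: the text of a tab-separated table with (assumed)
    'genome' and 'coverage' columns.
    """
    reader = csv.DictReader(io.StringIO(genome_info_tsv), delimiter="\t")
    return [row["genome"] for row in reader
            if float(row["coverage"]) >= min_coverage]


def should_run_compare(n_valid_profiles):
    """Comparison needs at least two valid profiles, as noted above."""
    return n_valid_profiles >= 2


# Invented toy table for illustration only.
toy_table = (
    "genome\tcoverage\tnucl_diversity\n"
    "genomeA.fna\t12.4\t0.0031\n"
    "genomeB.fna\t3.2\t0.0008\n"
)
print(confident_genomes(toy_table))
print(should_run_compare(1))
```

In practice the text would come from a sample's genome-level metrics file, read with `Path(...).read_text()`.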
Next Steps#
After running WMS_STRAIN, you can:
Explore the results interactively using the INTERACTIVE_STRAIN module:
(metafun) metafun -module INTERACTIVE_STRAIN -i results/metagenome/WMS_STRAIN/07_shiny_data
Perform deeper analysis by:
Examining nucleotide diversity patterns across conditions
Analyzing pN/pS ratios to identify genes under selection (COG functional categories)
Comparing strain populations between sample groups using popANI thresholds
Correlating strain-level metrics with metadata variables
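As one example of such downstream analysis, the Python sketch below flags sample pairs that share a strain by applying a popANI threshold to a comparison table. It assumes the table has `genome`, `name1`, `name2`, and `popANI` columns and that popANI is stored as a fraction (so 99.999% is 0.99999); verify both against your genomeWide_compare.tsv before relying on it. The function name and the toy rows are invented.

```python
import csv
import io


def shared_strains(compare_tsv, pop_ani_threshold=0.99999):
    """Return (genome, sample1, sample2) tuples with popANI >= threshold.

    compare_tsv: the text of a tab-separated comparison table with
    (assumed) 'genome', 'name1', 'name2' and 'popANI' columns.
    """
    reader = csv.DictReader(io.StringIO(compare_tsv), delimiter="\t")
    shared = []
    for row in reader:
        if float(row["popANI"]) >= pop_ani_threshold:
            shared.append((row["genome"], row["name1"], row["name2"]))
    return shared


# Invented toy comparison rows for illustration only.
toy_compare = (
    "genome\tname1\tname2\tpopANI\tconANI\n"
    "genomeA.fna\tS1.sorted.bam\tS2.sorted.bam\t0.999999\t0.99995\n"
    "genomeA.fna\tS1.sorted.bam\tS3.sorted.bam\t0.9991\t0.9987\n"
)
print(shared_strains(toy_compare))
```

Only the S1/S2 pair exceeds the threshold here. Grouping the resulting pairs by a metadata variable is one way to test whether strain sharing is enriched within sample groups.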
The WMS_STRAIN module provides comprehensive insights into strain-level diversity within microbial communities, enabling researchers to understand not just which species are present, but the genetic variation within those species.