WMS_STRAIN

WMS_STRAIN#

This module is a part of metaFun pipeline, providing strain-level analysis of whole metagenome sequencing data using InStrain.

Overview#

The WMS_STRAIN module performs strain-level analysis of metagenomic samples to characterize microdiversity within microbial populations. It utilizes InStrain to profile single nucleotide variants (SNVs), calculate nucleotide diversity metrics, and compare strain populations across samples. The module generates comprehensive metrics including pN/pS ratios for selection pressure analysis, nucleotide diversity (π) values, and strain sharing patterns between samples.

Module Execution#

# Basic usage with phyloseq object from WMS_TAXONOMY
(metafun) metafun -module WMS_STRAIN -i results/metagenome/RAWREAD_QC/read_filtered \
    --phyloseq_object results/metagenome/WMS_TAXONOMY/phyloseq/phyloseq_object_sylph.RDS

# With custom prevalence filtering
(metafun) metafun -module WMS_STRAIN -i results/metagenome/RAWREAD_QC/read_filtered \
    --phyloseq_object results/metagenome/WMS_TAXONOMY/phyloseq/phyloseq_object_sylph.RDS \
    --prevalence_threshold 10 --min_abundance 0.001

# With external metadata file
(metafun) metafun -module WMS_STRAIN -i results/metagenome/RAWREAD_QC/read_filtered \
    --phyloseq_object results/metagenome/WMS_TAXONOMY/phyloseq/phyloseq_object_sylph.RDS \
    -m metadata.csv -s 1

# Allocate specific CPU resources
(metafun) metafun -module WMS_STRAIN -i results/metagenome/RAWREAD_QC/read_filtered \
    --phyloseq_object results/metagenome/WMS_TAXONOMY/phyloseq/phyloseq_object_sylph.RDS \
    -p 24

Module Operation Sequence#

This module performs the following steps:

Prevalence filtering from phyloseq object:
- Filters taxa based on prevalence threshold (default: 5% of samples)
- Applies minimum abundance filter (default: 0.1%)
- Matches prevalent taxa to GTDB genomes
- Extracts sample metadata from phyloseq or external file
Genome preparation:
- Fetches reference genomes from GTDB database
- Concatenates genomes into combined reference FASTA
- Generates scaffold-to-bin (STB) mapping file
- Builds Bowtie2 index for read mapping
Gene annotation:
- Runs Prodigal on individual genomes (parallelized)
- Concatenates gene predictions (proteins, genes, GFF)
- Runs eggNOG-mapper for functional annotation (COG categories)
Read mapping to reference genomes:
- Maps quality-filtered reads to combined reference using Bowtie2
- Generates sorted BAM files for each sample
- Creates index files for efficient access
InStrain profiling on individual samples:
- Profiles each sample to detect SNVs at strain-level resolution
- Calculates nucleotide diversity (π) using Nei and Li (1979) method
- Computes pN/pS ratios to assess selection pressure
- Generates gene-level and genome-level metrics
InStrain comparison across samples:
- Compares strain populations across all samples
- Calculates popANI and conANI between sample pairs
- Identifies shared strains (default: 99.999% popANI threshold)
- Generates comparison matrices for downstream analysis
Data aggregation for visualization:
- Combines profile results from all samples
- Integrates with GTDB taxonomy and sample metadata
- Prepares RDS files for INTERACTIVE_STRAIN module

Parameters#

${launchDir} is the directory where you execute metaFun, and utilized as output base directory.

Parameter	Description	Default Value	Note
`-i, --input_dir`	Input directory containing filtered reads	Required	Output from RAWREAD_QC workflow
`--phyloseq_object`	Phyloseq RDS file from WMS_TAXONOMY	Required	Path to phyloseq object for prevalence filtering
`-m, --metadata`	Path to metadata file	Optional	CSV file; if not provided, extracts from phyloseq
`-s, --sampleIDcolumn`	Column number for sample IDs in metadata	`1`	Matches sample IDs in read filenames
`--prevalence_threshold`	Minimum % of samples for prevalence	`5`	Taxa must be present in this % of samples
`--min_abundance`	Minimum relative abundance threshold	`0.001`	Filter taxa below 0.1% abundance
`--min_coverage`	Minimum coverage for InStrain	`5`	Higher values = more confident calls
`--min_freq`	Minimum SNP frequency threshold	`0.05`	Filter low-frequency variants
`--min_read_ani`	ANI threshold for read assignment	`0.92`	0.92=strain, 0.95=species, 0.99=clonal
`-p, --cpus`	Number of CPUs to use	Auto-detected	Adjust based on your system capabilities
`-o, --outdir`	Output directory	`"${launchDir}/results/metagenome/WMS_STRAIN"`	Where results will be saved

Inputs and Outputs#

Inputs#

Quality-controlled paired-end metagenomic reads (output from RAWREAD_QC workflow)
Phyloseq object from WMS_TAXONOMY (contains taxonomy and sample metadata)
Optional: External metadata file (CSV format) to replace phyloseq metadata

Outputs#

Prevalent taxa list with GTDB genome matches
InStrain profile results for each sample
Nucleotide diversity metrics per genome
pN/pS ratio tables (gene-level and genome-wide)
Strain comparison matrices (popANI, conANI)
Preprocessed RDS files for INTERACTIVE_STRAIN visualization

Output directory structure#

The output is organized in the following directory structure:

Output directory structure#

${launchDir}/results/metagenome/WMS_STRAIN/
├── 01_prevalent_taxa/                    # Prevalence filtering results
│   ├── prevalent_taxa_taxonomy_ids.txt   # Filtered taxa IDs
│   ├── prevalent_taxa_metadata.tsv       # GTDB metadata for prevalent taxa
│   ├── prevalent_taxa_genome_paths.txt   # Paths to GTDB genomes
│   ├── prevalence_summary.tsv            # Prevalence statistics
│   └── sample_metadata.csv               # Extracted/provided sample metadata
├── 02_genome_prep/                       # Genome preparation
│   ├── all_genomes_combined.fa           # Concatenated reference genomes
│   ├── prevalent_taxa.stb               # Scaffold-to-bin mapping
│   └── bowtie2_index/                    # Bowtie2 index files
├── 03_gene_annotation/                   # Gene predictions and annotations
│   ├── genes.faa                         # Protein sequences
│   ├── genes.fna                         # Gene sequences
│   ├── genes.gff                         # Gene annotations (GFF format)
│   └── eggnog_results.emapper.annotations # eggNOG functional annotations
├── 04_bam_files/                         # Read mapping results
│   ├── ${sample_id}.sorted.bam           # Sorted BAM files
│   └── ${sample_id}.sorted.bam.bai       # BAM index files
├── 05_instrain_profiles/                 # InStrain profile results
│   ├── ${sample_id}_instrain_profile/    # Sample-specific profile directory
│   │   ├── output/                       # Main output files
│   │   │   ├── ${sample_id}_genome_info.tsv    # Genome-level metrics
│   │   │   ├── ${sample_id}_gene_info.tsv      # Gene-level metrics
│   │   │   └── ${sample_id}_scaffold_info.tsv  # Scaffold metrics
│   │   └── raw_data/
│   │       └── genes_SNP_count.csv.gz    # SNP counts for pN/pS calculation
│   └── validation_summary.txt            # Profile validation results
├── 06_instrain_compare/                  # InStrain comparison results
│   └── instrainComparer_output/
│       └── output/
│           ├── comparisonsTable.tsv      # Pairwise sample comparisons
│           └── genomeWide_compare.tsv    # popANI/conANI metrics
└── 07_shiny_data/                        # Preprocessed data for visualization
    ├── integrated_microbiome_data.rds    # Combined R data object
    ├── pN_pS_gene_level.rds              # Gene-level pN/pS data
    ├── pN_pS_genome_wide.rds             # Genome-wide pN/pS data
    └── eggnog_annotations_subset.rds     # Filtered eggNOG annotations

Key Metrics Explained#

Nucleotide Diversity (π)#

Nucleotide diversity measures within-population genetic variation using the Nei and Li (1979) method:

Formula: π = 1 - Σ(frequency of each base)²

InStrain calculates π at every genomic position with sufficient coverage (≥5x by default) and averages across genes/genomes. This metric is robust to coverage differences between samples.

High π: Indicates diverse strain population with many SNVs
Low π: Indicates homogeneous population or recent selective sweep

pN/pS Ratio#

The ratio of nonsynonymous to synonymous substitutions, indicating selection pressure:

pN/pS < 1: Purifying (negative) selection - deleterious mutations removed
pN/pS ≈ 1: Neutral evolution - no selective pressure
pN/pS > 1: Positive selection - beneficial mutations favored

Population ANI (popANI) and Consensus ANI (conANI)#

InStrain calculates two ANI metrics for strain comparison:

popANI (population-level ANI): Considers both major and minor alleles, accounting for within-sample microdiversity. A popANI substitution is called only when no alleles are shared between samples.
conANI (consensus ANI): Compares only consensus (major) alleles between samples. A conANI substitution is called when samples have different major alleles.

The default threshold of 99.999% popANI is recommended for identifying shared strains. For more details, see the InStrain documentation.

Nextflow Processes in WMS_STRAIN Module#

Process	InputDir	OutputDir	Note
prevalence_filter_phyloseq	Phyloseq RDS	`01_prevalent_taxa`	Filters taxa by prevalence
concat_genomes	GTDB genome paths	`02_genome_prep`	Concatenates reference genomes
generate_stb	GTDB genome paths	`02_genome_prep`	Creates scaffold-to-bin mapping
bowtie2_build	Combined FASTA	`02_genome_prep/bowtie2_index`	Builds Bowtie2 index
prodigal_per_genome	Individual genomes	Work directory	Predicts genes (parallelized)
concat_gene_predictions	Prodigal outputs	`03_gene_annotation`	Concatenates gene predictions
eggnog_mapper	Protein sequences	`03_gene_annotation`	Functional annotation
bowtie2_mapping	Filtered reads	`04_bam_files`	Maps reads to reference
instrain_profile	BAM files	`05_instrain_profiles`	Profiles strain-level variation
validate_instrain_profiles	Profile directories	`05_instrain_profiles`	Validates profiles
instrain_compare	Valid profiles	`06_instrain_compare`	Compares strains across samples
aggregate_for_shiny	All outputs	`07_shiny_data`	Prepares data for visualization

Tools Used in WMS_STRAIN#

Tool	Purpose	Version	Default parameters	Parameters that can be selected
Bowtie2	Read mapping	2.5+	`--sensitive-local --no-unal`	`--threads ${task.cpus}`
InStrain	Strain profiling	1.8+	`--min_cov 5`, `--database_mode`	`-p ${task.cpus}`, `--min_read_ani`
Prodigal	Gene prediction	2.6.3	`-p single` (per genome)	N/A
eggNOG-mapper	Functional annotation	2.1+	`-m mmseqs`	`--cpu ${task.cpus}`
samtools	BAM processing	1.17+	`-@ ${task.cpus}`	N/A

Usage Notes#

The WMS_STRAIN module requires a phyloseq object from WMS_TAXONOMY for prevalence filtering.
Prevalent taxa are automatically matched to GTDB genomes for reference-based analysis.
Metadata can be extracted from the phyloseq object or provided separately via -m parameter.
InStrain requires significant computational resources. CPU allocation is auto-detected but can be adjusted with -p.
Higher coverage samples provide more reliable strain-level metrics. Samples with average coverage < 5x may have limited strain resolution.
The module automatically validates InStrain profiles and skips comparison if fewer than 2 valid profiles exist.
For optimal results, use appropriate --prevalence_threshold to focus on abundant, prevalent taxa.

Next Steps#

After running WMS_STRAIN, you can:

Explore the results interactively using the INTERACTIVE_STRAIN module:

(metafun) metafun -module INTERACTIVE_STRAIN -i results/metagenome/WMS_STRAIN/07_shiny_data

Perform deeper analysis by:
- Examining nucleotide diversity patterns across conditions
- Analyzing pN/pS ratios to identify genes under selection (COG functional categories)
- Comparing strain populations between sample groups using popANI thresholds
- Correlating strain-level metrics with metadata variables

The WMS_STRAIN module provides comprehensive insights into strain-level diversity within microbial communities, enabling researchers to understand not just which species are present, but the genetic variation within those species.

WMS_STRAIN

Contents

WMS_STRAIN#

Overview#

Module Execution#

Module Operation Sequence#

Parameters#

Inputs and Outputs#

Inputs#

Outputs#

Output directory structure#

Key Metrics Explained#

Nucleotide Diversity (π)#

pN/pS Ratio#

Population ANI (popANI) and Consensus ANI (conANI)#

Nextflow Processes in WMS_STRAIN Module#

Tools Used in WMS_STRAIN#

Usage Notes#

Next Steps#