WMS_STRAIN#

This module is a part of metaFun pipeline, providing strain-level analysis of whole metagenome sequencing data using InStrain.

Overview#

The WMS_STRAIN module performs strain-level analysis of metagenomic samples to characterize microdiversity within microbial populations. It utilizes InStrain to profile single nucleotide variants (SNVs), calculate nucleotide diversity metrics, and compare strain populations across samples. The module generates comprehensive metrics including pN/pS ratios for selection pressure analysis, nucleotide diversity (π) values, and strain sharing patterns between samples.

Module Execution#

# Basic usage with phyloseq object from WMS_TAXONOMY
(metafun) metafun -module WMS_STRAIN -i results/metagenome/RAWREAD_QC/read_filtered \
    --phyloseq_object results/metagenome/WMS_TAXONOMY/phyloseq/phyloseq_object_sylph.RDS

# With custom prevalence filtering
(metafun) metafun -module WMS_STRAIN -i results/metagenome/RAWREAD_QC/read_filtered \
    --phyloseq_object results/metagenome/WMS_TAXONOMY/phyloseq/phyloseq_object_sylph.RDS \
    --prevalence_threshold 10 --min_abundance 0.001

# With external metadata file
(metafun) metafun -module WMS_STRAIN -i results/metagenome/RAWREAD_QC/read_filtered \
    --phyloseq_object results/metagenome/WMS_TAXONOMY/phyloseq/phyloseq_object_sylph.RDS \
    -m metadata.csv -s 1

# Allocate specific CPU resources
(metafun) metafun -module WMS_STRAIN -i results/metagenome/RAWREAD_QC/read_filtered \
    --phyloseq_object results/metagenome/WMS_TAXONOMY/phyloseq/phyloseq_object_sylph.RDS \
    -p 24

Module Operation Sequence#

This module performs the following steps:

  1. Prevalence filtering from phyloseq object:

    • Filters taxa based on prevalence threshold (default: 5% of samples)

    • Applies minimum abundance filter (default: 0.1%)

    • Matches prevalent taxa to GTDB genomes

    • Extracts sample metadata from phyloseq or external file

  2. Genome preparation:

    • Fetches reference genomes from GTDB database

    • Concatenates genomes into combined reference FASTA

    • Generates scaffold-to-bin (STB) mapping file

    • Builds Bowtie2 index for read mapping

  3. Gene annotation:

    • Runs Prodigal on individual genomes (parallelized)

    • Concatenates gene predictions (proteins, genes, GFF)

    • Runs eggNOG-mapper for functional annotation (COG categories)

  4. Read mapping to reference genomes:

    • Maps quality-filtered reads to combined reference using Bowtie2

    • Generates sorted BAM files for each sample

    • Creates index files for efficient access

  5. InStrain profiling on individual samples:

    • Profiles each sample to detect SNVs at strain-level resolution

    • Calculates nucleotide diversity (π) using Nei and Li (1979) method

    • Computes pN/pS ratios to assess selection pressure

    • Generates gene-level and genome-level metrics

  6. InStrain comparison across samples:

    • Compares strain populations across all samples

    • Calculates popANI and conANI between sample pairs

    • Identifies shared strains (default: 99.999% popANI threshold)

    • Generates comparison matrices for downstream analysis

  7. Data aggregation for visualization:

    • Combines profile results from all samples

    • Integrates with GTDB taxonomy and sample metadata

    • Prepares RDS files for INTERACTIVE_STRAIN module

Parameters#

${launchDir} is the directory where you execute metaFun, and utilized as output base directory.

Parameter

Description

Default Value

Note

-i, --input_dir

Input directory containing filtered reads

Required

Output from RAWREAD_QC workflow

--phyloseq_object

Phyloseq RDS file from WMS_TAXONOMY

Required

Path to phyloseq object for prevalence filtering

-m, --metadata

Path to metadata file

Optional

CSV file; if not provided, extracts from phyloseq

-s, --sampleIDcolumn

Column number for sample IDs in metadata

1

Matches sample IDs in read filenames

--prevalence_threshold

Minimum % of samples for prevalence

5

Taxa must be present in this % of samples

--min_abundance

Minimum relative abundance threshold

0.001

Filter taxa below 0.1% abundance

--min_coverage

Minimum coverage for InStrain

5

Higher values = more confident calls

--min_freq

Minimum SNP frequency threshold

0.05

Filter low-frequency variants

--min_read_ani

ANI threshold for read assignment

0.92

0.92=strain, 0.95=species, 0.99=clonal

-p, --cpus

Number of CPUs to use

Auto-detected

Adjust based on your system capabilities

-o, --outdir

Output directory

"${launchDir}/results/metagenome/WMS_STRAIN"

Where results will be saved

Inputs and Outputs#

Inputs#

  • Quality-controlled paired-end metagenomic reads (output from RAWREAD_QC workflow)

  • Phyloseq object from WMS_TAXONOMY (contains taxonomy and sample metadata)

  • Optional: External metadata file (CSV format) to replace phyloseq metadata

Outputs#

  • Prevalent taxa list with GTDB genome matches

  • InStrain profile results for each sample

  • Nucleotide diversity metrics per genome

  • pN/pS ratio tables (gene-level and genome-wide)

  • Strain comparison matrices (popANI, conANI)

  • Preprocessed RDS files for INTERACTIVE_STRAIN visualization

Output directory structure#

The output is organized in the following directory structure:

Output directory structure#
${launchDir}/results/metagenome/WMS_STRAIN/
├── 01_prevalent_taxa/                    # Prevalence filtering results   ├── prevalent_taxa_taxonomy_ids.txt   # Filtered taxa IDs   ├── prevalent_taxa_metadata.tsv       # GTDB metadata for prevalent taxa   ├── prevalent_taxa_genome_paths.txt   # Paths to GTDB genomes   ├── prevalence_summary.tsv            # Prevalence statistics   └── sample_metadata.csv               # Extracted/provided sample metadata
├── 02_genome_prep/                       # Genome preparation   ├── all_genomes_combined.fa           # Concatenated reference genomes   ├── prevalent_taxa.stb               # Scaffold-to-bin mapping   └── bowtie2_index/                    # Bowtie2 index files
├── 03_gene_annotation/                   # Gene predictions and annotations   ├── genes.faa                         # Protein sequences   ├── genes.fna                         # Gene sequences   ├── genes.gff                         # Gene annotations (GFF format)   └── eggnog_results.emapper.annotations # eggNOG functional annotations
├── 04_bam_files/                         # Read mapping results   ├── ${sample_id}.sorted.bam           # Sorted BAM files   └── ${sample_id}.sorted.bam.bai       # BAM index files
├── 05_instrain_profiles/                 # InStrain profile results   ├── ${sample_id}_instrain_profile/    # Sample-specific profile directory      ├── output/                       # Main output files         ├── ${sample_id}_genome_info.tsv    # Genome-level metrics         ├── ${sample_id}_gene_info.tsv      # Gene-level metrics         └── ${sample_id}_scaffold_info.tsv  # Scaffold metrics      └── raw_data/
│          └── genes_SNP_count.csv.gz    # SNP counts for pN/pS calculation   └── validation_summary.txt            # Profile validation results
├── 06_instrain_compare/                  # InStrain comparison results   └── instrainComparer_output/
│       └── output/
│           ├── comparisonsTable.tsv      # Pairwise sample comparisons           └── genomeWide_compare.tsv    # popANI/conANI metrics
└── 07_shiny_data/                        # Preprocessed data for visualization
    ├── integrated_microbiome_data.rds    # Combined R data object
    ├── pN_pS_gene_level.rds              # Gene-level pN/pS data
    ├── pN_pS_genome_wide.rds             # Genome-wide pN/pS data
    └── eggnog_annotations_subset.rds     # Filtered eggNOG annotations

Key Metrics Explained#

Nucleotide Diversity (π)#

Nucleotide diversity measures within-population genetic variation using the Nei and Li (1979) method:

Formula: π = 1 - Σ(frequency of each base)²

InStrain calculates π at every genomic position with sufficient coverage (≥5x by default) and averages across genes/genomes. This metric is robust to coverage differences between samples.

  • High π: Indicates diverse strain population with many SNVs

  • Low π: Indicates homogeneous population or recent selective sweep

pN/pS Ratio#

The ratio of nonsynonymous to synonymous substitutions, indicating selection pressure:

  • pN/pS < 1: Purifying (negative) selection - deleterious mutations removed

  • pN/pS ≈ 1: Neutral evolution - no selective pressure

  • pN/pS > 1: Positive selection - beneficial mutations favored

Population ANI (popANI) and Consensus ANI (conANI)#

InStrain calculates two ANI metrics for strain comparison:

  • popANI (population-level ANI): Considers both major and minor alleles, accounting for within-sample microdiversity. A popANI substitution is called only when no alleles are shared between samples.

  • conANI (consensus ANI): Compares only consensus (major) alleles between samples. A conANI substitution is called when samples have different major alleles.

The default threshold of 99.999% popANI is recommended for identifying shared strains. For more details, see the InStrain documentation.

Nextflow Processes in WMS_STRAIN Module#

Process

InputDir

OutputDir

Note

prevalence_filter_phyloseq

Phyloseq RDS

01_prevalent_taxa

Filters taxa by prevalence

concat_genomes

GTDB genome paths

02_genome_prep

Concatenates reference genomes

generate_stb

GTDB genome paths

02_genome_prep

Creates scaffold-to-bin mapping

bowtie2_build

Combined FASTA

02_genome_prep/bowtie2_index

Builds Bowtie2 index

prodigal_per_genome

Individual genomes

Work directory

Predicts genes (parallelized)

concat_gene_predictions

Prodigal outputs

03_gene_annotation

Concatenates gene predictions

eggnog_mapper

Protein sequences

03_gene_annotation

Functional annotation

bowtie2_mapping

Filtered reads

04_bam_files

Maps reads to reference

instrain_profile

BAM files

05_instrain_profiles

Profiles strain-level variation

validate_instrain_profiles

Profile directories

05_instrain_profiles

Validates profiles

instrain_compare

Valid profiles

06_instrain_compare

Compares strains across samples

aggregate_for_shiny

All outputs

07_shiny_data

Prepares data for visualization

Tools Used in WMS_STRAIN#

Tool

Purpose

Version

Default parameters

Parameters that can be selected

Bowtie2

Read mapping

2.5+

--sensitive-local --no-unal

--threads ${task.cpus}

InStrain

Strain profiling

1.8+

--min_cov 5, --database_mode

-p ${task.cpus}, --min_read_ani

Prodigal

Gene prediction

2.6.3

-p single (per genome)

N/A

eggNOG-mapper

Functional annotation

2.1+

-m mmseqs

--cpu ${task.cpus}

samtools

BAM processing

1.17+

-@ ${task.cpus}

N/A

Usage Notes#

  • The WMS_STRAIN module requires a phyloseq object from WMS_TAXONOMY for prevalence filtering.

  • Prevalent taxa are automatically matched to GTDB genomes for reference-based analysis.

  • Metadata can be extracted from the phyloseq object or provided separately via -m parameter.

  • InStrain requires significant computational resources. CPU allocation is auto-detected but can be adjusted with -p.

  • Higher coverage samples provide more reliable strain-level metrics. Samples with average coverage < 5x may have limited strain resolution.

  • The module automatically validates InStrain profiles and skips comparison if fewer than 2 valid profiles exist.

  • For optimal results, use appropriate --prevalence_threshold to focus on abundant, prevalent taxa.

Next Steps#

After running WMS_STRAIN, you can:

  1. Explore the results interactively using the INTERACTIVE_STRAIN module:

    (metafun) metafun -module INTERACTIVE_STRAIN -i results/metagenome/WMS_STRAIN/07_shiny_data
    
  2. Perform deeper analysis by:

    • Examining nucleotide diversity patterns across conditions

    • Analyzing pN/pS ratios to identify genes under selection (COG functional categories)

    • Comparing strain populations between sample groups using popANI thresholds

    • Correlating strain-level metrics with metadata variables

The WMS_STRAIN module provides comprehensive insights into strain-level diversity within microbial communities, enabling researchers to understand not just which species are present, but the genetic variation within those species.