2. Beginner's Guide to metaFun#
This guide provides detailed information about metaFun’s modules, their inputs, outputs, and how they interact with each other in the workflow.
2.1. Understanding metaFun Modules#
metaFun is organized into modules that can be run independently or as part of a workflow. Each module has specific inputs and outputs designed to work seamlessly together.
2.1.1. Module Relationships and Data Flow#
2.1.1.1. Genome-based analysis path:#
RAWREAD_QC → ASSEMBLY_BINNING → BIN_ASSESSMENT → GENOME_SELECTOR → COMPARATIVE_ANNOTATION → INTERACTIVE_COMPARATIVE
2.1.1.2. Read-based analysis path:#
RAWREAD_QC → WMS_TAXONOMY → INTERACTIVE_TAXONOMY
RAWREAD_QC → WMS_FUNCTION
RAWREAD_QC → WMS_TAXONOMY → WMS_STRAIN → INTERACTIVE_STRAIN
WMS_TAXONOMY → INTERACTIVE_NETWORK
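The paths above can be read as a small dependency graph. The following is a minimal sketch, not part of metaFun itself, that encodes those edges in a dictionary and lists which modules need to run before a given target. The names `DOWNSTREAM` and `upstream_of` are hypothetical helpers introduced here for illustration.

```python
# Edges copied from the data-flow paths listed above: module -> modules it feeds.
DOWNSTREAM = {
    "RAWREAD_QC": ["ASSEMBLY_BINNING", "WMS_TAXONOMY", "WMS_FUNCTION"],
    "ASSEMBLY_BINNING": ["BIN_ASSESSMENT"],
    "BIN_ASSESSMENT": ["GENOME_SELECTOR"],
    "GENOME_SELECTOR": ["COMPARATIVE_ANNOTATION"],
    "COMPARATIVE_ANNOTATION": ["INTERACTIVE_COMPARATIVE"],
    "WMS_TAXONOMY": ["INTERACTIVE_TAXONOMY", "WMS_STRAIN", "INTERACTIVE_NETWORK"],
    "WMS_STRAIN": ["INTERACTIVE_STRAIN"],
}


def upstream_of(target):
    """Return the modules that must run before `target`, in run order."""
    # Invert the graph so each module maps to the modules that feed it.
    parents = {}
    for src, dsts in DOWNSTREAM.items():
        for dst in dsts:
            parents.setdefault(dst, []).append(src)

    order = []

    def visit(module):
        for parent in parents.get(module, []):
            if parent not in order:
                visit(parent)          # resolve the parent's own prerequisites first
                order.append(parent)

    visit(target)
    return order


print(upstream_of("INTERACTIVE_STRAIN"))
```

For `INTERACTIVE_STRAIN` this walks back through WMS_STRAIN and WMS_TAXONOMY to RAWREAD_QC, matching the third read-based path.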
2.2. Module Details#
2.2.1. RAWREAD_QC#
Purpose: Quality control of raw reads and host genome filtering
Inputs:
Raw paired-end reads (FASTQ format)
Optional: Custom host genome for filtering
Outputs:
Filtered reads in the read_filtered/ directory
Quality reports in the qc_reports/ directory
Key Parameters:
-i, --inputDir: Directory containing raw read files (required)
-f, --filter: Host genome to filter (default: “human”)
-o, --output: Output directory (default: “results/metagenome/RAWREAD_QC”)
-p, --processors: Number of CPUs to use (default: 4)
Workflow Notes:
This is the starting point for both genome-based and read-based analyses
Output files are automatically used by subsequent modules
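Before moving on to a downstream module, it can help to confirm that the expected output directories exist. The sketch below assumes that read_filtered/ and qc_reports/ sit directly under the RAWREAD_QC output directory; that layout is an assumption, and `missing_outputs` is a hypothetical helper, not a metaFun function.

```python
from pathlib import Path

# Output subdirectories named in this guide for RAWREAD_QC.
EXPECTED_SUBDIRS = ("read_filtered", "qc_reports")


def missing_outputs(output_dir="results/metagenome/RAWREAD_QC",
                    expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectories that are absent under output_dir."""
    root = Path(output_dir)
    # Assumption: the subdirectories live directly under the output directory.
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    absent = missing_outputs()
    if absent:
        print("Missing RAWREAD_QC outputs:", ", ".join(absent))
    else:
        print("RAWREAD_QC outputs look complete.")
```

The same check can be reused for other modules by passing their output directory and the subdirectory names listed in their Outputs section.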
2.2.2. ASSEMBLY_BINNING#
Purpose: Metagenome assembly and binning
Inputs:
Filtered reads from RAWREAD_QC (or specified input directory)
Outputs:
Assembled contigs in the assembly/ directory
Metagenomic bins in the final_bins/ directory
Assembly statistics in the stats/ directory
Key Parameters:
-i, --inputDir: Input directory (optional if following RAWREAD_QC)
-o, --output: Output directory (default: “results/metagenome/ASSEMBLY_BINNING”)
-p, --processors: Number of CPUs to use (default: 8)
--megahit_presets: Preset for MEGAHIT assembler (default: “default”)
--semibin2_mode: SemiBin2 binning mode (default: “self”)
Workflow Notes:
Takes filtered reads from RAWREAD_QC as input
Creates metagenomic bins for BIN_ASSESSMENT module
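When scripting several runs, the documented options can be assembled into an argument list programmatically. This is only a sketch: the flags are the ones listed above, but the launcher `metafun -module` is an assumption that does not come from this guide, so confirm the real invocation with the module's -h output before use. `build_command` is a hypothetical helper.

```python
import shlex


def build_command(module, options, launcher=("metafun", "-module")):
    """Build an argument list for one module run.

    `launcher` is an ASSUMED entry point, not confirmed by this guide.
    `options` maps a flag to its value; None means a flag with no value.
    """
    args = [*launcher, module]
    for flag, value in options.items():
        args.append(flag)
        if value is not None:
            args.append(str(value))
    return args


cmd = build_command(
    "ASSEMBLY_BINNING",
    {"-p": 8, "--megahit_presets": "default", "--semibin2_mode": "self"},
)
# Print a shell-quoted form for inspection; nothing is executed here.
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids quoting problems if it is later passed to `subprocess.run`.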
2.2.3. BIN_ASSESSMENT#
Purpose: Assess genome quality and assign taxonomic classification
Inputs:
Metagenomic bins from ASSEMBLY_BINNING
Metadata file with accession information
Outputs:
Quality assessment reports in the quality/ directory
Taxonomy classification in the taxonomy/ directory
Combined metadata CSV file with quality and taxonomy information
Key Parameters:
-m, --metadata: Metadata file with accession information (required)
-c, --accession_column: Column in metadata containing accession information (required)
-i, --inputDir: Input directory (optional if following ASSEMBLY_BINNING)
-o, --output: Output directory (default: “results/metagenome/BIN_ASSESSMENT”)
-p, --processors: Number of CPUs to use (default: 20)
Workflow Notes:
Critical for determining which genomes meet quality standards
Output is used by GENOME_SELECTOR module
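Because both the metadata file and the accession column are required, a quick pre-flight check can catch a misspelled column name before a long run starts. The sketch below assumes the metadata is a comma-separated file with a header row; this guide does not state the format, so adjust the delimiter if yours differs. `check_accession_column` and the sample column names are hypothetical.

```python
import csv
import io


def check_accession_column(handle, column):
    """Summarise the accession column of a metadata table.

    Assumes comma-separated text with a header row (an assumption,
    not a documented metaFun requirement).
    """
    reader = csv.DictReader(handle)
    if column not in (reader.fieldnames or []):
        raise ValueError(f"column {column!r} not found in {reader.fieldnames}")
    values = [(row[column] or "").strip() for row in reader]
    return {
        "rows": len(values),
        "empty": sum(1 for v in values if not v),
        "duplicates": len(values) - len(set(values)),
    }


# Illustrative data only; the accessions and columns are made up.
sample = io.StringIO("accession,host\nGCA_1,human\nGCA_2,mouse\nGCA_1,human\n")
print(check_accession_column(sample, "accession"))
```

For a real file, open it with `open(path, newline="")` and pass the handle together with the value you intend to give to -c.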
2.2.4. GENOME_SELECTOR#
Purpose: Interactive genome selection interface
Inputs:
Combined metadata CSV file from BIN_ASSESSMENT
Outputs:
Selected genome list in genome_selector_result.csv
Key Parameters:
-i, --input: Input metadata file (default: most recent BIN_ASSESSMENT output)
--port: Port for web interface (default: 8050)
Workflow Notes:
Interactive web interface for selecting genomes
Output is used by COMPARATIVE_ANNOTATION module
2.2.5. COMPARATIVE_ANNOTATION#
Purpose: Comparative genomic analysis and annotation
Inputs:
Selected genomes from GENOME_SELECTOR
Metadata file with sample and analysis information
Outputs:
Pangenome analysis results in the pangenome/ directory
Functional annotations (KO, CAZy, VFDB, etc.) in their respective directories
Visualization data for interactive analysis
Key Parameters:
-m, --metadata: Metadata file (required)
--samplecol: Column in metadata containing sample IDs (required)
-i, --inputDir: Input directory with genomes (optional)
-o, --output: Output directory (default: based on run date/time)
--metacol: Column for analysis grouping (optional for static plots)
-p, --processors: Number of CPUs to use (default: 40)
Workflow Notes:
Comprehensive comparative genomics pipeline
Two main modes: annotation-only or annotation with static plots
Prepares data for INTERACTIVE_COMPARATIVE module
2.2.6. INTERACTIVE_COMPARATIVE#
Purpose: Interactive visualization of comparative genomics data
Inputs:
Annotation results from COMPARATIVE_ANNOTATION
Outputs:
Interactive visualization through Shiny web interface
Key Parameters:
-i, --inputDir: Input directory with COMPARATIVE_ANNOTATION results (required)
-m, --metadata: Metadata file (required)
-o, --output: Output directory for any saved results (default: “results/interactive_comparative”)
Workflow Notes:
Provides interactive dashboards for exploring comparative genomics data
No further modules depend on its output
2.2.7. WMS_TAXONOMY#
Purpose: Taxonomic profiling of metagenomic reads
Inputs:
Filtered reads from RAWREAD_QC (or specified input directory)
Metadata file with sample information
Outputs:
Taxonomic profiles in the profiles/ directory
Visualization data in the visualization/ directory
Phyloseq objects for statistical analysis in the phyloseq/ directory
Key Parameters:
-m, --metadata: Metadata file (required)
-s, --sampleIDcolumn: Column in metadata with sample IDs (required)
-i, --inputDir: Input directory (optional if following RAWREAD_QC)
--profiler: Taxonomic profiler to use, either “sylph” or “kraken2” (default: “sylph”)
-a, --analysiscolumn: Column for analysis grouping (optional)
-o, --output: Output directory (default: “results/metagenome/WMS_TAXONOMY”)
-p, --processors: Number of CPUs to use (default: 4)
Workflow Notes:
Takes filtered reads from RAWREAD_QC as input
Output is used by INTERACTIVE_TAXONOMY module
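To make the required and optional options concrete, the sketch below mirrors the documented WMS_TAXONOMY parameters in a small `argparse` parser. It is an illustration of the interface described above, not metaFun's own argument handling; the function name `taxonomy_parser` and the file names in the example are hypothetical.

```python
import argparse


def taxonomy_parser():
    """Mirror the documented WMS_TAXONOMY options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="WMS_TAXONOMY-sketch")
    parser.add_argument("-m", "--metadata", required=True)
    parser.add_argument("-s", "--sampleIDcolumn", required=True)
    parser.add_argument("-i", "--inputDir")  # optional if following RAWREAD_QC
    parser.add_argument("--profiler", choices=["sylph", "kraken2"], default="sylph")
    parser.add_argument("-a", "--analysiscolumn")
    parser.add_argument("-o", "--output", default="results/metagenome/WMS_TAXONOMY")
    parser.add_argument("-p", "--processors", type=int, default=4)
    return parser


# Only the two required options are given; the rest fall back to defaults.
args = taxonomy_parser().parse_args(["-m", "metadata.csv", "-s", "SampleID"])
print(args.profiler, args.processors, args.output)
```

Passing an unsupported value such as `--profiler other` makes the parser exit with an error, which reflects the two profiler choices listed above.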
2.2.8. INTERACTIVE_TAXONOMY#
Purpose: Interactive exploration of taxonomic profiles
Inputs:
Phyloseq objects from WMS_TAXONOMY
Outputs:
Interactive visualization through Shiny web interface
Key Parameters:
-i, --inputDir: Input directory with phyloseq objects (required)
-o, --output: Output directory for any saved results (default: “results/interactive_taxonomy”)
Workflow Notes:
Provides interactive dashboards for exploring taxonomic data
No further modules depend on its output
2.2.9. WMS_FUNCTION#
Purpose: Functional analysis of metagenomic reads
Inputs:
Filtered reads from RAWREAD_QC (or specified input directory)
Metadata file with sample information
Outputs:
Functional annotations in the annotations/ directory
Pathway analysis in the pathways/ directory
Visualization data in the visualization/ directory
Key Parameters:
-m, --metadata: Metadata file (required)
-s, --sampleIDcolumn: Column in metadata with sample IDs (required)
-a, --analysiscolumn: Column for analysis grouping (required)
-i, --inputDir: Input directory (optional if following RAWREAD_QC)
-o, --output: Output directory (default: “results/metagenome/WMS_FUNCTION”)
-p, --processors: Number of CPUs to use (default: 36)
Workflow Notes:
Takes filtered reads from RAWREAD_QC as input
Final module in the read-based functional analysis path
2.2.10. WMS_STRAIN#
Purpose: Strain-level microdiversity analysis using InStrain
Inputs:
Filtered reads from RAWREAD_QC
Phyloseq object from WMS_TAXONOMY (for selecting prevalent taxa)
Outputs:
Nucleotide diversity metrics per genome in the instrain_profiles/ directory
pN/pS ratio tables for selection pressure analysis
Strain comparison matrices (popANI, conANI)
Preprocessed RDS files for INTERACTIVE_STRAIN
Key Parameters:
-i, --input_dir: Input directory containing filtered reads (required)
--phyloseq_object: Phyloseq RDS file from WMS_TAXONOMY (required)
-m, --metadata: Path to metadata file (optional, extracted from phyloseq if not provided)
-s, --sampleIDcolumn: Column number for sample IDs (default: 1)
--prevalence_threshold: Minimum % of samples for prevalence filtering (default: 5)
--min_abundance: Minimum relative abundance threshold (default: 0.001)
-p, --cpus: Number of CPUs to use
Workflow Notes:
Requires phyloseq output from WMS_TAXONOMY for selecting prevalent taxa
Output is used by INTERACTIVE_STRAIN module
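The two filtering parameters can be pictured with a small example. This guide does not spell out how WMS_STRAIN combines them, so the sketch below makes two assumptions: a taxon counts as present in a sample when its relative abundance is at least `min_abundance` (treated as a fraction), and it is kept when it is present in at least `prevalence_threshold` percent of samples. `prevalent_taxa` and the toy table are hypothetical.

```python
def prevalent_taxa(abundance, prevalence_threshold=5.0, min_abundance=0.001):
    """Select taxa by prevalence.

    abundance: dict mapping taxon name -> list of relative abundances,
               one value per sample.
    Assumed rule (not confirmed by the guide): present if value >= min_abundance,
    kept if present in >= prevalence_threshold percent of samples.
    """
    selected = []
    for taxon, values in abundance.items():
        if not values:
            continue  # no samples, nothing to evaluate
        present = sum(1 for v in values if v >= min_abundance)
        prevalence = 100.0 * present / len(values)
        if prevalence >= prevalence_threshold:
            selected.append(taxon)
    return selected


# Toy data: three taxa across four samples.
table = {
    "taxon_A": [0.2, 0.0, 0.1, 0.0],      # present in 2 of 4 samples
    "taxon_B": [0.0005, 0.0, 0.0, 0.0],   # always below min_abundance
    "taxon_C": [0.0, 0.0, 0.0, 0.0],      # never detected
}
print(prevalent_taxa(table))
```

Raising the prevalence threshold shrinks the set of genomes profiled, which in turn reduces run time for the strain-level step.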
2.2.11. INTERACTIVE_STRAIN#
Purpose: Interactive exploration of strain-level diversity results
Inputs:
Results directory from WMS_STRAIN containing InStrain profiles
Outputs:
Interactive visualization through Shiny web interface
Exported figures and statistical analysis results
Key Parameters:
-i, --input: Input directory with WMS_STRAIN results (required)
-m, --metadata: Additional metadata file (optional)
-p, --port: Port number for web interface (default: 8050)
Workflow Notes:
Provides interactive dashboards for exploring nucleotide diversity, pN/pS ratios, and strain sharing
No further modules depend on its output
2.2.12. INTERACTIVE_NETWORK#
Purpose: Interactive microbial co-occurrence network analysis
Inputs:
Phyloseq RDS object from WMS_TAXONOMY
Outputs:
Interactive network visualizations
Network comparison statistics across sample groups
Node centrality and influential taxa rankings
Exportable figures and data tables
Key Parameters:
-i, --input: Phyloseq RDS file from WMS_TAXONOMY (required)
-p, --port: Port number for web interface (default: 8050)
Workflow Notes:
Supports network inference using FastSpar (SparCC) or FlashWeave
Enables group-wise network comparison for different conditions
No further modules depend on its output
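GENOME_SELECTOR, INTERACTIVE_STRAIN, and INTERACTIVE_NETWORK all list 8050 as their default port, so two interfaces started on the same machine may collide. The sketch below checks whether a port can be bound before launching; it assumes the interface listens on the local loopback address, which this guide does not state. `port_is_free` is a hypothetical helper.

```python
import socket


def port_is_free(port=8050, host="127.0.0.1"):
    """Return True if `port` can currently be bound on `host`.

    `host` defaults to loopback, which is an assumption about where
    the web interface listens.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # something else already holds the port
    return True


if __name__ == "__main__":
    if port_is_free():
        print("Port 8050 is available.")
    else:
        print("Port 8050 is in use; pass a different port number.")
```

If the default is taken, choose another value with the module's port option.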
2.3. Tips for Beginners#
Start with RAWREAD_QC: This is the foundation for all workflows.
Follow a single path: Choose either the genome-based or read-based path based on your research question:
Genome-based: For comparative genomics, pangenome analysis, and functional annotation of genomes
Read-based: For direct taxonomic and functional profiling of metagenomic data
Understand data dependencies:
Each module automatically looks for the output of the previous module
You only need to specify an input directory if you’re not following the standard workflow
Metadata files must be manually specified for each module that requires them
Resource management:
Adjust the -p parameter to match your system’s capabilities
Large datasets may require more memory, especially for assembly and annotation
Troubleshooting:
Check log files in each module’s output directory
Use the -h option to get detailed help for each module
Resume interrupted workflows with the same parameters
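As a concrete take on the resource-management tip, the sketch below compares each module's documented default CPU count with the cores available on the current machine and suggests a value for -p. The defaults are the ones listed in Module Details; capping at the available core count is a common-sense rule chosen here, not a metaFun recommendation, and `suggested_processors` is a hypothetical helper.

```python
import os

# Default CPU counts as listed in the Key Parameters of each module.
DEFAULT_PROCESSORS = {
    "RAWREAD_QC": 4,
    "ASSEMBLY_BINNING": 8,
    "BIN_ASSESSMENT": 20,
    "COMPARATIVE_ANNOTATION": 40,
    "WMS_TAXONOMY": 4,
    "WMS_FUNCTION": 36,
}


def suggested_processors(module, available=None):
    """Return the documented default, capped at the available cores."""
    if available is None:
        available = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(DEFAULT_PROCESSORS[module], available))


if __name__ == "__main__":
    for name in DEFAULT_PROCESSORS:
        print(f"{name}: -p {suggested_processors(name)}")
```

On a 16-core workstation, for example, this would suggest 16 rather than the default 40 for COMPARATIVE_ANNOTATION.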