COMPARATIVE_ANNOTATION#

This module is part of the metaFun pipeline and is designed for comparative genomic analysis and functional annotation of metagenome-assembled genomes (MAGs).

Overview#

The COMPARATIVE_ANNOTATION module analyzes genome functions and compares them across samples or conditions. It performs pangenome analysis, annotates genes functionally, and generates visualizations for comparative genomics. The module identifies core, accessory, and unique genes and annotates them against functional databases including KEGG, VFDB, CARD, CAZymes, and eggNOG.

A key output of this module is the integrated sequence database (HDF5 format), which serves as the foundation for interactive exploration in the INTERACTIVE_COMPARATIVE module. This database links all annotations, gene sequences, and metadata, enabling powerful on-demand analyses and visualization.

Module Execution#

# Basic usage with only annotation (recommended)
(metafun) metafun -module COMPARATIVE_ANNOTATION -i genomes/ -m metadata.csv --samplecol 1

# Include metabolic pathway analysis with visualization (use this if you want static plots)
(metafun) metafun -module COMPARATIVE_ANNOTATION -i genomes/ -m metadata.csv --samplecol 1 --metacol 2

# Customize parameters for specific analysis
(metafun) metafun -module COMPARATIVE_ANNOTATION -i genomes/ -m metadata.csv --samplecol 1 --pan_identity 0.8 --pan_coverage 0.8

Recommended workflow

For better performance and flexibility, it is recommended to use a two-step approach:

  1. First, run COMPARATIVE_ANNOTATION with annotation only (specify --samplecol but not --metacol)

  2. Then explore the results interactively using metafun -module INTERACTIVE_COMPARATIVE -i results/path

This approach avoids generating large numbers of static plots that may not be useful, and the interactive module allows more dynamic exploration of your data.

Module Operation Sequence#

This module performs the following steps:

  1. Genome preparation: Processing input genomes and metadata

  2. Gene prediction: Using Prokka to predict genes in each genome

  3. Pangenome analysis: Using PPanGGOLiN to identify core and accessory genes

  4. Functional annotation:

    • KEGG Orthology annotation with KofamScan

    • Virulence factor detection using VFDB

    • Antibiotic resistance gene identification using CARD

    • Carbohydrate-active enzyme annotation with dbCAN

    • Protein function prediction using eggNOG-mapper

  5. Comparative analysis:

    • Genome similarity calculation using skani

    • Genome dereplication using dRep

    • Gene presence/absence clustering and visualization

  6. Sequence database creation: Building an integrated HDF5 database that links all annotations, sequences, and metadata (critical for INTERACTIVE_COMPARATIVE)

  7. Visualization: Generation of interactive plots and heatmaps for all annotations

  8. Statistical analysis: Using Scoary2 for gene-trait associations
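To make the core/accessory/unique distinction from the pangenome step concrete, here is a minimal sketch that labels gene families by how many genomes carry them. It is a plain frequency rule written for illustration: the function name, the `core_fraction` parameter, and the toy data are assumptions, and PPanGGOLiN's own partitioning is not reproduced here.

```python
def classify_gene_families(presence, n_genomes, core_fraction=1.0):
    """Label each gene family as core, accessory, or unique.

    presence: dict mapping family name -> set of genome IDs that carry it.
    core_fraction: hypothetical cutoff; 1.0 means "present in every genome".
    """
    labels = {}
    for family, genomes in presence.items():
        count = len(genomes)
        if count >= core_fraction * n_genomes:
            labels[family] = "core"
        elif count == 1:
            labels[family] = "unique"
        else:
            labels[family] = "accessory"
    return labels


# Toy presence data for three genomes (illustrative only).
presence = {
    "famA": {"g1", "g2", "g3"},
    "famB": {"g1", "g2"},
    "famC": {"g3"},
}
labels = classify_gene_families(presence, n_genomes=3)
print(labels)
```

Lowering `core_fraction` (for example to 0.95) gives a "soft core" that tolerates incomplete MAGs, which may matter when genomes are not fully assembled.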

Parameters#

${launchDir} is the directory from which you execute metaFun; it is also used as the base directory for output.

| Parameter | Description | Default Value | Note |
| --- | --- | --- | --- |
| -i, --inputDir | Input directory containing genome files | ${launchDir}/results/metagenome/BIN_ASSESSMENT/bins_quality_passed | Can also use genome_selector_result.csv output genomes |
| -m, --metadata | Path to metadata file | Required | CSV or TSV file with sample information |
| --samplecol | Column in metadata with sample identifiers | Required | Matches sample IDs in genome filenames |
| --metacol | Column in metadata for statistical analysis | Optional | If not specified, only annotation is performed |
| -p, --processors | Number of CPUs to use | 40 | Adjust based on your system capabilities |
| --module_completeness | KEGG module completeness threshold | 0.5 | Fraction of KOs needed to consider a module complete |
| --pan_identity | PPanGGOLiN identity threshold | 0.8 | Sequence identity for gene clustering |
| --pan_coverage | PPanGGOLiN coverage threshold | 0.8 | Sequence coverage for gene clustering |
| --kingdom | Kingdom for annotation | bacteria | Options: bacteria, archaea |
| --kofamscan_eval | KEGG KO e-value threshold | 0.00001 | Threshold for KofamScan matches |
| --VFDB_identity | VFDB identity threshold | 50 | Percentage identity for virulence factors |
| --VFDB_coverage | VFDB coverage threshold | 80 | Percentage coverage for virulence factors |
| --VFDB_e_value | VFDB e-value threshold | 1e-10 | E-value threshold for virulence factors |
| --CAZyme_hmm_eval | CAZyme HMM e-value threshold | 1e-15 | E-value threshold for CAZyme detection |
| --CAZyme_hmm_cov | CAZyme HMM coverage threshold | 0.35 | Coverage threshold for CAZyme detection |
| --run_drep | Whether to run dRep | true | Set to false to skip dereplication |
| --drep_ani | dRep ANI threshold | 0.995 | Average nucleotide identity threshold for subspecies level dereplication |
| --drep_cov | dRep coverage threshold | 0.3 | Genome coverage threshold |
| --drep_algorithm | dRep algorithm | skani | Algorithm for ANI calculation |
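The --module_completeness note ("fraction of KOs needed to consider a module complete") can be read as a simple ratio. The sketch below shows that reading under two assumptions: a module is treated as a flat list of KOs (real KEGG module definitions include alternatives and sub-steps), and the comparison is inclusive (>=). The KO identifiers are placeholders, not a real module definition.

```python
def module_completeness(module_kos, genome_kos):
    """Fraction of a module's KOs that are present in one genome."""
    required = set(module_kos)
    if not required:
        return 0.0
    return len(required & set(genome_kos)) / len(required)


def is_module_complete(module_kos, genome_kos, threshold=0.5):
    """Apply a completeness cutoff (assumed inclusive)."""
    return module_completeness(module_kos, genome_kos) >= threshold


# Placeholder KO IDs for a hypothetical four-step module.
module = ["K00001", "K00002", "K00003", "K00004"]
genome = {"K00001", "K00002", "K09999"}

fraction = module_completeness(module, genome)
print(fraction, is_module_complete(module, genome))
```

With the default of 0.5, this genome counts the module as complete because two of four KOs are present; raising the threshold to 0.75 would exclude it.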

Inputs and Outputs#

Inputs#

  • Genome FASTA files (output folder from BIN_ASSESSMENT)

  • Metadata file (CSV or TSV format) with sample information and conditions (for example, the metadata of genomes selected by GENOME_SELECTOR)
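Before a run, it can help to check that the identifiers in the sample column actually match the genome filenames. The following is a minimal sketch of such a check, assuming a header row in the metadata, a 1-based column number as used by --samplecol, and prefix matching on filenames; the helper names and the example data are hypothetical and not part of metaFun.

```python
import csv
import io


def read_sample_ids(metadata_text, samplecol, delimiter=","):
    """Return the values of the 1-based column `samplecol`."""
    rows = csv.reader(io.StringIO(metadata_text), delimiter=delimiter)
    next(rows, None)  # assumption: the first row is a header
    return [row[samplecol - 1] for row in rows if row]


def match_genomes(sample_ids, filenames):
    """Map each sample ID to the genome filenames that start with it."""
    return {sid: sorted(f for f in filenames if f.startswith(sid))
            for sid in sample_ids}


# Illustrative metadata and genome filenames.
metadata = "sample,condition\nS01,healthy\nS02,disease\nS04,disease\n"
files = ["S01.fa", "S02_bin3.fa", "S03.fa"]

ids = read_sample_ids(metadata, samplecol=1)
matches = match_genomes(ids, files)
missing = [sid for sid, hits in matches.items() if not hits]
print(matches)
print("samples without a genome:", missing)
```

Plain prefix matching is ambiguous when one ID is a prefix of another (S1 also matches S10.fa), so zero-padded or delimiter-terminated IDs are safer. Pass delimiter="\t" for TSV metadata.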

Outputs#

  • Annotated genes for each genome

  • Pangenome analysis results

  • Functional annotations (KEGG, VFDB, CARD, CAZymes, eggNOG)

  • Genome similarity matrix and dereplication results

  • Comparative visualizations (static and interactive)

  • Gene-trait associations results

  • Integrated sequence database (HDF5) - Critical for INTERACTIVE_COMPARATIVE module

Output directory structure#

The output is organized in a timestamped directory under ${launchDir}/results/metagenome/COMPARATIVE_ANNOTATION/:

${launchDir}/results/metagenome/COMPARATIVE_ANNOTATION/YYYYMMDDHHMMSS/
├── selected_genomes/                     # Processed input genomes
├── prokka/                               # Prokka gene predictions
│   ├── [genome1]/
│   │   ├── [genome1].ffn                 # Nucleotide sequences
│   │   ├── [genome1].faa                 # Protein sequences
│   │   ├── [genome1].gff                 # Genome annotations
│   │   └── ...
│   ├── [genome2]/
│   └── ...
├── ppanggolin_result/                    # Pangenome analysis results
│   ├── pangenome.h5                      # Pangenome database
│   ├── gene_presence_absence.Rtab        # Gene presence/absence matrix
│   ├── gene_count_matrix.tsv             # Gene count matrix
│   ├── gene_families.tsv                 # Gene family information
│   └── ...
├── annotation_results/                   # Annotation results for all tools
│   ├── kofamscan/                        # KEGG Orthology annotations
│   │   ├── ko_matrix.csv                 # KO presence/absence matrix
│   │   └── KO_definition_GeneID_countgenomes.csv # KO definitions
│   ├── VFDB/                             # Virulence factor annotations
│   │   ├── pangene_vfdb_result.txt       # Raw VFDB results
│   │   ├── gene_PA_VFDB_added.csv        # Virulence factor presence/absence
│   │   └── gene_count_VFDB_added.csv     # Virulence factor counts
│   ├── CARD/                             # Antibiotic resistance annotations
│   │   ├── pangene_rgi_CARD_result.txt   # Raw RGI results
│   │   ├── gene_PA_CARD_added.csv        # ARG presence/absence
│   │   └── gene_count_CARD_added.csv     # ARG counts
│   ├── dbCAN/                            # CAZyme annotations
│   │   ├── db_can_out/                   # Raw dbCAN results
│   │   ├── dbcan_HMMER_count_gene_PA_matrix.csv    # CAZyme presence/absence
│   │   ├── dbcan_HMMER_count_gene_count_matrix.csv # CAZyme counts
│   │   └── ...
│   ├── ani/                              # Genome similarity analysis
│   │   ├── skani_fullmatrix              # Genome similarity matrix
│   │   └── skani_ANI_dist.tsv            # ANI distance matrix
│   └── eggNOG/                           # Protein function annotations
│       ├── eggnog_mmseqs.emapper.annotations       # eggNOG annotations
│       └── ...
├── visualization_results/                # Generated plots and figures
│   ├── kofamscan/                        # KEGG visualization
│   │   ├── column_*/                     # Visualizations by metadata column
│   │   ├── KEGG_module_visualization_shiny/  # Interactive KEGG visualization
│   │   └── KEGG_module_completeness.csv  # Module completeness data
│   ├── VFDB/                             # Virulence factor visualization
│   │   ├── heatmap_VFDB_gene_PA_*.pdf    # Static VFDB heatmaps
│   │   └── VFDB_interactive_*/           # Interactive VFDB visualization
│   ├── CARD/                             # Antibiotic resistance visualization
│   │   ├── heatmap_CARD_gene_PA_*.pdf    # Static CARD heatmaps
│   │   └── CARD_interactive_*/           # Interactive CARD visualization
│   ├── dbCAN/                            # CAZyme visualization
│   │   ├── heatmap_dbCAN_gene_PA_*.pdf   # Static dbCAN heatmaps
│   │   └── dbCAN_interactive_*/          # Interactive dbCAN visualization
│   ├── defensefinder/                    # Defense system visualization
│   │   ├── heatmap_defensefinder_*.pdf   # Static defense system heatmaps
│   │   └── defensefinder_interactive_*/  # Interactive defense system visualization
│   ├── ani/                              # Genome similarity visualization
│   │   ├── column_*/                     # Visualizations by metadata column
│   │   ├── heatmap_skani.pdf             # Static skani heatmap
│   │   └── skani_interactive/            # Interactive skani visualization
│   └── scoary2/                          # Gene-trait association results
│       └── scoary_out/                   # Scoary output files
├── genePA_cluster/                       # Gene presence/absence clustering
│   └── pcoa_plot_interactive.html        # PCoA plot of gene presence/absence
├── drep/                                 # Genome dereplication results
│   ├── drep_output/                      # dRep output files
│   ├── dereplicated_genomes/             # Dereplicated genome files
│   └── subspecies_clusters.tsv           # Subspecies cluster information
└── sequence_db/                          # Sequence database
    └── sequences.h5                      # HDF5 database of sequences
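Because each run writes to its own YYYYMMDDHHMMSS directory, scripts that post-process results often need to find the most recent one. A minimal sketch follows, assuming only the 14-digit naming shown above; `latest_run` is a hypothetical helper, and the demo uses a temporary directory in place of the real results folder.

```python
import re
import tempfile
from pathlib import Path

STAMP = re.compile(r"\d{14}")  # YYYYMMDDHHMMSS


def latest_run(base_dir):
    """Return the newest timestamped run directory under base_dir, or None."""
    base = Path(base_dir)
    stamps = sorted(
        p.name for p in base.iterdir()
        if p.is_dir() and STAMP.fullmatch(p.name)
    )
    return base / stamps[-1] if stamps else None


# Demo on a throwaway directory standing in for
# results/metagenome/COMPARATIVE_ANNOTATION/.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("20240101120000", "20240315093000", "notes"):
        (Path(tmp) / name).mkdir()
    latest_name = latest_run(tmp).name
    print(latest_name)
```

Sorting the names as strings is enough here because fixed-width timestamps sort chronologically.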

Execution Examples and Results#

metaFun command line execution example#

[Figure: metaFun command line execution example (image: metafun_pipeline)]

Interactive visualization of results

After running COMPARATIVE_ANNOTATION, you can explore the results interactively using:

metafun -module INTERACTIVE_COMPARATIVE -i ${launchDir}/results/metagenome/COMPARATIVE_ANNOTATION/YYYYMMDDHHMMSS

This launches an interactive interface for exploring annotations, comparing genomes, and generating custom visualizations.

Example visualizations#

The module generates various visualizations including:

  • PCA plots of functional profiles

  • Heatmaps of gene presence/absence

  • Hierarchical clustering of genomes

  • Functional enrichment plots

  • Genome similarity networks
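Ordination plots, heatmaps, and hierarchical clustering of gene presence/absence all start from some pairwise distance between genomes. As an illustration, the sketch below computes Jaccard distances from sets of gene families. The choice of Jaccard distance is an assumption made for this example; the metric the module itself uses is not stated here.

```python
from itertools import combinations


def jaccard_distance(genes_a, genes_b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of gene family names."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0  # two empty genomes are treated as identical
    return 1.0 - len(a & b) / len(union)


# Toy gene family content per genome (illustrative only).
genomes = {
    "g1": {"famA", "famB", "famC"},
    "g2": {"famB", "famC", "famD"},
    "g3": {"famA", "famB", "famC"},
}

distances = {
    (x, y): jaccard_distance(genomes[x], genomes[y])
    for x, y in combinations(sorted(genomes), 2)
}
print(distances)
```

A matrix built this way can be passed to any standard ordination or clustering routine.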

Key Processes in COMPARATIVE_ANNOTATION Module#

| Process | Purpose | Input | Output |
| --- | --- | --- | --- |
| prepare_genomes | Prepares genomes for analysis | Metadata file, input genome directory | Selected genome files |
| create_metadata_summary | Creates metadata information | Metadata file | Column summary file |
| run_prokka | Gene prediction | Genome FASTA files | Predicted genes and proteins |
| run_ppanggolin | Pangenome analysis | Prokka output | Core and accessory genes |
| run_panaroo | Alternative pangenome analysis | Prokka output | Pangenome results |
| run_genePA_cluster | Gene presence/absence clustering | Pangenome, metadata | PCoA visualization |
| run_kofamscan_annotation | KEGG annotation | Protein sequences | KO annotations |
| run_kofamscan_visualization | KEGG visualization | KO matrix, metadata | Interactive KEGG visualizations |
| run_VFDB_annotation | Virulence factor annotation | Protein sequences | Virulence factor identification |
| run_VFDB_visualization | Virulence factor visualization | VFDB results, metadata | VFDB heatmaps and interactive plots |
| run_rgi_CARD_annotation | Antibiotic resistance annotation | Protein sequences | Resistance gene identification |
| run_rgi_CARD_visualization | Resistance gene visualization | CARD results, metadata | CARD heatmaps and interactive plots |
| run_defensefinder_annotation | Defense system analysis | Protein sequences | Defense system identification |
| run_defensefinder_visualization | Defense system visualization | Defense finder results, metadata | Defense system heatmaps |
| run_dbCAN_annotation | CAZyme annotation | Protein sequences | Carbohydrate-active enzyme annotation |
| run_dbCAN_visualization | CAZyme visualization | dbCAN results, metadata | CAZyme heatmaps and interactive plots |
| run_skani_annotation | Genome similarity calculation | Genome FASTA files | ANI similarity matrix |
| run_skani_visualization | Genome similarity visualization | Skani matrix, metadata | ANI heatmaps |
| run_eggNOG | Protein function prediction | Protein sequences | Detailed functional annotations |
| run_scoary2 | Gene-trait association | Gene matrix, metadata | Statistically significant associations |
| run_drep_dereplication | Genome dereplication | Genome FASTA files | Dereplicated genomes and clusters |
| make_sequence_db | Creates integrated sequence database | Prokka outputs, pangenome data, metadata | HDF5 database that links all annotations, sequences and metadata - CRITICAL for INTERACTIVE_COMPARATIVE module |
| create_shiny_dashboard | Dashboard creation | Visualization results | Interactive dashboard for results |
| run_multiqc | Report generation | Visualization results | Summary HTML report |
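The gene-trait association step takes a gene matrix and a metadata column and asks whether a gene's presence tracks a trait. A common first-pass statistic for that question is Fisher's exact test on a 2x2 table; the sketch below implements it with the standard library only. It is offered as an illustration of the idea, and it does not claim to reproduce what Scoary2 computes. The function names and toy data are hypothetical.

```python
from math import comb


def contingency(gene_present, trait_positive, genomes):
    """Count (a, b, c, d) = (gene+trait+, gene-trait+, gene+trait-, gene-trait-)."""
    a = b = c = d = 0
    for g in genomes:
        has_gene, has_trait = g in gene_present, g in trait_positive
        if has_trait:
            a, b = a + has_gene, b + (not has_gene)
        else:
            c, d = c + has_gene, d + (not has_gene)
    return a, b, c, d


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at most as likely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Toy data: the gene is present in exactly the trait-positive genomes.
genomes = [f"g{i}" for i in range(1, 9)]
trait_positive = {"g1", "g2", "g3", "g4"}
gene_present = {"g1", "g2", "g3", "g4"}

table = contingency(gene_present, trait_positive, genomes)
p_value = fisher_exact_two_sided(*table)
print(table, p_value)
```

When many genes are tested, the resulting p-values need multiple-testing correction before they are interpreted.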

Tools Used in COMPARATIVE_ANNOTATION#

| Tool | Purpose | Version | Parameters |
| --- | --- | --- | --- |
| Prokka | Gene prediction | 1.14.6 | Kingdom-specific parameters based on --kingdom |
| PPanGGOLiN | Pangenome analysis | 2.0.5 | Identity (--pan_identity) and coverage (--pan_coverage) thresholds |
| KofamScan | KEGG annotation | 1.3.0 | E-value threshold (--kofamscan_eval) |
| DIAMOND | Sequence alignment | 0.8.36 | Used for VFDB and other database searches |
| RGI (CARD) | Resistance gene identification | 6.0.3 | Default parameters |
| dbCAN | CAZyme annotation | 4.1.4 | E-value (--CAZyme_hmm_eval) and coverage (--CAZyme_hmm_cov) thresholds |
| skani | Genome similarity | 0.2.1 | Default parameters |
| dRep | Genome dereplication | 3.5.0 | ANI threshold (--drep_ani), coverage (--drep_cov), algorithm (--drep_algorithm) |
| eggNOG-mapper | Protein function prediction | 2.1.12 | Default parameters |
| Scoary2 | Gene-trait association | 0.0.15 | Default parameters |
| R packages | Visualization and statistics | 4.3.2 | Various R packages for visualization |
| rHDF5 | Sequence database storage | 1.24.0 | Used for creating the integrated sequence database with metadata |
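To show how the dereplication thresholds interact, the sketch below groups genomes whose pairwise ANI and coverage both pass the --drep_ani and --drep_cov defaults. It uses single-linkage grouping with a small union-find, which is a deliberate simplification: dRep's actual clustering procedure is more involved and is not reproduced here. The function name, the input tuple format, and the values are hypothetical.

```python
def cluster_by_ani(genomes, pairs, ani_min=0.995, cov_min=0.3):
    """Group genomes linked by pairs passing both thresholds.

    pairs: iterable of (genome_a, genome_b, ani, coverage), with ani and
    coverage given as fractions between 0 and 1.
    """
    parent = {g: g for g in genomes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for a, b, ani, cov in pairs:
        if ani >= ani_min and cov >= cov_min:
            parent[find(a)] = find(b)

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


# Illustrative pairwise comparisons.
genomes = ["g1", "g2", "g3", "g4"]
pairs = [
    ("g1", "g2", 0.998, 0.9),
    ("g2", "g3", 0.996, 0.5),
    ("g3", "g4", 0.970, 0.8),  # ANI too low
    ("g1", "g4", 0.999, 0.1),  # coverage too low
]
clusters = cluster_by_ani(genomes, pairs)
print(clusters)
```

Note that g1 and g3 end up together only through g2, which is the chaining behavior typical of single linkage.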

Usage Notes#

  • The COMPARATIVE_ANNOTATION module works best with high-quality genomes selected from the BIN_ASSESSMENT module or filtered using the GENOME_SELECTOR module.

  • For large datasets (>100 genomes), increasing the number of processors (-p) can significantly reduce run time.

  • When specifying metadata columns, pass the column number (1-based) to --metacol and --samplecol.

  • The --samplecol parameter should point to a column containing identifiers that match the prefixes of your genome filenames.

  • If --metacol is not specified, the module will perform annotation but skip statistical analysis and many visualizations.

  • The module processes data in two main stages: annotation (always performed) and visualization (only if --metacol is specified).

  • The sequence database creation (make_sequence_db) is a critical step that enables the INTERACTIVE_COMPARATIVE module to function efficiently.

  • For each functional category (KEGG, VFDB, CARD, CAZymes, defense systems), both presence/absence and count matrices are analyzed.

  • Genome dereplication with dRep is useful to remove highly similar genomes that might bias comparative analyses.

  • You can customize annotation thresholds to make them more or less stringent based on your research needs.
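As an illustration of what "more or less stringent" means in practice, the sketch below applies identity, coverage, and e-value cutoffs to a list of alignment hits, using the VFDB defaults from the parameter table. The `Hit` record, its field names, and the example values are hypothetical, and the inclusive comparisons are an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str
    subject: str
    identity: float  # percent
    coverage: float  # percent
    evalue: float


def passes(hit, min_identity=50.0, min_coverage=80.0, max_evalue=1e-10):
    """True if the hit meets all three cutoffs (assumed inclusive)."""
    return (hit.identity >= min_identity
            and hit.coverage >= min_coverage
            and hit.evalue <= max_evalue)


# Illustrative hits, not real search output.
hits = [
    Hit("gene_1", "vf_a", 92.0, 95.0, 1e-50),
    Hit("gene_2", "vf_b", 55.0, 60.0, 1e-20),  # fails coverage
    Hit("gene_3", "vf_c", 48.0, 90.0, 1e-30),  # fails identity
    Hit("gene_4", "vf_d", 70.0, 85.0, 1e-5),   # fails e-value
    Hit("gene_5", "vf_e", 60.0, 82.0, 1e-12),
]

default_kept = [h.query for h in hits if passes(h)]
strict_kept = [h.query for h in hits
               if passes(h, min_identity=80.0, min_coverage=90.0)]
print(default_kept, strict_kept)
```

Tightening identity and coverage keeps only close, near-full-length matches, at the cost of missing divergent homologs.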

Next Steps#

After running COMPARATIVE_ANNOTATION, you can:

  1. Explore results interactively using the INTERACTIVE_COMPARATIVE module:

    metafun -module INTERACTIVE_COMPARATIVE -i results/metagenome/COMPARATIVE_ANNOTATION/YYYYMMDDHHMMSS
    

    The INTERACTIVE_COMPARATIVE module relies on the sequence database (sequences.h5) created by this module to provide dynamic and customizable analysis.

  2. Perform deeper analysis of specific functions of interest:

    • Investigate KEGG pathways enriched in specific samples

    • Examine virulence factors unique to certain conditions

    • Analyze the distribution of antibiotic resistance genes

    • Study carbohydrate utilization potential across genomes

  3. Use the annotated genomes for other analyses:

    • Custom scripts for specific research questions

    • Integration with other tools or platforms

    • Publication-quality figure generation

  4. Explore the sequence database:

    • The HDF5 sequence database is the backbone of the INTERACTIVE_COMPARATIVE module

    • Contains all gene sequences, protein translations, and functional annotations in a linked format

    • Enables fast retrieval and comparison of genes across genomes

    • Supports on-demand analysis and visualization in the interactive interface
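If you want to look inside the sequence database yourself, a reasonable first step is to list what it contains. The internal layout of sequences.h5 is not described here, so the sketch below makes no assumption about group or dataset names: it walks any nested mapping and returns the paths it finds. It is demonstrated on a plain dictionary with a made-up layout; an open h5py file exposes the same .items() interface on its groups.

```python
def list_paths(node, prefix=""):
    """Recursively collect leaf paths from a nested mapping-like object."""
    paths = []
    for name, child in node.items():
        path = f"{prefix}/{name}"
        if hasattr(child, "items"):   # group-like: descend into it
            paths.extend(list_paths(child, path))
        else:                         # dataset-like: record its path
            paths.append(path)
    return paths


# Hypothetical layout, used only to exercise the function.
mock_db = {
    "genomes": {"genome1": {"ffn": b"ATGAAATAA", "faa": b"MK"}},
    "metadata": [b"genome1,healthy"],
}
paths = list_paths(mock_db)
print(paths)

# With h5py installed, the same call should work on the real file:
#   import h5py
#   with h5py.File("sequence_db/sequences.h5", "r") as db:
#       print(list_paths(db))
```

Opening the file in read-only mode ("r") avoids accidentally modifying a database that INTERACTIVE_COMPARATIVE depends on.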

The COMPARATIVE_ANNOTATION module provides a comprehensive foundation for comparative genomic analysis, enabling researchers to gain insights into the functional potential and differences across genomes from various conditions or environments.