COMPARATIVE_ANNOTATION#
This Nextflow script implements a workflow for comparative annotation of metagenome-assembled genomes (MAGs).
COMPARATIVE_ANNOTATION Workflow#
Overview#
This Nextflow script implements a comparative genomics workflow for metagenome-assembled genomes (MAGs). It runs pangenome analysis, functional annotation, and visualization of the results, and it skips steps whose results already exist.
Key Features#
Dynamic execution based on existing results
Pangenome analysis using PPanGGOLiN
Functional annotations: KEGG (KofamScan), virulence factors (VFDB), antibiotic resistance (CARD), CAZymes (dbCAN)
Genome similarity analysis using Skani
Protein function prediction using eggNOG-mapper
Gene-trait association analysis using Scoary2
Flexible visualization for both gene presence/absence and gene count data
Inputs#
Quality-controlled genome assemblies (FASTAs)
Metadata file with sample information
Outputs#
Annotation results for each analysis tool
Visualization results including interactive plots and heatmaps
Separate visualizations for gene presence/absence and gene count data
Main Parameters#
| Parameter | Description | Default Value |
|---|---|---|
|  | Input directory for genome files |  |
|  | Path to metadata file |  |
|  | Column number in metadata to use | Required (no default) |
|  | Output directory |  |
|  | Number of CPUs to use | 40 |
|  | PPanGGOLiN identity threshold | 0.8 |
|  | PPanGGOLiN coverage threshold | 0.8 |
|  | KEGG module completeness threshold | 0.5 |
|  | KEGG KO e-value threshold | 0.00001 |
|  | VFDB identity threshold | 50 |
|  | VFDB coverage threshold | 80 |
|  | VFDB e-value threshold | 1e-10 |
|  | CAZyme HMM e-value threshold | 1e-15 |
|  | CAZyme HMM coverage threshold | 0.35 |
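To illustrate how identity, coverage, and e-value thresholds of this kind typically combine, here is a minimal Python sketch using the VFDB defaults from the table (50, 80, 1e-10). It is not taken from the workflow: the `Hit` record, the `filter_hits` function, and the example hits are hypothetical, and it assumes a hit is kept only when it meets all three thresholds, with identity and coverage expressed as percentages.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one alignment hit (not the workflow's own format)."""
    query: str
    subject: str
    identity: float  # percent identity, 0-100
    coverage: float  # percent coverage, 0-100
    evalue: float


def filter_hits(hits, min_identity=50.0, min_coverage=80.0, max_evalue=1e-10):
    """Keep hits that satisfy all three thresholds at once (assumed behavior)."""
    return [
        h for h in hits
        if h.identity >= min_identity
        and h.coverage >= min_coverage
        and h.evalue <= max_evalue
    ]


# Made-up example hits, one passing and three failing a single threshold each.
hits = [
    Hit("gene_1", "VF0001", 92.5, 98.0, 1e-50),  # passes all thresholds
    Hit("gene_2", "VF0002", 45.0, 95.0, 1e-30),  # identity too low
    Hit("gene_3", "VF0003", 75.0, 60.0, 1e-40),  # coverage too low
    Hit("gene_4", "VF0004", 60.0, 85.0, 1e-5),   # e-value too high
]
kept = filter_hits(hits)
print([h.query for h in kept])
```

Raising the identity or coverage threshold, or lowering the e-value threshold, makes the filter stricter and reduces the number of reported hits.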
Workflow Structure#
The workflow dynamically determines which steps to run based on the existence of previous results:
Genome Preparation: Selects and prepares input genomes based on metadata.
Annotation:
Prokka for gene prediction (if not already run)
PPanGGOLiN for pangenome analysis (if not already run)
KofamScan for KEGG annotation
VFDB for virulence factor annotation
CARD for antibiotic resistance gene detection
dbCAN for CAZyme annotation
Skani for genome similarity analysis
eggNOG for protein function prediction
Visualization:
Creates visualizations for both gene presence/absence and gene count data
Includes heatmaps and interactive plots for each annotation type
Additional Analyses:
Scoary2 for gene-trait association analysis
Metadata summary creation
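The "run only if not already run" behavior described above can be sketched in Python as follows. This is an illustration of the idea, not the workflow's actual logic: the step names, the relative output paths in `EXPECTED_OUTPUTS`, and the rule that a step counts as done when its output is a non-empty file or non-empty directory are all assumptions.

```python
import tempfile
from pathlib import Path

# Hypothetical mapping from step name to the output it is expected to leave
# behind, relative to the output directory. Paths are illustrative only.
EXPECTED_OUTPUTS = {
    "prokka": "prokka",
    "ppanggolin": "ppanggolin",
    "kofamscan": "kofamscan/ko_matrix.tsv",
}


def has_results(target: Path) -> bool:
    """A directory counts as done if it has entries; a file if it is non-empty."""
    if target.is_dir():
        return any(target.iterdir())
    return target.is_file() and target.stat().st_size > 0


def steps_to_run(outdir, expected=EXPECTED_OUTPUTS):
    """Return the steps whose expected output is missing or empty."""
    return [step for step, rel in expected.items()
            if not has_results(Path(outdir) / rel)]


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    # Simulate a partial run: Prokka finished, PPanGGOLiN left an empty directory.
    (out / "prokka").mkdir()
    (out / "prokka" / "genome_A.gff").write_text("##gff-version 3\n")
    (out / "ppanggolin").mkdir()
    pending = steps_to_run(out)
    print(pending)
```

Treating an empty directory as "not done" is a deliberate choice in this sketch, so that a step interrupted before writing any output is run again.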
Key Processes and Their Outputs#
| Process | Output |
|---|---|
|  | Prokka annotation files |
|  | Pangenome analysis results |
|  | KO matrix |
|  | VFDB annotation results (PA and count) |
|  | CARD annotation results (PA and count) |
|  | CAZyme annotation results (PA and count) |
|  | Genome similarity matrix |
|  | eggNOG annotation results |
Visualization Processes#
Each visualization process creates separate outputs for gene presence/absence (PA) and gene count data:
run_kofamscan_visualization
run_VFDB_visualization
run_rgi_CARD_visualization
run_dbCAN_visualization
run_skani_visualization
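The relationship between the two data types can be shown with a short Python sketch: a presence/absence (PA) matrix is a count matrix with every positive count collapsed to 1. The function name, the nested-dictionary layout, and the example genomes and counts are hypothetical, and the workflow's own file formats may differ.

```python
def counts_to_presence_absence(counts):
    """Collapse a genome-by-feature count matrix to 1 (present) / 0 (absent)."""
    return {
        genome: {feature: 1 if n > 0 else 0 for feature, n in row.items()}
        for genome, row in counts.items()
    }


# Made-up gene counts per genome for three CAZyme families.
counts = {
    "MAG_1": {"GH13": 3, "GT2": 0, "CE4": 1},
    "MAG_2": {"GH13": 0, "GT2": 5, "CE4": 2},
}
pa = counts_to_presence_absence(counts)
print(pa["MAG_1"])
```

The count matrix keeps copy-number information, while the PA matrix only records whether a feature occurs at all, which is why the two are visualized separately.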
Usage#
nextflow run COMPARATIVE_ANNOTATION_apptainer.nf --metadata [path_to_metadata] --metacol [metadata_column] [additional_options]
Requirements#
Nextflow
Apptainer/Singularity containers with required tools
Input genomes and metadata file
For detailed information on parameters and usage, run the script with the --help flag.
Notes#
The workflow uses Apptainer (formerly Singularity) containers for tool execution.
Results are organized in a structured output directory for easy navigation.
The script includes error checking for critical parameters and input files.
Visualization results are designed to be interactive and easily interpretable.
The workflow can resume from partial runs: steps whose results already exist are skipped, which avoids repeating completed work.
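As an illustration of the kind of error checking mentioned in these notes, the following Python sketch validates a metadata path and column number before any analysis starts. It is not the script's actual check: the `validate_inputs` function is hypothetical, and it assumes a tab-delimited metadata file with a header row and a 1-based column number, none of which is stated in this document.

```python
import csv
import tempfile
from pathlib import Path


def validate_inputs(metadata_path, metacol, delimiter="\t"):
    """Fail early on a missing file or bad column; return the chosen column name.

    Assumptions: the metadata file has a header row, is tab-delimited,
    and metacol is a 1-based column number.
    """
    if metacol is None:
        raise ValueError("--metacol is required and has no default")
    path = Path(metadata_path)
    if not path.is_file():
        raise FileNotFoundError(f"metadata file not found: {path}")
    with path.open(newline="") as handle:
        header = next(csv.reader(handle, delimiter=delimiter), None)
    if not header:
        raise ValueError(f"metadata file is empty: {path}")
    col = int(metacol)
    if not 1 <= col <= len(header):
        raise ValueError(f"--metacol {col} is outside 1..{len(header)}")
    return header[col - 1]


with tempfile.TemporaryDirectory() as tmp:
    # Made-up metadata file with three columns.
    meta = Path(tmp) / "metadata.tsv"
    meta.write_text("sample\tsite\thost\nMAG_1\tgut\thuman\n")
    selected = validate_inputs(meta, 2)
    print(selected)
```

Checking these inputs up front means a typo in a path or column number is reported immediately instead of after long-running annotation steps.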