RAWREAD_QC#

This Nextflow script implements a workflow for quality control of raw reads from whole metagenome sequencing data.

Workflow Execution#

RAWREAD_QC execution#

# Activate metafun conda environment
conda activate metafun
# Move to the directory where you cloned metafun from GitHub.
(metafun) nextflow run nf_scripts/RAWREAD_QC_apptainer.nf --inputDir ${inputDir}
# ${inputDir} should contain short paired-end metagenomic reads.
# Files may be gzipped and may use the fastq or fq suffix.

Host read filtering options

There are three host read filtering options: human, none (skip filtering), or your own custom genome.

The default is the human genome. If your samples come from a human host, you do not need to specify --filter.
To skip host read filtering, specify --filter none.

Skip the host read filtering step and the subsequent FastQC step#

# Use this command to skip host read filtering entirely.
nextflow run nf_scripts/RAWREAD_QC_apptainer.nf --inputDir ${inputDir} --filter none

To filter out reads with your own genome, index the genome first and then pass its name to --filter.

Use your custom genome for filtering#

# Index your genome under a name of your choice. Here we assume it is named mygenome.
(metafun) sh generate_custom_filter_usingyourgenome.sh -i ${your_custom_genome} -o mygenome
# Pass the same name (mygenome) that you used when indexing as the filter name.
(metafun) nextflow run nf_scripts/RAWREAD_QC_apptainer.nf --inputDir ${inputDir} --filter mygenome

Workflow Overview#

The workflow performs the following steps:

FastQC analysis on raw reads
Trimming and quality filtering using fastp
Removal of host (e.g., human) reads using Bowtie2 (optional)
FastQC analysis on filtered reads
MultiQC report generation

The output of this workflow is used as input for ASSEMBLY_BINNING, WMS_TAXONOMY, and WMS_FUNCTION, and it is a mandatory input for WMS_TAXONOMY.

Inputs and Outputs#

Inputs: Raw paired-end short-read metagenomic reads, specified with --inputDir ${inputDir}.
Outputs: Quality-controlled, filtered metagenomic reads, together with QC statistics reports and their visualizations. The default output directory (outdir) is ${launchDir}/results/metagenome/RAWREAD_QC.

Details and examples of the output files are described in Output file details and examples.

${launchDir} is the directory from which you run the Nextflow script.
This should be the base folder of the git-cloned metaFun repository (aababc1/metaFun).

${launchDir} = ${base directory of metafun}
${launchDir}/results/metagenome/RAWREAD_QC = ${params.outdir}

Process	InputDir	OutputDir	Note
fastqc_raw	`${params.inputDir}`	`${params.outdir}/fastqc_raw`	FastQC analysis results for raw reads
fastp	`${params.inputDir}`	`${params.outdir}/fastp`	Performs trimming and quality filtering
humanread_filter	Output from fastp	`${params.outdir}/read_filtered`	Removes host reads (when params.filter is not ‘none’)
fastqc_filtered	Output from humanread_filter	`${params.outdir}/fastqc_filtered`	FastQC analysis results for filtered reads
multiQC	Results from all previous processes	`${params.outdir}/multiqc`	Comprehensive report of all QC results
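
The directory layout in the table above can be sketched as a small shell helper. This is only an illustration: `list_qc_outdirs` is a hypothetical function that is not part of metaFun, and the default path assumes that the current directory is `${launchDir}`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of metaFun): prints the per-process output directories.
list_qc_outdirs() {
  local outdir="${1:-$PWD/results/metagenome/RAWREAD_QC}"  # default assumes $PWD is launchDir
  local filter="${2:-human}"                               # documented default filter
  local sub
  for sub in fastqc_raw fastp read_filtered fastqc_filtered multiqc; do
    if [ "$filter" = "none" ]; then
      # Assumption taken from this page: with --filter none, host filtering and the
      # subsequent FastQC step are skipped, so these two directories are not expected.
      case "$sub" in
        read_filtered|fastqc_filtered) continue ;;
      esac
    fi
    echo "${outdir}/${sub}"
  done
}

# Example: expected directories for a run with --filter none.
list_qc_outdirs /data/qc none
```

With `--filter none`, the sketch lists only the `fastqc_raw`, `fastp`, and `multiqc` directories.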

Switching input and output directories

If you change the input or output directory with --inputDir ${your input directory} or --outdir ${output directory}, you must update the corresponding parameters in the other workflows as well. The default output directory is results/metagenome/RAWREAD_QC.

This documentation assumes that you set only the input directory, either with --inputDir ${inputDir} or in params.file, and leave the output directory at its default, results/metagenome/RAWREAD_QC.
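
As a hedged illustration of keeping these directories consistent, the sketch below only assembles and prints the RAWREAD_QC command line; it does not run Nextflow. `build_rawread_qc_cmd` is a hypothetical helper, while the `--inputDir`, `--outdir`, and `--filter` flags are the ones documented on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of metaFun): prints the command instead of running it,
# so the directories can be reviewed before launching the workflow.
build_rawread_qc_cmd() {
  local input_dir="$1"   # required: directory with raw paired-end reads
  local outdir="${2:-}"  # optional: leave empty to keep the default outdir
  local filter="${3:-}"  # optional: human (default), none, or a custom index name
  local cmd="nextflow run nf_scripts/RAWREAD_QC_apptainer.nf --inputDir ${input_dir}"
  if [ -n "$outdir" ]; then
    cmd="${cmd} --outdir ${outdir}"
  fi
  if [ -n "$filter" ]; then
    cmd="${cmd} --filter ${filter}"
  fi
  echo "$cmd"
}

# Example: custom input and output directories, no host filtering.
build_rawread_qc_cmd raw_reads my_results/RAWREAD_QC none
```

If you pass a non-default `--outdir` this way, remember to point the downstream workflows at the same location.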

Parameters in RAWREAD_QC Nextflow Script#

Parameter	Description	Default Value	Note
`db_baseDir`	Base directory for databases	`/opt/database` in the apptainer environment; `${git base directory}/database` on your server.	This is the mounted path in the apptainer environment. We do not recommend changing it.
`scripts_baseDir`	Base directory for scripts	`"/scratch/tools/microbiome_analysis/scripts"` in the apptainer environment; `${git base directory}/scripts` on your server.	This is the mounted path in the apptainer environment. We do not recommend changing it.
`filter`	Type of read filtering to perform	`'human'`	Controls host read filtering. The default is human (GRCh38_p12.dna.primary_assembly). To skip this step, pass `--filter none` on the command line or change the value in the params file. To filter out reads from another host genome, first index that genome with `sh generate_custom_filter_usingyourgenome.sh -i $yourgenome -o $output`.
`human_index`	Path to human genome index	`"${params.db_baseDir}/host_genome/GRCh38/GRCh38_p12.dna.primary_assembly.fa"`	This is the default index for the RAWREAD_QC filtering step. To skip filtering or use another genome, specify `--filter none` or `--filter ${your genome index}`.
`custom_index`	Path to custom genome index	Set automatically from `--filter ${value}` as `${params.db_baseDir}/host_genome/${value}/${value}`	You can use any genome whose reads you want to remove from the metagenome. First, index the genome with the provided script: `sh generate_custom_filter_usingyourgenome.sh -i $yourgenome -o ${value}`. Then run Nextflow with `--filter ${value}`, or set that value in `params.file`.
`outdir`	Output directory	`"${launchDir}/results/metagenome/RAWREAD_QC"`	If you change `outdir`, set `inputDir` in the other workflows to match.
`cpus`	Number of CPUs to use	`4`	Number of CPUs used by each program.
`inputDir`	Input directory containing raw reads	`'raw_reads'`	Directory containing short paired-end metagenomic reads. Files may be gzipped, and the suffix may be fastq or fq, for example `1.fastq,2.fastq`, `1.fastq.gz,2.fastq.gz`, `1.fq,2.fq`, `1.fq.gz,2.fq.gz`.
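
The relationship between `filter`, `human_index`, and `custom_index` in the table above can be sketched as follows. This is a minimal illustration, not the workflow's actual code: `resolve_filter_index` is a hypothetical helper, and the default `db_baseDir` shown is the documented path inside the apptainer environment.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of metaFun): maps a --filter value to the index path
# described in the parameter table.
resolve_filter_index() {
  local filter="${1:-human}"           # 'human' is the documented default
  local db_base="${2:-/opt/database}"  # documented db_baseDir in the apptainer environment
  case "$filter" in
    none)  echo "" ;;  # host filtering is skipped, so no index is needed
    human) echo "${db_base}/host_genome/GRCh38/GRCh38_p12.dna.primary_assembly.fa" ;;
    *)     echo "${db_base}/host_genome/${filter}/${filter}" ;;  # custom_index pattern
  esac
}

# Example: the index path expected for --filter mygenome.
resolve_filter_index mygenome
```

For `mygenome`, the sketch prints `/opt/database/host_genome/mygenome/mygenome`, which follows the `custom_index` pattern from the table.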

Descriptions of Processes in RAWREAD_QC Workflow#

FastQC on Raw Reads: Performs FastQC analysis on the raw input metagenomic reads specified by --inputDir ${your input directory}.
Fastp Processing: Trims and filters the metagenomic reads using fastp.
Host Read Filtering (optional): Removes host reads using Bowtie2 if params.filter is not set to “none”. The human genome is the default. If your metagenomes do not need host read removal, pass --filter none on the command line or change the value in params.file. If you want to use your own genome instead, index it with sh generate_custom_filter_usingyourgenome.sh -i $yourgenome -o ${value} in the metafun directory.
FastQC on Filtered Reads: Performs FastQC analysis on the filtered reads.
MultiQC Report: Generates a MultiQC report combining all QC results.

Tools Used in RAWREAD_QC#

Tool	Purpose	Version	Default parameters	Parameters that can be selected
FastQC	Quality control checks on raw sequence data	v0.12.1	default	Only the CPU count, via `--cpus ${number}`
fastp	Trimming and filtering of raw metagenomic reads	0.23.4	default	Only the CPU count, via `--cpus ${number}`
Bowtie2	Alignment of reads to remove host contamination	2.5.2	`--very-sensitive`: sensitivity preset; `--un-conc-gz`: writes gzipped paired reads that do not align concordantly to the host genome; `--end-to-end`: alignment mode	None
MultiQC	Aggregate results from bioinformatics analyses	v1.18	No specific parameters in this script	None

Usage Notes#

A custom index can be used by specifying --custom_index on the command line or by modifying the parameter in params.file.
The script checks for the existence and non-emptiness of the input directory before proceeding.
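
A comparable pre-flight check can be run by hand before launching the workflow. The sketch below is an assumption about what such a check might look like, not the script's own implementation: `check_input_dir` is a hypothetical helper, and it only inspects the top level of the directory for the documented suffixes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of metaFun): verifies that the input directory exists
# and contains at least one file with a documented read suffix.
check_input_dir() {
  local dir="$1"
  if [ ! -d "$dir" ]; then
    echo "Input directory not found: $dir" >&2
    return 1
  fi
  # Count fastq/fq files, optionally gzipped, at the top level only.
  local n
  n=$(find "$dir" -maxdepth 1 -type f \
        \( -name '*.fastq' -o -name '*.fastq.gz' -o -name '*.fq' -o -name '*.fq.gz' \) \
        | wc -l | tr -d ' ')
  if [ "$n" -eq 0 ]; then
    echo "No fastq/fq read files in: $dir" >&2
    return 1
  fi
  echo "Found $n read file(s) in $dir"
}
```

Calling `check_input_dir "${inputDir}"` before `nextflow run` returns a non-zero status when the directory is missing or holds no read files.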

Output file details and examples#

If you did not specify an output directory, the default is ${launchDir}/results/metagenome/RAWREAD_QC. The workflow generates the following outputs in outdir:

FastQC reports for raw and filtered reads
Fastp filtered reads and JSON report
Host-filtered reads (if applicable)
MultiQC report summarizing all QC steps