RAWREAD_QC#

This module is part of the metaFun pipeline and performs quality control of raw reads from whole-metagenome sequencing data.

Overview#

The RAWREAD_QC module is the first step in the metaFun pipeline. It performs quality assessment, trimming, and host read filtering to prepare high-quality reads for downstream analyses. Reads can be filtered against any host genome once that genome has been indexed with metaFun, or host read filtering can be skipped entirely.

Module Execution#

# Basic usage
(metafun) metafun -module RAWREAD_QC -i /path/to/raw_reads

# Skip host filtering
(metafun) metafun -module RAWREAD_QC -i /path/to/raw_reads --filter none

# Use a custom host genome for filtering:
# index your genome first, then run the RAWREAD_QC module with the same name.
(metafun) metafun -module PREPARE_CUSTOM_HOST -i yourgenome.fasta -f custom_genome_name
(metafun) metafun -module RAWREAD_QC -i /path/to/raw_reads --filter custom_genome_name

Host read filtering options.

The -f/--filter option accepts three kinds of value: human, none (skip filtering), or the name of a custom genome that you have indexed.

The default is the human genome, so if your samples come from a human host you do not need to specify a filter.
To skip host read filtering, specify -f none or --filter none.

Skip the host read filtering step and the subsequent FastQC step#

# To skip host read filtering entirely, use this command.
(metafun) metafun -module RAWREAD_QC -i ${inputDir} -f none

To filter reads against your own genome, index it first and then pass its name to RAWREAD_QC.

Use your custom genome for filtering#

# Index your genome under a name of your choice. Any name can be given with `-f`, but the same name must be used when running the RAWREAD_QC module.
(metafun) metafun -module PREPARE_CUSTOM_HOST -i yourgenome.fasta -f anyname
# Pass the name used when indexing your genome (here, `anyname`) as the filter.
(metafun) metafun -module RAWREAD_QC -i ${inputDir} -f anyname
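
Because the index name and the filter name must match, it can help to define the name once and reuse it. The following is a minimal sketch, not part of metaFun: the helper name `filter_with_custom_host` is hypothetical, and it assumes that `metafun` is available in the activated environment (the `(metafun)` prompt above). It uses only the flags shown in this document.

```shell
# Hypothetical helper: index a custom host genome, then run RAWREAD_QC with
# the same name. The body runs in a subshell so its variables stay local.
filter_with_custom_host() (
    genome_fasta=$1   # e.g. yourgenome.fasta
    host_name=$2      # any name; reused for both steps
    input_dir=$3      # directory containing the raw reads

    if [ ! -f "$genome_fasta" ]; then
        echo "genome FASTA not found: $genome_fasta" >&2
        exit 1
    fi

    # Step 1: build the index under the chosen name.
    metafun -module PREPARE_CUSTOM_HOST -i "$genome_fasta" -f "$host_name" || exit 1
    # Step 2: filter reads against the genome indexed under that name.
    metafun -module RAWREAD_QC -i "$input_dir" -f "$host_name"
)
```

For example, `filter_with_custom_host yourgenome.fasta anyname /path/to/raw_reads` would run both commands above with the same name.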

Module Operation Sequence#

The module performs the following steps:

1. FastQC analysis on raw reads
2. Trimming and quality filtering using fastp
3. Removal of host (e.g., human) reads using Bowtie2 (optional)
4. FastQC analysis on filtered reads
5. MultiQC report generation

Parameters#

${launchDir} is the directory from which you execute metaFun; it is used as the base directory for output.

Parameter	Description	Default Value	Note
`-i, --inputDir`	Input directory containing raw reads	None (must be set to the directory containing the raw reads)	Required
`-o, --outdir`	Output directory	`${launchDir}/results/metagenome/RAWREAD_QC`	The default output directory is recommended for downstream analysis.
`-f, --filter`	Host genome filtering option	`human`	Options: `human`, `none`, or custom name
`-p, --processors`	Number of CPUs to use	`4`
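
To show how the four parameters combine, here is a small sketch that spells out every option with its documented default. The wrapper name `run_rawread_qc` is hypothetical, and `$PWD` stands in for `${launchDir}` on the assumption that metaFun is launched from the current directory.

```shell
# Hypothetical wrapper: run RAWREAD_QC with every documented option explicit.
# Runs in a subshell so the variables do not leak into the calling shell.
run_rawread_qc() (
    input_dir=$1                                        # -i, required
    out_dir=${2:-"$PWD/results/metagenome/RAWREAD_QC"}  # -o, documented default
    host_filter=${3:-human}                             # -f: human, none, or a custom name
    cpus=${4:-4}                                        # -p, documented default

    if [ -z "$input_dir" ]; then
        echo "usage: run_rawread_qc INPUT_DIR [OUT_DIR] [FILTER] [CPUS]" >&2
        exit 2
    fi
    metafun -module RAWREAD_QC -i "$input_dir" -o "$out_dir" -f "$host_filter" -p "$cpus"
)
```

Calling `run_rawread_qc /path/to/raw_reads` is then equivalent to the basic usage command, while `run_rawread_qc /path/to/raw_reads "$PWD/results/metagenome/RAWREAD_QC" none 8` skips host filtering and uses 8 CPUs.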

Inputs and Outputs#

Inputs#

Raw paired-end metagenomic reads (FASTQ format, can be gzipped)
Specify directory of input reads with -i ${inputDir}
File naming convention: *_1.fastq.gz/*_2.fastq.gz or *_1.fq.gz/*_2.fq.gz, or the same names without the .gz suffix.
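
Before launching the module, it can be useful to confirm that every forward read file has a matching reverse file under the naming convention above. The sketch below is an illustrative pre-flight check, not part of metaFun; the function name `list_paired_samples` is hypothetical.

```shell
# Print the sample ID of every complete read pair in a directory and warn on
# stderr about forward reads that lack a reverse mate.
# Returns non-zero if any mate is missing.
list_paired_samples() (
    input_dir=$1
    status=0
    for ext in fastq.gz fq.gz fastq fq; do
        for r1 in "$input_dir"/*_1."$ext"; do
            [ -e "$r1" ] || continue              # the glob matched nothing
            sample_id=$(basename "$r1" "_1.$ext")
            if [ -e "$input_dir/${sample_id}_2.$ext" ]; then
                echo "$sample_id"
            else
                echo "missing mate for $r1" >&2
                status=1
            fi
        done
    done
    exit "$status"
)
```

Running `list_paired_samples /path/to/raw_reads` prints one sample ID per line, grouped by file extension.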

Outputs#

Quality-controlled and filtered metagenomic reads
Quality assessment reports and visualizations

Output directory structure#

By default, output is written under ${launchDir}; you can choose another location with -o ${outdir}. ${launchDir} is the directory from which you execute metaFun.

The structure shown here assumes that you kept the default output directory, as recommended.

Switching input and output directories.

If you set a custom output directory with -o ${outdir}, you must also adjust the corresponding parameters in the other modules. The default output directory is results/metagenome/RAWREAD_QC in your ${launchDir}.

Output directory structure#

# Input samples are in ${inputDir}; input FASTQ file names may be ${sample_id}_{1,2}.fastq.gz or ${sample_id}_{1,2}.fq.gz, or the same names without the .gz suffix.

${launchDir}/results/metagenome/RAWREAD_QC/
├── fastqc_raw/                      # FastQC results for raw reads
│   ├── ${sample_id}_1_fastqc.html
│   ├── ${sample_id}_1_fastqc.zip 
│   ├── ${sample_id}_2_fastqc.html
│   ├── ${sample_id}_2_fastqc.zip
│   ├── ...
├── read_filtered/                   # Host-filtered reads
│   ├── ${sample_id}_fastp_hg38_1.fastq.gz
│   └── ${sample_id}_fastp_hg38_2.fastq.gz
├── fastqc_filtered/                 # FastQC results for filtered reads
│   ├── ${sample_id}_fastp_hg38_fastqc_filtered/
│   │   ├── ${sample_id}_fastp_hg38_1_fastqc.html
│   │   ├── ${sample_id}_fastp_hg38_1_fastqc.zip
│   │   ├── ${sample_id}_fastp_hg38_2_fastqc.html
│   │   ├── ${sample_id}_fastp_hg38_2_fastqc.zip
│   ├── ...
└── multiqc/                         # MultiQC report
    ├── multiqc_report.html
    └── multiqc_data/
        └── ...
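
After a run, the layout above can be checked per sample. The following sketch tests for a few of the files shown in the tree; the function name `check_qc_outputs` is hypothetical, and the `_fastp_hg38_` file names are taken from the tree above, so they are assumed to apply to the default human filter only.

```shell
# Check that key RAWREAD_QC outputs exist and are non-empty for one sample.
# File names follow the tree above; other filter settings may name files differently.
check_qc_outputs() (
    out_dir=$1      # e.g. results/metagenome/RAWREAD_QC
    sample_id=$2
    status=0
    for path in \
        "fastqc_raw/${sample_id}_1_fastqc.html" \
        "fastqc_raw/${sample_id}_2_fastqc.html" \
        "read_filtered/${sample_id}_fastp_hg38_1.fastq.gz" \
        "read_filtered/${sample_id}_fastp_hg38_2.fastq.gz" \
        "multiqc/multiqc_report.html"
    do
        if [ ! -s "$out_dir/$path" ]; then
            echo "missing or empty: $out_dir/$path" >&2
            status=1
        fi
    done
    exit "$status"
)
```

For example, `check_qc_outputs results/metagenome/RAWREAD_QC sampleA` returns non-zero and lists the affected paths if any of these files is absent.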

Execution Examples and Results#

metaFun command line execution example#

MultiQC report example#

The MultiQC report integrates raw-read and filtered-read quality statistics across all samples.

[Figure: example of the HTML MultiQC report]

Nextflow Processes in RAWREAD_QC Module#

Process	InputDir	OutputDir	Note
fastqc_raw	`${params.inputDir}`	`${params.outdir}/fastqc_raw`	FastQC analysis results for raw reads
fastp	`${params.inputDir}`	None	Performs trimming and quality filtering
humanread_filter	Output from fastp	`${params.outdir}/read_filtered`	Removes host reads (when params.filter is not ‘none’)
fastqc_filtered	Output from humanread_filter	`${params.outdir}/fastqc_filtered`	FastQC analysis results for filtered reads
multiQC	Results from all previous processes	`${params.outdir}/multiqc`	Comprehensive report of all QC results

Descriptions of Processes in RAWREAD_QC Module#

FastQC on Raw Reads: Performs FastQC analysis on the raw input metagenomic reads specified by --inputDir ${your input directory}.
fastp Processing: Trims and filters the metagenomic reads using fastp.
Host Read Filtering (optional): Removes host reads using Bowtie2. The default host genome is GRCh38_p12.dna.primary_assembly (-f human). If you do not want to filter out any host reads, specify -f none on the command line. To filter against your own genome of interest, first index it with (metafun) metafun -module PREPARE_CUSTOM_HOST -i $yourgenome -f ${value}, then pass the same ${value} to -f when running RAWREAD_QC.
FastQC on Filtered Reads: Performs FastQC analysis on the filtered reads.
MultiQC Report: Generates a MultiQC report combining all QC results.

Tools Used in RAWREAD_QC#

Tool	Purpose	Version	Default parameters	Parameters that can be selected
FastQC	Quality control checks on raw sequence data	v0.12.1	default	Only the number of CPUs, via `--cpus ${number}`
fastp	Trimming and filtering of raw metagenomic reads	0.23.4	default	Only the number of CPUs, via `--cpus ${number}`
Bowtie2	Alignment of reads to remove host contamination	2.5.2	`--very-sensitive`: sensitivity preset; `--un-conc-gz`: writes gzipped read pairs that do not align concordantly to the host genome; `--end-to-end`: end-to-end alignment mode	None
MultiQC	Aggregate results from bioinformatics analyses	v1.18	No specific parameters in this script	None
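
To illustrate how the Bowtie2 options in the table fit together, the sketch below prints a Bowtie2 command line that uses them. This is an assumption about the general shape of such a call, not the command metaFun actually runs: the function name `print_host_filter_cmd`, the index prefix, and the input file names are hypothetical. In `--un-conc-gz`, Bowtie2 replaces `%` with `1` or `2`, which would produce file names like those under `read_filtered/`.

```shell
# Print (not run) an illustrative Bowtie2 host-filtering command.
# The index prefix and the *_fastp_{1,2}.fastq.gz input names are hypothetical.
print_host_filter_cmd() (
    index_prefix=$1   # Bowtie2 index prefix of the host genome
    sample_id=$2
    cpus=${3:-4}
    printf '%s\n' "bowtie2 --very-sensitive --end-to-end -p $cpus -x $index_prefix -1 ${sample_id}_fastp_1.fastq.gz -2 ${sample_id}_fastp_2.fastq.gz --un-conc-gz ${sample_id}_fastp_hg38_%.fastq.gz -S /dev/null"
)
```

Read pairs that do not align concordantly to the host are kept in the `--un-conc-gz` files, while the alignments themselves are discarded by sending the SAM output to `/dev/null`.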

Usage Notes#

A custom index can be generated under a chosen name with metafun -module PREPARE_CUSTOM_HOST -i yourgenome.fasta -f name.
The script checks that the input directory exists and is not empty before proceeding.
The output of this module is used as input for ASSEMBLY_BINNING, WMS_TAXONOMY, and WMS_FUNCTION. It is a mandatory input for the Kraken2 with Bracken analysis in WMS_TAXONOMY.
