RAWREAD_QC#
This module is part of the metaFun pipeline and performs quality control of raw reads from whole-metagenome sequencing data.
Overview#
RAWREAD_QC is the first step in the metaFun pipeline. It performs quality assessment, trimming, and host read filtering to prepare high-quality reads for downstream analyses. Any host genome can be indexed with metaFun and used for filtering, or host read filtering can be skipped entirely.
Module Execution#
# Basic usage
(metafun) metafun -module RAWREAD_QC -i /path/to/raw_reads
# Skip host filtering
(metafun) metafun -module RAWREAD_QC -i /path/to/raw_reads --filter none
# Use custom host genome for filtering (prepare your genome first).
# Index your genome and run RAWREAD_QC module.
(metafun) metafun -module PREPARE_CUSTOM_HOST -i yourgenome.fasta -f custom_genome_name
(metafun) metafun -module RAWREAD_QC -i /path/to/raw_reads --filter custom_genome_name
Host read filtering options
There are three host filtering options: human, none (skip filtering), or a custom genome that you have indexed.
The default is the human genome. If your samples come from a human host, you do not need to specify a filter.
To skip host read filtering, specify
-f none or --filter none.
# Skip host read filtering entirely.
(metafun) metafun -module RAWREAD_QC -i ${inputDir} -f none
To filter reads against your own genome, index it first and then pass its name to RAWREAD_QC.
# Index your genome under a name of your choice. Any name can be given with `-f`, but you must use the same name when running the RAWREAD_QC module.
(metafun) metafun -module PREPARE_CUSTOM_HOST -i yourgenome.fasta -f anyname
# Specify the same filter name (here `anyname`) that you used when indexing your genome.
(metafun) metafun -module RAWREAD_QC -i ${inputDir} -f anyname
Module Operation Sequence#
The module performs the following steps:
FastQC analysis on raw reads
Trimming and quality filtering using fastp
Removal of host (e.g., human) reads using Bowtie2 (optional)
FastQC analysis on filtered reads
MultiQC report generation
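To make the five steps above more concrete, the sketch below prints roughly equivalent stand-alone commands for one paired-end sample. It is a dry run only and does not claim to reproduce the commands metaFun runs internally: the function name `print_qc_steps`, the intermediate `fastp/` folder, the `_clean` file suffix, the index path, and the thread count are all assumptions made for illustration. Only the tool options themselves (`fastqc -t/-o`, `fastp -i/-I/-o/-O/-w`, `bowtie2 -p/-x/-1/-2/--un-conc-gz/-S`, `multiqc -o`) are standard.

```shell
# Hypothetical dry run (not part of metaFun): print, but do not execute,
# stand-alone commands that approximate the five steps for one sample.
print_qc_steps() {
    local sample="$1" in_dir="$2" out_dir="$3" index="$4" threads="${5:-4}"
    # Assumed input names, following the *_1/*_2 naming convention.
    local r1="$in_dir/${sample}_1.fastq.gz"
    local r2="$in_dir/${sample}_2.fastq.gz"
    # Assumed intermediate and final file names (illustrative only).
    local t1="$out_dir/fastp/${sample}_fastp_1.fastq.gz"
    local t2="$out_dir/fastp/${sample}_fastp_2.fastq.gz"
    local clean="$out_dir/read_filtered/${sample}_clean"

    # 1. FastQC on raw reads
    echo "fastqc -t $threads -o $out_dir/fastqc_raw $r1 $r2"
    # 2. Trimming and quality filtering with fastp
    echo "fastp -i $r1 -I $r2 -o $t1 -O $t2 -w $threads"
    # 3. Host removal: keep pairs that do not align concordantly to the host index
    #    (Bowtie2 replaces % with 1 and 2 in the output name)
    echo "bowtie2 -p $threads -x $index -1 $t1 -2 $t2 --un-conc-gz ${clean}_%.fastq.gz -S /dev/null"
    # 4. FastQC on filtered reads
    echo "fastqc -t $threads -o $out_dir/fastqc_filtered ${clean}_1.fastq.gz ${clean}_2.fastq.gz"
    # 5. MultiQC over everything written so far
    echo "multiqc $out_dir -o $out_dir/multiqc"
}
```

Calling `print_qc_steps sampleA /path/to/raw_reads results /path/to/host_index 8` prints one command line per step; the output folders would have to exist before any of those commands were actually run.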
Parameters#
${launchDir} is the directory from which you run metaFun; it is used as the base directory for output.
| Parameter | Description | Default Value | Note |
|---|---|---|---|
| -i (--inputDir) | Input directory containing raw reads | You must set the path of the directory containing the raw reads. | Required |
| -o | Output directory | ${launchDir}/results/metagenome/RAWREAD_QC | The default output directory is recommended for downstream analysis. |
| -f (--filter) | Host genome filtering option | human | Options: human, none, or the name of a custom indexed genome |
| | Number of CPUs to use | | |
Inputs and Outputs#
Inputs#
Raw paired-end metagenomic reads (FASTQ format, can be gzipped)
Specify the directory of input reads with -i ${inputDir}
File naming convention: *_1.fastq.gz / *_2.fastq.gz or *_1.fq.gz / *_2.fq.gz, or the same names without the .gz suffix.
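Because the module expects paired files that follow this naming convention, it can help to confirm that every forward-read file has a mate before launching a run. The helper below is a minimal sketch and is not part of metaFun; the function name `find_unpaired_reads` and its messages are hypothetical.

```shell
# Hypothetical helper (not part of metaFun): report forward-read files
# in a directory that have no matching reverse-read mate.
find_unpaired_reads() {
    local dir="$1" fwd rev status=0
    for fwd in "$dir"/*_1.fastq.gz "$dir"/*_1.fq.gz "$dir"/*_1.fastq "$dir"/*_1.fq; do
        [ -e "$fwd" ] || continue   # skip patterns that matched nothing
        # Swap the final "_1." for "_2." to build the expected mate name.
        rev="${fwd%_1.*}_2.${fwd##*_1.}"
        if [ ! -e "$rev" ]; then
            echo "missing mate for: $(basename "$fwd")"
            status=1
        fi
    done
    return "$status"
}
```

`find_unpaired_reads /path/to/raw_reads` prints one line per orphaned forward file and returns a non-zero status if any were found, so it can gate a run, for example with `find_unpaired_reads "$inputDir" && metafun -module RAWREAD_QC -i "$inputDir"`.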
Outputs#
Quality-controlled and filtered metagenomic reads
Quality assessment reports and visualizations
Output directory structure#
The output directory is under ${launchDir}, or at the path you specify with -o outdir.
The layout below assumes that you kept the recommended default output directory.
${launchDir} is the directory from which you run metaFun; it is used as the base directory for output.
Switching input and output directory
If you set a custom output directory with -o ${output directory}, you must adjust the corresponding parameters in the other modules as well.
The default output directory is results/metagenome/RAWREAD_QC in your ${launchDir}.
# Input samples are in ${inputDir}; FASTQ file names are ${sample_id}_{1,2}.fastq.gz or ${sample_id}_{1,2}.fq.gz, or the same names without the .gz suffix.
${launchDir}/results/metagenome/RAWREAD_QC/
├── fastqc_raw/ # FastQC results for raw reads
│   ├── ${sample_id}_1_fastqc.html
│   ├── ${sample_id}_1_fastqc.zip
│   ├── ${sample_id}_2_fastqc.html
│   ├── ${sample_id}_2_fastqc.zip
│   ├── ...
├── read_filtered/ # Host-filtered reads
│   ├── ${sample_id}_fastp_hg38_1.fastq.gz
│   └── ${sample_id}_fastp_hg38_2.fastq.gz
├── fastqc_filtered/ # FastQC results for filtered reads
│   ├── ${sample_id}_fastp_hg38_fastqc_filtered/
│   │   ├── ${sample_id}_fastp_hg38_1_fastqc.html
│   │   ├── ${sample_id}_fastp_hg38_1_fastqc.zip
│   │   ├── ${sample_id}_fastp_hg38_2_fastqc.html
│   │   ├── ${sample_id}_fastp_hg38_2_fastqc.zip
│   ├── ...
└── multiqc/ # MultiQC report
    ├── multiqc_report.html
    └── multiqc_data/
        └── ...
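Given the layout above, a quick way to see which samples finished host filtering is to list the sample IDs that have both mates in read_filtered/. This is a small sketch, not part of metaFun: the function name `list_filtered_samples` is hypothetical, and it assumes the `_fastp_hg38_{1,2}.fastq.gz` suffix shown in the tree, which may differ when a custom host genome or no filtering is used.

```shell
# Hypothetical helper (not part of metaFun): print the sample IDs that have
# both filtered mates in a read_filtered/ directory.
# Assumes the "_fastp_hg38_{1,2}.fastq.gz" suffix shown in the tree above.
list_filtered_samples() {
    local dir="$1" fwd name
    for fwd in "$dir"/*_fastp_hg38_1.fastq.gz; do
        [ -e "$fwd" ] || continue   # no filtered reads found
        if [ -e "${fwd%_1.fastq.gz}_2.fastq.gz" ]; then
            name="$(basename "$fwd")"
            # Strip the suffix to recover ${sample_id}.
            echo "${name%_fastp_hg38_1.fastq.gz}"
        fi
    done
}
```

For example, `list_filtered_samples results/metagenome/RAWREAD_QC/read_filtered` prints one sample ID per line, which can be compared against the list of input samples.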
Execution Examples and Results#
metaFun command line execution example#
MultiQC report example#
The MultiQC report integrates raw-read and filtered-read quality statistics across all samples.
Nextflow Processes in RAWREAD_QC Module#
| Process | InputDir | OutputDir | Note |
|---|---|---|---|
| fastqc_raw | | | FastQC analysis results for raw reads |
| fastp | | None | Performs trimming and quality filtering |
| humanread_filter | Output from fastp | | Removes host reads (when params.filter is not ‘none’) |
| fastqc_filtered | Output from humanread_filter | | FastQC analysis results for filtered reads |
| multiQC | Results from all previous processes | | Comprehensive report of all QC results |
Descriptions of Processes in RAWREAD_QC Module#
FastQC on Raw Reads: Performs FastQC analysis on the raw input metagenomic reads specified by --inputDir ${your input directory}.
fastp Processing: Trims and filters the metagenomic reads using fastp.
Host Read Filtering (optional): Removes host reads using Bowtie2. The default host genome is GRCh38_p12.dna.primary_assembly (-f human). If you do not want to filter out any host reads, specify -f none on the command line. To filter against your own genome of interest instead, first index it with (metafun) metafun -module PREPARE_CUSTOM_HOST -i $yourgenome -f ${value}, then pass the same name to RAWREAD_QC with -f.
FastQC on Filtered Reads: Performs FastQC analysis on the filtered reads.
MultiQC Report: Generates a MultiQC report combining all QC results.
Tools Used in RAWREAD_QC#
| Tool | Purpose | Version | Default parameters | Parameters that can be selected |
|---|---|---|---|---|
| FastQC | Quality control checks on raw sequence data | v0.12.1 | default | Only the number of CPUs can be selected |
| fastp | Trimming and filtering of raw metagenomic reads | 0.23.4 | default | Only the number of CPUs can be selected |
| Bowtie2 | Alignment of reads to remove host contamination | 2.5.2 | | null |
| MultiQC | Aggregate results from bioinformatics analyses | v1.18 | No specific parameters in this script | null |
Usage Notes#
A custom index can be generated with metafun -module PREPARE_CUSTOM_HOST -i yourgenome.fasta -f name, where -f name sets the index name.
The script checks that the input directory exists and is not empty before proceeding.
The output of this module is used as input for ASSEMBLY_BINNING, WMS_TAXONOMY, and WMS_FUNCTION. It is a mandatory input for WMS_TAXONOMY (Kraken2 with Bracken).
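The input directory check mentioned in these notes can be reproduced by hand before submitting a long job. The function below is a minimal sketch of such a check, assuming plain POSIX tools; it is not the code metaFun uses, and the name `check_input_dir`, the messages, and the return codes are hypothetical.

```shell
# Hypothetical pre-flight check (not metaFun's own code): verify that the
# input directory exists and contains at least one entry.
check_input_dir() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "error: input directory not found: $dir" >&2
        return 1
    fi
    # "ls -A" lists every entry except . and .. so empty output means empty dir.
    if [ -z "$(ls -A "$dir")" ]; then
        echo "error: input directory is empty: $dir" >&2
        return 2
    fi
    return 0
}
```

A typical use would be `check_input_dir "$inputDir" && metafun -module RAWREAD_QC -i "$inputDir"`, so that the module is only launched when the directory looks usable.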