RAWREAD_QC#

This module is part of the metaFun pipeline and performs quality control of raw reads from whole metagenome sequencing data.

Overview#

The RAWREAD_QC module is the first step in the metaFun pipeline. It performs quality assessment, trimming, and host read filtering to prepare high-quality reads for downstream analyses. Any host genome can be indexed with metaFun and used for filtering, or the host read filtering step can be skipped entirely.

Module Execution#

# Basic usage
(metafun) metafun -module RAWREAD_QC -i /path/to/raw_reads

# Skip host filtering
(metafun) metafun -module RAWREAD_QC -i /path/to/raw_reads --filter none

# Use custom host genome for filtering (prepare your genome first).
# Index your genome and run RAWREAD_QC module.
(metafun) metafun -module PREPARE_CUSTOM_HOST -i yourgenome.fasta -f custom_genome_name
(metafun) metafun -module RAWREAD_QC -i /path/to/raw_reads --filter custom_genome_name

Host read filtering options

There are three host read filtering options: the human genome, no filtering, or your own custom genome.

  • The default is the human genome. If your samples are human-derived, you do not need to specify a filter.

  • To skip host read filtering, specify -f none or --filter none.

Skip the host read filtering step and the subsequent FastQC step.#
# Use this command to skip host read filtering entirely.
 (metafun) metafun -module RAWREAD_QC -i ${inputDir} -f none

  • To filter reads against your own genome, index it first and then pass its name to the filter option.

Use your custom genome for filtering.#
# Index your genome under a name of your choice. Any name can be passed to `-f`, but the same name must be used when running the RAWREAD_QC module.
(metafun) metafun -module PREPARE_CUSTOM_HOST -i yourgenome.fasta -f anyname
# Pass the name used during indexing (here, anyname) as the filter.
(metafun) metafun -module RAWREAD_QC -i ${inputDir} -f anyname

Module Operation Sequence#

The module performs the following steps:

  1. FastQC analysis on raw reads

  2. Trimming and quality filtering using fastp

  3. Removal of host (e.g., human) reads using Bowtie2 (optional)

  4. FastQC analysis on filtered reads

  5. MultiQC report generation

Parameters#

${launchDir} is the directory from which you run metaFun; it is used as the base directory for output.

| Parameter | Description | Default Value | Note |
|---|---|---|---|
| -i, --inputDir | Input directory containing raw reads | None; you must set the path to the directory containing raw reads | Required |
| -o, --outdir | Output directory | ${launchDir}/results/metagenome/RAWREAD_QC | The default output directory is recommended for downstream analysis. |
| -f, --filter | Host genome filtering option | human | Options: human, none, or custom name |
| -p, --processors | Number of CPUs to use | 4 | |
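As an illustration of how the four parameters combine, the sketch below assembles one RAWREAD_QC call and prints it for review rather than running it. The input path and the thread count are placeholders, and it assumes `${launchDir}` is the current working directory at launch time.

```shell
# Placeholder values; replace them with your own.
input_dir="/path/to/raw_reads"
# Assumption: ${launchDir} equals the directory the command is started from.
output_dir="${PWD}/results/metagenome/RAWREAD_QC"
threads=8

# Skip host filtering (-f none) and use 8 CPUs instead of the default 4.
rawread_qc_cmd="metafun -module RAWREAD_QC -i ${input_dir} -o ${output_dir} -f none -p ${threads}"

# Print the command so it can be checked before it is run inside the metafun environment.
echo "${rawread_qc_cmd}"
```

Passing `-o` with the default path is redundant and is shown only to make the default explicit.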

Inputs and Outputs#

Inputs#

  • Raw paired-end metagenomic reads (FASTQ format, can be gzipped)

  • Specify the directory containing the input reads with -i ${inputDir}

  • File naming convention: *_1.fastq.gz/*_2.fastq.gz or *_1.fq.gz/*_2.fq.gz, or the same names without the .gz suffix.
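Because the module expects paired files that follow this naming convention, it can help to confirm that every forward file has a mate before starting a run. The function below is a minimal sketch of such a check; `check_pairs` is a hypothetical helper written for this page and is not part of metaFun.

```shell
# check_pairs DIR
# Reports FASTQ files in DIR whose mate (_1 <-> _2) is missing.
# Returns 0 when all files are paired, 1 otherwise.
check_pairs() {
    dir=$1
    status=0
    for ext in fastq.gz fq.gz fastq fq; do
        # Every forward file needs a reverse mate with the same sample ID.
        for fwd in "$dir"/*_1."$ext"; do
            [ -e "$fwd" ] || continue          # pattern matched nothing
            rev="${fwd%_1.$ext}_2.$ext"
            if [ ! -e "$rev" ]; then
                echo "missing mate for ${fwd##*/}: expected ${rev##*/}" >&2
                status=1
            fi
        done
        # Every reverse file needs a forward mate as well.
        for rev in "$dir"/*_2."$ext"; do
            [ -e "$rev" ] || continue
            fwd="${rev%_2.$ext}_1.$ext"
            if [ ! -e "$fwd" ]; then
                echo "missing mate for ${rev##*/}: expected ${fwd##*/}" >&2
                status=1
            fi
        done
    done
    return "$status"
}

# Usage: check_pairs /path/to/raw_reads
```

The check only looks at file names; it does not verify that the two files contain the same number of reads.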

Outputs#

  • Quality-controlled and filtered metagenomic reads

  • Quality assessment reports and visualizations

Output directory structure#

By default, output is written to ${launchDir}/results/metagenome/RAWREAD_QC; you can choose another location with -o ${outdir}.

The structure shown below assumes the recommended default output directory. ${launchDir} is the directory from which you run metaFun, and it serves as the base directory for output.

Switching input and output directory.

If you set a custom output directory with -o ${outdir}, you must also adjust the corresponding parameters in the other modules. The default output directory is results/metagenome/RAWREAD_QC in your ${launchDir}.

Output directory structure#
 # Input samples are in ${inputDir}; FASTQ file names follow ${sample_id}_{1,2}.fastq.gz or ${sample_id}_{1,2}.fq.gz, with or without the .gz suffix.

${launchDir}/results/metagenome/RAWREAD_QC/
├── fastqc_raw/                      # FastQC results for raw reads
│   ├── ${sample_id}_1_fastqc.html
│   ├── ${sample_id}_1_fastqc.zip
│   ├── ${sample_id}_2_fastqc.html
│   ├── ${sample_id}_2_fastqc.zip
│   └── ...
├── read_filtered/                   # Host-filtered reads
│   ├── ${sample_id}_fastp_hg38_1.fastq.gz
│   └── ${sample_id}_fastp_hg38_2.fastq.gz
├── fastqc_filtered/                 # FastQC results for filtered reads
│   ├── ${sample_id}_fastp_hg38_fastqc_filtered/
│   │   ├── ${sample_id}_fastp_hg38_1_fastqc.html
│   │   ├── ${sample_id}_fastp_hg38_1_fastqc.zip
│   │   ├── ${sample_id}_fastp_hg38_2_fastqc.html
│   │   └── ${sample_id}_fastp_hg38_2_fastqc.zip
│   └── ...
└── multiqc/                         # MultiQC report
    ├── multiqc_report.html
    └── multiqc_data/
        └── ...
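
After a run, the layout above can be used to spot samples whose outputs are incomplete. The sketch below lists the expected files that are absent for one sample. `list_missing_outputs` is a hypothetical helper, and the `_fastp_hg38_` file names are taken from the tree above, so they are assumed to apply to the default human filter only.

```shell
# list_missing_outputs OUTDIR SAMPLE_ID
# Prints the relative path of every expected output file that does not exist.
# Prints nothing when all checked files are present.
list_missing_outputs() {
    outdir=$1
    sample=$2
    for rel in \
        "fastqc_raw/${sample}_1_fastqc.html" \
        "fastqc_raw/${sample}_2_fastqc.html" \
        "read_filtered/${sample}_fastp_hg38_1.fastq.gz" \
        "read_filtered/${sample}_fastp_hg38_2.fastq.gz" \
        "multiqc/multiqc_report.html"; do
        [ -e "${outdir}/${rel}" ] || echo "${rel}"
    done
}

# Usage: list_missing_outputs results/metagenome/RAWREAD_QC sampleA
```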

Execution Examples and Results#

metaFun command line execution example#

[Image: metaFun command line execution example]

MultiQC report example#

MultiQC report integrates raw read and filtered read quality statistics across all samples.

[Image: example of an HTML MultiQC report]

Nextflow Processes in RAWREAD_QC Module#

| Process | InputDir | OutputDir | Note |
|---|---|---|---|
| fastqc_raw | ${params.inputDir} | ${params.outdir}/fastqc_raw | FastQC analysis results for raw reads |
| fastp | ${params.inputDir} | None | Performs trimming and quality filtering |
| humanread_filter | Output from fastp | ${params.outdir}/read_filtered | Removes host reads (when params.filter is not ‘none’) |
| fastqc_filtered | Output from humanread_filter | ${params.outdir}/fastqc_filtered | FastQC analysis results for filtered reads |
| multiQC | Results from all previous processes | ${params.outdir}/multiqc | Comprehensive report of all QC results |

Descriptions of Processes in RAWREAD_QC Module#

  1. FastQC on Raw Reads: Performs FastQC analysis on the raw input metagenomic reads in the directory specified by --inputDir ${inputDir}.

  2. fastp Processing: Trims and filters the metagenomic reads using fastp.

  3. Host Read Filtering (optional): Removes host reads using Bowtie2. The default host genome is GRCh38_p12.dna.primary_assembly (-f human). If you do not want to filter out any host reads, specify -f none on the command line. To filter against a genome of your own, first index it with (metafun) metafun -module PREPARE_CUSTOM_HOST -i $yourgenome -f ${value}, then pass the same ${value} to -f when running RAWREAD_QC.

  4. FastQC on Filtered Reads: Performs FastQC analysis on the filtered reads.

  5. MultiQC Report: Generates a MultiQC report combining all QC results.
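
For orientation, the five steps above correspond roughly to the standalone tool commands printed by the sketch below. This is not the module's actual Nextflow code: `print_qc_commands` is a hypothetical helper, the `tmp/` directory for fastp output and the Bowtie2 index path are placeholders, and the Bowtie2 options are the ones listed for this module (`--very-sensitive`, end-to-end mode, `--un-conc-gz`). The function only prints commands; it does not run any tool.

```shell
# print_qc_commands SAMPLE IN_DIR OUT_DIR BT2_INDEX THREADS
# Prints one plausible command per step for a single paired-end sample.
print_qc_commands() {
    s=$1; indir=$2; out=$3; idx=$4; t=$5
    r1="${indir}/${s}_1.fastq.gz"
    r2="${indir}/${s}_2.fastq.gz"
    # Hypothetical location for the trimmed reads produced by fastp.
    t1="${out}/tmp/${s}_fastp_1.fastq.gz"
    t2="${out}/tmp/${s}_fastp_2.fastq.gz"
    # Host-filtered reads; Bowtie2 replaces % with 1 or 2 for each mate.
    filt="${out}/read_filtered/${s}_fastp_hg38"

    # 1. FastQC on raw reads
    echo "fastqc -t ${t} -o ${out}/fastqc_raw ${r1} ${r2}"
    # 2. Trimming and quality filtering with fastp defaults
    echo "fastp -w ${t} -i ${r1} -I ${r2} -o ${t1} -O ${t2}"
    # 3. Keep only read pairs that do not align concordantly to the host
    echo "bowtie2 -p ${t} --very-sensitive --end-to-end -x ${idx} -1 ${t1} -2 ${t2} --un-conc-gz ${filt}_%.fastq.gz -S /dev/null"
    # 4. FastQC on filtered reads
    echo "fastqc -t ${t} -o ${out}/fastqc_filtered ${filt}_1.fastq.gz ${filt}_2.fastq.gz"
    # 5. MultiQC over all results
    echo "multiqc -o ${out}/multiqc ${out}"
}

# Usage: print_qc_commands sampleA /path/to/raw_reads results/metagenome/RAWREAD_QC /path/to/host_index 4
```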

Tools Used in RAWREAD_QC#

| Tool | Purpose | Version | Default parameters | Parameters that can be selected |
|---|---|---|---|---|
| FastQC | Quality control checks on raw sequence data | v0.12.1 | default | Only the number of CPUs, via --cpus ${number} |
| fastp | Trimming and filtering of raw metagenomic reads | 0.23.4 | default | Only the number of CPUs, via --cpus ${number} |
| Bowtie2 | Alignment of reads to remove host contamination | 2.5.2 | --very-sensitive (sensitivity preset), --un-conc-gz (writes gzipped metagenomic reads that did not align to the host genome), end-to-end mode | None |
| MultiQC | Aggregate results from bioinformatics analyses | v1.18 | No specific parameters in this script | None |

Usage Notes#

  • A custom index can be generated with metafun -module PREPARE_CUSTOM_HOST -i yourgenome.fasta -f name, where name is the label you later pass to -f in RAWREAD_QC.

  • The script checks that the input directory exists and is not empty before proceeding.

  • The output of this module is the input for ASSEMBLY_BINNING, WMS_TAXONOMY, and WMS_FUNCTION. It is a mandatory input for the Kraken2 with Bracken analysis in WMS_TAXONOMY.
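
The input directory check mentioned in the notes above can also be run by hand before launching the module. The function below is a minimal sketch of such a check, not the code metaFun itself uses; `require_nonempty_dir` is a hypothetical name.

```shell
# require_nonempty_dir DIR
# Succeeds only if DIR exists, is a directory, and contains at least one entry.
require_nonempty_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "error: input directory not found: $dir" >&2
        return 1
    fi
    # ls -A lists every entry except . and .., so empty output means an empty directory.
    if [ -z "$(ls -A "$dir")" ]; then
        echo "error: input directory is empty: $dir" >&2
        return 1
    fi
    return 0
}

# Usage: require_nonempty_dir /path/to/raw_reads && echo "input directory looks usable"
```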