(RAWREAD_QC_description)=

# <span style="color:#FF0000">RAWREAD_QC</span>

<img src="../_static/metafun1_red.png" style="height:200px; width:auto; float:right; margin-left:10px;" />
This module is part of the metaFun pipeline and performs quality control of raw reads from whole-metagenome sequencing data.

## Overview
The RAWREAD_QC module is the first step in the metaFun pipeline. It performs quality assessment, trimming, and host read filtering to prepare high-quality reads for downstream analyses. By default, reads are filtered against the human genome; you can instead index any host genome with metaFun and filter against it, or skip host read filtering entirely.


## Module Execution 

```{code-block} bash
# Basic usage
(metafun) metafun -module RAWREAD_QC -i /path/to/raw_reads

# Skip host filtering
(metafun) metafun -module RAWREAD_QC -i /path/to/raw_reads --filter none

# Use custom host genome for filtering (prepare your genome first).
# Index your genome and run RAWREAD_QC module.
(metafun) metafun -module PREPARE_CUSTOM_HOST -i yourgenome.fasta -f custom_genome_name
(metafun) metafun -module RAWREAD_QC -i /path/to/raw_reads --filter custom_genome_name
```



:::{admonition} Host read filtering options
:class: note

There are three host-filtering options: `human` (the default), `none` (skip filtering), and the name of a custom genome that you have indexed.
- The default is the human genome. If your samples are human-derived, you do not need to specify `--filter`.
- To skip host read filtering, specify `-f none` or `--filter none`.
```{code-block} bash
:caption: Skip the host read filtering step and the subsequent FastQC step.

# Skip host read filtering entirely.
(metafun) metafun -module RAWREAD_QC -i ${inputDir} -f none
```

- To filter reads against your own genome, index it first and then pass its name to `-f`.
```{code-block} bash
:caption: Use a custom genome for host filtering.

# Index your genome under a name of your choice (here `anyname`).
# Use the same name with `-f` when running RAWREAD_QC.
(metafun) metafun -module PREPARE_CUSTOM_HOST -i yourgenome.fasta -f anyname
# Pass the name used during indexing (anyname) as the filter.
(metafun) metafun -module RAWREAD_QC -i ${inputDir} -f anyname
```
:::

## Module Operation Sequence
The module performs the following steps:

1. FastQC analysis on raw reads
2. Trimming and quality filtering using fastp
3. Removal of host (e.g., human) reads using Bowtie2 (optional)
4. FastQC analysis on filtered reads 
5. MultiQC report generation

## Parameters
**`${launchDir}` is the directory from which you run metaFun; it is used as the base output directory.**

| Parameter | Description | Default Value | Note |
|-----------|-------------|---------------|------|
| `-i, --inputDir` | Input directory containing raw reads | **None; you must set the path to the directory containing raw reads.** | **Required** |
| `-o, --outdir` | Output directory | `${launchDir}/results/metagenome/RAWREAD_QC` | The default output directory is recommended for downstream analysis. |
| `-f, --filter` | Host genome filtering option | `human` | Options: `human`, `none`, or custom name |
| `-p, --processors` | Number of CPUs to use | `4` | |
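
As a minimal sketch of how the parameters in the table combine, the helper below assembles a RAWREAD_QC command line and prints it for review. The function name `rawread_qc_cmd` and the example paths are hypothetical and not part of metaFun; only the flags listed in the table are used.

```bash
# Illustrative helper (not part of metaFun): build a RAWREAD_QC command
# from the parameters in the table and print it instead of running it.
# Paths containing spaces are not quoted in the printed command.
rawread_qc_cmd() {
    local input_dir="$1"        # -i: required
    local filter="${2:-human}"  # -f: human, none, or a custom name
    local cpus="${3:-4}"        # -p: number of CPUs
    local outdir="${4:-}"       # -o: optional; the default is recommended

    if [ -z "$input_dir" ]; then
        echo "usage: rawread_qc_cmd INPUT_DIR [FILTER] [CPUS] [OUTDIR]" >&2
        return 1
    fi

    local cmd="metafun -module RAWREAD_QC -i $input_dir -f $filter -p $cpus"
    if [ -n "$outdir" ]; then
        cmd="$cmd -o $outdir"
    fi
    printf '%s\n' "$cmd"
}
```

For example, `rawread_qc_cmd /path/to/raw_reads none 16` prints a command that skips host filtering and uses 16 CPUs; leaving the fourth argument empty keeps the default output directory.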


## **Inputs and Outputs**

### Inputs
* Raw paired-end metagenomic reads (FASTQ format, can be gzipped)
* Specify the input directory with `-i ${inputDir}`
* File naming convention: `*_1.fastq.gz`/`*_2.fastq.gz` or `*_1.fq.gz`/`*_2.fq.gz`; uncompressed files with the same names minus the `.gz` suffix are also accepted.
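
Before launching the module, it can help to confirm that every sample has both mates under the naming convention above. The following is a small sketch, assuming all reads sit directly in one directory; the function name `list_paired_samples` is hypothetical and not a metaFun command.

```bash
# Illustrative pre-check (not part of metaFun): print the sample IDs that
# have both *_1 and *_2 files, and report unpaired *_1 files on stderr.
list_paired_samples() {
    local dir="$1" ext r1 r2 sample
    for ext in fastq.gz fq.gz fastq fq; do
        for r1 in "$dir"/*_1."$ext"; do
            [ -e "$r1" ] || continue           # glob matched nothing
            r2="${r1%_1.$ext}_2.$ext"          # expected mate file
            sample="$(basename "${r1%_1.$ext}")"
            if [ -e "$r2" ]; then
                printf '%s\n' "$sample"
            else
                printf 'missing mate for %s\n' "$r1" >&2
            fi
        done
    done
}
```

Calling `list_paired_samples "${inputDir}"` lists one sample ID per line, so the count of lines should match the number of samples you expect to process.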

(RAWREAD_QC_output)=

### Outputs
* Quality-controlled and filtered metagenomic reads
* Quality assessment reports and visualizations

### Output directory structure

By default, output is written to `${launchDir}/results/metagenome/RAWREAD_QC`; to write elsewhere, specify a path with `-o`.

The structure shown below assumes the default output directory, as recommended.

**`${launchDir}` is the directory from which you run metaFun; it is used as the base output directory.**

```{admonition} Changing input and output directories
:class: note

If you set a custom output directory with `-o ${outdir}`, you must adjust the corresponding parameters in the other modules as well.
The default output directory is `${launchDir}/results/metagenome/RAWREAD_QC`.
```



```{code-block} bash
:caption: Output directory structure
# Input samples are in ${inputDir}; FASTQ files are named ${sample_id}_{1,2}.fastq.gz or ${sample_id}_{1,2}.fq.gz (the .gz suffix is optional).

${launchDir}/results/metagenome/RAWREAD_QC/
├── fastqc_raw/                      # FastQC results for raw reads
│   ├── ${sample_id}_1_fastqc.html
│   ├── ${sample_id}_1_fastqc.zip
│   ├── ${sample_id}_2_fastqc.html
│   ├── ${sample_id}_2_fastqc.zip
│   └── ...
├── read_filtered/                   # Host-filtered reads
│   ├── ${sample_id}_fastp_hg38_1.fastq.gz
│   └── ${sample_id}_fastp_hg38_2.fastq.gz
├── fastqc_filtered/                 # FastQC results for filtered reads
│   ├── ${sample_id}_fastp_hg38_fastqc_filtered/
│   │   ├── ${sample_id}_fastp_hg38_1_fastqc.html
│   │   ├── ${sample_id}_fastp_hg38_1_fastqc.zip
│   │   ├── ${sample_id}_fastp_hg38_2_fastqc.html
│   │   └── ${sample_id}_fastp_hg38_2_fastqc.zip
│   └── ...
└── multiqc/                         # MultiQC report
    ├── multiqc_report.html
    └── multiqc_data/
        └── ...
```
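
The layout above can be checked after a run with a short script. This is a sketch only: the function name `check_rawread_qc_outputs` is hypothetical, and the `_fastp_hg38_` file names are taken from the tree shown above, so they are assumed to apply to the default `human` filter.

```bash
# Illustrative post-run check (not part of metaFun): verify that the key
# files from the directory tree exist and are non-empty for one sample.
check_rawread_qc_outputs() {
    local outdir="$1" sample="$2" missing=0 f
    for f in \
        "fastqc_raw/${sample}_1_fastqc.html" \
        "fastqc_raw/${sample}_2_fastqc.html" \
        "read_filtered/${sample}_fastp_hg38_1.fastq.gz" \
        "read_filtered/${sample}_fastp_hg38_2.fastq.gz" \
        "multiqc/multiqc_report.html"
    do
        if [ ! -s "$outdir/$f" ]; then
            printf 'missing or empty: %s\n' "$outdir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A call such as `check_rawread_qc_outputs results/metagenome/RAWREAD_QC sampleA` returns a non-zero status if any listed file is absent, which makes it usable as a gate before starting downstream modules.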




## Execution Examples and Results

### metaFun command line execution example

```{figure} ../images/RAWREAD_QC_command.png
---
width: 100%
figclass: margin-caption
alt: metafun_pipeline
name: RAWREAD_QC_command
align: middle
---
```

### MultiQC report example

The MultiQC report integrates raw-read and filtered-read quality statistics across all samples.

```{figure} ../images/multiqc.png
---
width: 100%
figclass: margin-caption
alt: metafun_pipeline
name: multiqc
align: middle
---
```

[Example MultiQC HTML report](/_static/multiqc_report.html)


## Nextflow Processes in <span style="color:#FF0000">RAWREAD_QC</span> Module 

| Process | InputDir | OutputDir | Note |
|---------|----------|-----------|------|
| fastqc_raw | `${params.inputDir}` | `${params.outdir}/fastqc_raw`  | FastQC analysis results for raw reads |
| fastp | `${params.inputDir}` | None | Performs trimming and quality filtering |
| humanread_filter | Output from fastp | `${params.outdir}/read_filtered` | Removes host reads (when params.filter is not 'none') |
| fastqc_filtered | Output from humanread_filter | `${params.outdir}/fastqc_filtered` | FastQC analysis results for filtered reads |
| multiQC | Results from all previous processes | `${params.outdir}/multiqc` | Comprehensive report of all QC results |



## Descriptions of Processes in <span style="color:#FF0000">RAWREAD_QC</span> Module 

1. **FastQC on Raw Reads**: Performs FastQC analysis on the raw input metagenomic reads specified by `--inputDir ${your input directory}`.
2. **fastp Processing**: Trims and filters the metagenomic reads using fastp.
3. **Host Read Filtering** (optional): Removes host reads using Bowtie2. The default host genome is GRCh38_p12.dna.primary_assembly (`-f human`). If you **do not want to filter out any host reads**, specify `-f none` on the command line. If you want to **filter against your own genome of interest**, first index it with `(metafun) metafun -module PREPARE_CUSTOM_HOST -i $yourgenome -f ${value}`, then pass the same `${value}` to `-f` when running RAWREAD_QC.
4. **FastQC on Filtered Reads**: Performs FastQC analysis on the filtered reads.
5. **MultiQC Report**: Generates a MultiQC report combining all QC results.
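
To make the five steps above concrete, here is a rough sketch of equivalent standalone tool calls for a single paired-end sample with host filtering enabled. It is not the module's Nextflow code: the function names, the intermediate `fastp/` directory, the index path argument, and the `DRY_RUN` switch are all assumptions for illustration, and the module's actual options may differ.

```bash
# Illustrative only (not the module's implementation).
# With DRY_RUN=1 each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

qc_one_sample() {
    local sample="$1" in_dir="$2" out_dir="$3" index="$4" cpus="${5:-4}"
    local r1="$in_dir/${sample}_1.fastq.gz"
    local r2="$in_dir/${sample}_2.fastq.gz"
    local t1="$out_dir/fastp/${sample}_fastp_1.fastq.gz"
    local t2="$out_dir/fastp/${sample}_fastp_2.fastq.gz"
    local filtered="$out_dir/read_filtered/${sample}_fastp_hg38"

    run mkdir -p "$out_dir/fastqc_raw" "$out_dir/fastp" \
        "$out_dir/read_filtered" "$out_dir/fastqc_filtered" "$out_dir/multiqc"

    # 1. FastQC on raw reads
    run fastqc -t "$cpus" -o "$out_dir/fastqc_raw" "$r1" "$r2"

    # 2. Trimming and quality filtering with fastp defaults
    run fastp -w "$cpus" -i "$r1" -I "$r2" -o "$t1" -O "$t2"

    # 3. Host removal: keep pairs that do not align concordantly to the host.
    #    Bowtie2 replaces % in the --un-conc-gz path with 1 and 2.
    run bowtie2 -p "$cpus" --very-sensitive --end-to-end -x "$index" \
        -1 "$t1" -2 "$t2" --un-conc-gz "${filtered}_%.fastq.gz" -S /dev/null

    # 4. FastQC on filtered reads
    run fastqc -t "$cpus" -o "$out_dir/fastqc_filtered" \
        "${filtered}_1.fastq.gz" "${filtered}_2.fastq.gz"

    # 5. MultiQC over all QC results
    run multiqc "$out_dir" -o "$out_dir/multiqc"
}
```

Setting `DRY_RUN=1` and calling `qc_one_sample sampleA /path/to/raw_reads out /path/to/host_index 8` prints the six commands without running them, which is a convenient way to read the flow end to end.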

## Tools Used in <span style="color:#FF0000">RAWREAD_QC</span>


| Tool | Purpose | Version | Default parameters | Selectable parameters |
|------|---------|------------|------------|------------|
| FastQC | Quality control checks on raw sequence data | v0.12.1 | default | CPU count only, via `--cpus ${number}` |
| fastp | Trimming and filtering of raw metagenomic reads | 0.23.4 | default | CPU count only, via `--cpus ${number}` |
| Bowtie2 | Alignment of reads to remove host contamination | 2.5.2 | `--very-sensitive`: sensitivity preset; `--un-conc-gz`: writes gzipped metagenomic reads that did not align to the host genome; `--end-to-end`: end-to-end alignment mode | None |
| MultiQC | Aggregate results from bioinformatics analyses | v1.18 | No specific parameters in this script | None |

## Usage Notes

- A custom index can be generated with `metafun -module PREPARE_CUSTOM_HOST -i yourgenome.fasta -f name`, where `name` is the label you later pass to `-f` in RAWREAD_QC.
- The script checks for the existence and non-emptiness of the input directory before proceeding.
- The output of this module is used as input data for **[<span style="color:#FF9300">ASSEMBLY_BINNING</span>](ASSEMBLY_BINNING_description)**, **[<span style="color:#0846FA">WMS_TAXONOMY</span>](WMS_TAXONOMY_description)** and **[<span style="color:#7030A0">WMS_FUNCTION</span>](WMS_FUNCTION_description)**.
- The output of this module is a **mandatory input** for <span style="color:#0846FA">WMS_TAXONOMY</span> when running Kraken2 with Bracken.
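
The input-directory check mentioned in the notes above can be reproduced by hand before a long run. The sketch below shows one way to test for existence and non-emptiness; the function name `check_input_dir` is hypothetical and this is not the module's own code.

```bash
# Illustrative pre-check (not the module's code): fail if the input
# directory does not exist or contains no entries.
check_input_dir() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        printf 'input directory not found: %s\n' "$dir" >&2
        return 1
    fi
    if [ -z "$(ls -A "$dir")" ]; then
        printf 'input directory is empty: %s\n' "$dir" >&2
        return 1
    fi
}
```

For instance, `check_input_dir "${inputDir}" && echo "ready"` prints `ready` only when the directory exists and holds at least one file.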
