(INTERACTIVE_STRAIN_description)=

# <span style="color:#2FA4E7">INTERACTIVE_STRAIN</span>

<img src="../_static/metafun6_ocean.png" style="height:200px; width:auto; float:right; margin-left:10px;" />
This module is part of the metaFun pipeline. It provides an interactive web interface for exploring and analyzing strain-level diversity results generated by InStrain.

## Overview
The INTERACTIVE_STRAIN module offers a dynamic, web-based platform for visualizing and analyzing strain-level metagenomic data. It allows researchers to interactively explore nucleotide diversity patterns, analyze selection pressure through pN/pS ratios, identify shared strains across samples, and correlate diversity metrics with metadata variables. The module leverages preprocessed InStrain output, enabling advanced statistical analyses and customizable visualizations without requiring programming knowledge.

**This module provides an interactive alternative to command-line analysis**, allowing users to explore different metrics and comparisons dynamically rather than specifying them in advance.

## Module Execution

```{code-block} bash
# Basic usage with WMS_STRAIN results
(metafun) metafun -module INTERACTIVE_STRAIN -i results/metagenome/WMS_STRAIN

# Specify custom port for the web interface
(metafun) metafun -module INTERACTIVE_STRAIN -i results/metagenome/WMS_STRAIN -p 8080

# Add additional metadata file
(metafun) metafun -module INTERACTIVE_STRAIN -i results/metagenome/WMS_STRAIN -m updated_metadata.csv
```

## Module Operation Sequence

This module performs the following steps:

1. **Loading strain-level data** from WMS_STRAIN results:
   - Reading InStrain profile outputs from all samples
   - Integrating nucleotide diversity and pN/pS data
   - Loading GTDB taxonomic metadata for genome annotation
   - Combining with sample metadata

2. **Launching an interactive web server** with multiple analytical modules:
   - Shared Taxa Analysis for strain distribution visualization
   - Nucleotide Diversity Explorer for microdiversity analysis
   - pN/pS Analysis for selection pressure investigation
   - Diversity-pN/pS Correlation for integrated analysis

3. **Enabling on-demand analysis** through interactive components:
   - Dynamic filtering by genome, sample, or metadata
   - Statistical testing across conditions
   - Customizable visualization options
   - Data export for downstream applications

## Parameters

| Parameter | Description | Default Value | Note |
|-----------|-------------|---------------|------|
| `-i, --input` | Input directory with WMS_STRAIN results | Required | Path to WMS_STRAIN output containing InStrain profiles |
| `-m, --metadata` | Additional metadata file | Optional | CSV/TSV file with updated or additional sample metadata |
| `-g, --gtdb` | GTDB metadata file | Auto-detected | TSV file with genome taxonomic information |
| `-p, --port` | Port number for the web interface | `8050` | Adjust if the default port is already in use |
| `--cpus` | Number of CPU cores to use | `4` | For computationally intensive operations |

## Inputs and Outputs

### Inputs
* Results directory from a completed <span style="color:#2FA4E7">WMS_STRAIN</span> run, containing InStrain profile outputs
* Optional additional metadata file in CSV/TSV format for enhanced analysis
* Optional GTDB metadata file for taxonomic annotation

### Outputs
* Exported visualizations in various formats (PNG, PDF, SVG)
* Statistical analysis results (CSV, TSV)
* Filtered data tables
* Publication-ready figures

### Output directory structure

The generated files are saved in a timestamped output directory:

```{code-block} bash
:caption: Output directory structure

${launchDir}/results/interactive_strain/YYYYMMDDHHMMSS/
├── exported_figures/                     # Exported visualizations
│   ├── shared_taxa_venn_[timestamp].pdf       # Venn diagrams
│   ├── shared_taxa_upset_[timestamp].pdf      # UpSet plots
│   ├── nucleotide_diversity_[timestamp].pdf   # Diversity boxplots/heatmaps
│   ├── pnps_analysis_[timestamp].pdf          # pN/pS distribution plots
│   └── correlation_plot_[timestamp].pdf       # Diversity vs pN/pS correlations
├── statistical_results/                  # Results from statistical tests
│   ├── diversity_stats_[timestamp].csv        # Diversity statistics
│   ├── pnps_comparison_[timestamp].csv        # pN/pS comparison results
│   └── correlation_results_[timestamp].csv    # Correlation analysis
└── exported_data/                        # Exported data tables
    ├── shared_taxa_matrix_[timestamp].csv     # Taxa sharing matrix
    ├── diversity_data_[timestamp].csv         # Nucleotide diversity values
    └── pnps_data_[timestamp].csv              # pN/pS ratio data
```

## Interface Components

The web interface is divided into multiple tabs, each providing specialized tools for different types of strain-level analysis.

### Main Interface Structure

![Main Interface](../images/00STRAIN_interface.png)

The main interface consists of a sidebar panel for parameter configuration, navigation tabs for accessing different analysis modules (Shared Taxa Analysis, Nucleotide Diversity, pN/pS Analysis, Diversity vs pN/pS), and a main content area displaying analysis results.

---

### Sidebar Panel

![Sidebar Panel](../images/01STRAIN_sidebar.png)

The sidebar panel provides controls for loading WMS_STRAIN results. Users specify the path to the InStrain output directory, and the application automatically loads all profile data, including sample metadata from the `integrated_microbiome_data.rds` file. Optional inputs are a GTDB metadata file for taxonomic annotation and an additional sample metadata file.

:::{admonition} Understanding popANI and conANI
:class: note

InStrain calculates two ANI metrics for strain comparison:
- **popANI** (population-level ANI): Considers both major and minor alleles, accounting for within-sample microdiversity
- **conANI** (consensus ANI): Compares only consensus (major) alleles between samples

The default threshold of 99.999% popANI is recommended for identifying shared strains. Users can adjust the ANI thresholds in the Shared Taxa Analysis module based on their research needs. For more details, see the [InStrain documentation](https://instrain.readthedocs.io/en/latest/important_concepts.html).
:::
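
As a rough illustration of how a popANI threshold separates shared from non-shared strains, the sketch below filters pairwise comparison records. The field names (`genome`, `name1`, `name2`, `popANI`) and the representation of popANI as a fraction (99.999% written as 0.99999) are assumptions made for this example, and the records are made-up toy values. This is not the module's own code.

```python
def shared_strain_pairs(comparisons, min_pop_ani=0.99999):
    """Return (genome, sample1, sample2) for pairs at or above the popANI threshold.

    `comparisons` is an iterable of dict-like records; the key names used
    here are assumptions for illustration.
    """
    shared = []
    for row in comparisons:
        if float(row["popANI"]) >= min_pop_ani:
            shared.append((row["genome"], row["name1"], row["name2"]))
    return shared


# Toy records with made-up values, only to show the filtering step.
toy = [
    {"genome": "genomeA", "name1": "S1", "name2": "S2", "popANI": 0.999995},
    {"genome": "genomeA", "name1": "S1", "name2": "S3", "popANI": 0.99981},
    {"genome": "genomeB", "name1": "S2", "name2": "S3", "popANI": 1.0},
]
print(shared_strain_pairs(toy))
```

Lowering `min_pop_ani` admits more pairs as "shared", which is why the threshold choice should be reported alongside any sharing result.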

---

:::{admonition} Tab-specific analysis design
:class: note

All analyses in this interface are designed to apply only within their respective tabs. Each analytical module operates independently, meaning that:
- Settings changed in one tab do not affect the analyses in other tabs
- Each tab maintains its own state and configuration
- Results generated in one tab are specific to that tab's analysis
- This modular design allows you to run different analyses with different parameters simultaneously without interference
:::

### Shared Taxa Analysis

This module identifies and visualizes strains/genomes shared across samples.

![Shared Taxa Count](../images/02STRAIN_Sharing_count.png)

The shared taxa count view displays Venn diagrams and UpSet plots showing overlapping taxa between sample groups. It visualizes the number of shared and unique genomes across conditions, with intersection sizes for all combinations of groups.
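
The intersection sizes shown in an UpSet plot can be thought of as counts of genomes per exact combination of groups. The following minimal sketch computes those counts from a mapping of group name to detected genomes. The function name, group names, and genome IDs are hypothetical, and the module's actual implementation may differ.

```python
def exclusive_intersections(groups):
    """Count genomes by the exact set of groups in which they were detected.

    `groups` maps a group name to the set of genome IDs detected in it.
    Returns a dict keyed by frozenset of group names.
    """
    counts = {}
    all_genomes = set().union(*groups.values())
    for genome in all_genomes:
        members = frozenset(
            name for name, genomes in groups.items() if genome in genomes
        )
        counts[members] = counts.get(members, 0) + 1
    return counts


# Hypothetical groups and genome IDs.
groups = {
    "control": {"g1", "g2", "g3"},
    "treated": {"g2", "g3", "g4"},
    "recovery": {"g3", "g5"},
}
for combo, n in sorted(exclusive_intersections(groups).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(combo), n)
```

Each genome is counted exactly once, so the counts sum to the total number of distinct genomes.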

---

![Shared Taxa by Group](../images/03STRAIN_sharing_group.png)

The group-based sharing analysis enables comparison of strain distribution patterns across metadata categories. It shows how genomes are distributed among different experimental conditions with statistical summaries.

---

![Shared Taxa Heatmap](../images/04STRAIN_sharing_heatmap.png)

The presence/absence heatmap provides a matrix visualization of genome detection across all samples. Hierarchical clustering reveals sample grouping patterns, with color-coding indicating detection status and options for prevalence filtering.
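
To make the prevalence filter concrete, here is a small sketch that builds a presence/absence matrix and drops genomes detected in too few samples. It assumes prevalence means the fraction of samples in which a genome is detected. The function name and toy data are hypothetical, and hierarchical clustering is left out.

```python
def presence_matrix(detections, min_prevalence=0.0):
    """Build a genome-by-sample 0/1 matrix with an optional prevalence filter.

    `detections` maps a sample name to the set of genomes detected in it.
    Returns (samples, genomes, rows), where rows[i][j] is 1 if genomes[i]
    was detected in samples[j].
    """
    samples = sorted(detections)
    candidates = sorted(set().union(*detections.values()))
    genomes, rows = [], []
    for genome in candidates:
        row = [1 if genome in detections[s] else 0 for s in samples]
        # Prevalence = fraction of samples in which the genome is present.
        if sum(row) / len(samples) >= min_prevalence:
            genomes.append(genome)
            rows.append(row)
    return samples, genomes, rows


# Hypothetical detections per sample.
detections = {"S1": {"g1", "g2"}, "S2": {"g1"}, "S3": {"g1", "g3"}}
print(presence_matrix(detections, min_prevalence=0.5))
```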

### Nucleotide Diversity Explorer

Comprehensive analysis of within-population genetic diversity.

![Nucleotide Diversity Overview](../images/05STRAIN_NucDiv_overview.png)

The overview panel displays the distribution of nucleotide diversity (π) values across all genomes and samples. InStrain calculates nucleotide diversity using the [Nei and Li (1979)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1213565/) method. At each genomic position, π = 1 − [(frequency of A)² + (frequency of C)² + (frequency of G)² + (frequency of T)²], that is, one minus the sum of the squared base frequencies. Values are computed at every position with sufficient coverage (default ≥5x) and averaged across genes/genomes. This metric quantifies within-population genetic variation and is robust to coverage differences between samples. The panel provides summary statistics and quality filtering options for reliable microdiversity estimates.
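
The per-position formula can be worked through in a few lines. The sketch below computes π from base counts at each position and averages over positions that meet a minimum coverage. The function names and toy counts are made up for illustration, and InStrain's real implementation includes details not shown here.

```python
def site_diversity(counts):
    """Nucleotide diversity at one position: 1 - sum of squared base frequencies."""
    depth = sum(counts.values())
    if depth == 0:
        raise ValueError("position has no coverage")
    return 1.0 - sum((n / depth) ** 2 for n in counts.values())


def mean_diversity(positions, min_cov=5):
    """Average site diversity over positions with coverage >= min_cov.

    Returns None when no position passes the coverage filter.
    """
    values = [
        site_diversity(c) for c in positions if sum(c.values()) >= min_cov
    ]
    return sum(values) / len(values) if values else None


# Toy base counts: a fixed site, an evenly split site, and a low-coverage site.
positions = [
    {"A": 10},          # pi = 0.0
    {"A": 5, "C": 5},   # pi = 0.5
    {"A": 2, "G": 1},   # coverage 3, skipped at min_cov=5
]
print(mean_diversity(positions))
```

A site where all reads agree contributes 0, and a site split evenly between two bases contributes 0.5, so the toy average is 0.25.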

---

![Nucleotide Diversity Distribution](../images/06STRAIN_Nuc_div.png)

The diversity distribution view shows detailed nucleotide diversity patterns with histograms and density plots. Users can explore the range of diversity values and identify genomes with high or low microdiversity.

---

![Nucleotide Diversity by Group](../images/07STRAIN_NucDiv_group.png)

The group comparison view presents boxplots comparing nucleotide diversity between metadata groups. Statistical tests (Wilcoxon, Kruskal-Wallis) are performed with effect size calculations and multiple testing correction options.
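
When one test is run per genome, the resulting p-values are usually adjusted together. This page does not say which correction methods the interface offers, so the sketch below uses the Benjamini-Hochberg procedure purely as a common example. The function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# One made-up p-value per genome.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```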

---

![SNVs per kbp](../images/08STRAIN_SNVsperkbp.png)

The SNV density analysis shows the number of single nucleotide variants per kilobase pair across genomes and samples. This metric indicates how densely variants occur along a genome and complements nucleotide diversity as a measure of population-level genetic variation.
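
The normalization itself is simple arithmetic. The sketch below assumes the denominator is the genome length in base pairs; the interface may instead use the number of covered bases. The function name and numbers are illustrative only.

```python
def snvs_per_kbp(snv_count, genome_length_bp):
    """SNV density: variants per 1,000 bp of genome (length-based, by assumption)."""
    if genome_length_bp <= 0:
        raise ValueError("genome length must be positive")
    return 1000 * snv_count / genome_length_bp


# 250 SNVs on a 2.5 Mbp genome.
print(snvs_per_kbp(250, 2_500_000))
```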

---

![SNVs vs Nucleotide Diversity](../images/09STRAIN_SNVs_Nucdiv.png)

The correlation view displays the relationship between SNV counts and nucleotide diversity metrics. Scatterplots with regression lines reveal how these metrics relate across different genomes and sample conditions.

### pN/pS Analysis

Selection pressure analysis at multiple levels.

![pN/pS Analysis Overview](../images/10STRAIN_pNpSanal.png)

The genome-wide pN/pS analysis displays the distribution of pN/pS ratios across all genomes, helping to identify genomes that may be under positive selection (pN/pS > 1) or purifying selection (pN/pS < 1). Group comparisons with statistical significance testing reveal differential selection patterns between conditions.
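
For orientation, a pN/pS ratio compares the rate of non-synonymous variants per non-synonymous site with the rate of synonymous variants per synonymous site. The sketch below shows that arithmetic under simplified assumptions: site counts are given as inputs, and the ratio is left undefined when no synonymous variants are observed. It is not InStrain's implementation, and all names are hypothetical.

```python
def pn_ps(n_snvs, s_snvs, n_sites, s_sites):
    """Ratio of non-synonymous to synonymous variant rates.

    Returns None when the ratio is undefined (no synonymous variants,
    or a non-positive site count).
    """
    if n_sites <= 0 or s_sites <= 0 or s_snvs == 0:
        return None
    p_n = n_snvs / n_sites  # non-synonymous variants per non-synonymous site
    p_s = s_snvs / s_sites  # synonymous variants per synonymous site
    return p_n / p_s


# Made-up counts: 6 non-synonymous SNVs over 3000 sites, 4 synonymous over 1000.
print(pn_ps(6, 4, 3000, 1000))
```

With few variants, the ratio can swing widely, which is one reason low-coverage genomes deserve caution.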

---

![Gene-level pN/pS by Group](../images/11STRAIN_gene_group.png)

The gene-level analysis provides fine-scale selection pressure investigation per gene with functional annotation integration (COG categories). Heatmap visualizations by functional category enable identification of genes under positive or purifying selection.

### Diversity vs pN/pS Correlation

Integrated analysis of diversity and selection.

![Diversity vs pN/pS Correlation](../images/12STRAIN_div_psps.png)

The correlation analysis displays scatterplots of nucleotide diversity versus pN/pS ratios with regression lines and confidence intervals. Per-genome and per-sample correlations are calculated using Pearson and Spearman methods, with group-specific analysis and multiple testing correction. This integrated view reveals relationships between population diversity and selection pressure across different conditions.
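
The difference between the two correlation methods can be seen in a small standard-library sketch: Spearman is Pearson applied to ranks, so it captures monotonic relationships that are not linear. The function names and values are illustrative and unrelated to the module's R code.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values with a monotonic but non-linear relationship.
diversity = [1, 2, 3, 4]
pnps = [1, 4, 9, 16]
print(pearson(diversity, pnps), spearman(diversity, pnps))
```

Here Spearman is exactly 1 because the ordering is preserved, while Pearson falls slightly below 1 because the relationship is curved.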

## Usage Workflow

A typical analysis workflow in the INTERACTIVE_STRAIN module includes:

1. Load your InStrain data using the Data Input panel
2. Use Shared Taxa Analysis to understand genome distribution
3. Explore nucleotide diversity patterns across samples and groups
4. Analyze selection pressure using pN/pS ratios
5. Investigate diversity-selection relationships
6. Export results as publication-ready figures and tables

:::{admonition} Tips for optimal performance
:class: tip

- For large datasets, filter to genomes of interest to improve responsiveness
- Use coverage thresholds to focus on high-confidence data
- Consider biological relevance when interpreting pN/pS values
- Export both visualizations and raw data for comprehensive documentation
:::

:::{admonition} Interpreting strain-level results
:class: note

When analyzing strain-level data, consider:
1. **Coverage effects**: Low-coverage genomes may have unreliable diversity estimates
2. **Sample size**: Statistical comparisons require adequate samples per group
3. **Biological context**: pN/pS interpretation depends on generation time and population size
4. **Multiple genomes**: Be cautious of multiple testing when analyzing many genomes
:::

## Usage Notes

- The **<span style="color:#2FA4E7">INTERACTIVE_STRAIN</span>** module works with InStrain output from the <span style="color:#2FA4E7">WMS_STRAIN</span> module.
- For optimal performance with large datasets (>50 samples, >100 genomes), consider increasing CPU allocation with `--cpus`.
- The interface automatically detects available genomes and metadata variables.
- Keep the terminal running while using the web interface; closing it will terminate the server.
- GTDB metadata enhances analysis by providing taxonomic context for genomes.
- Nucleotide diversity and pN/pS calculations require sufficient coverage (typically ≥5x).

## Technical Implementation

The INTERACTIVE_STRAIN module is built using the Shiny framework and implements a modular architecture:

- **Core Components**:
  - **app.R**: Main application file defining the UI structure and server logic
  - **helper/plot_customization.R**: Shared functions for plot styling
  - **helper/error_handlers.R**: Error handling and user feedback

- **Analytical Modules**:
  - **shared_taxa_module.R**: Venn diagrams and UpSet plots for taxa sharing
  - **nucleotide_diversity_module.R**: Diversity analysis and visualization
  - **pnps_module_quick.R**: Genome-wide pN/pS analysis
  - **gene_level_pnps_module.R**: Gene-level selection analysis
  - **diversity_pnps_correlation_module.R**: Integrated correlation analysis

Each module is designed as a self-contained component with its own UI and server logic, allowing for independent development and maintenance while ensuring consistent data handling across the application.

## Next Steps

After exploring strain-level data in the <span style="color:#2FA4E7">INTERACTIVE_STRAIN</span> module, you can:

1. Complement with taxonomic analysis using the <span style="color:#0846FA">WMS_TAXONOMY</span> module
2. Explore functional potential with the <span style="color:#7030A0">WMS_FUNCTION</span> module
3. Investigate microbial interactions using the <span style="color:#2FA4E7">INTERACTIVE_NETWORK</span> module
4. Export processed data for custom analyses in R or Python

The <span style="color:#2FA4E7">INTERACTIVE_STRAIN</span> module provides comprehensive strain-level analysis, enabling researchers to understand microdiversity patterns, selection pressures, and evolutionary dynamics within microbial communities.
