nf-core/stableexpression
This pipeline identifies the most stable genes within one or more expression datasets. This is particularly useful for selecting the most suitable RT-qPCR reference genes for a specific species.
Introduction
This document describes the output produced by the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- Get accessions
  - Get Expression Atlas dataset accessions corresponding to the provided species (and optionally keywords) (run by default; optional)
  - Get NCBI GEO microarray dataset accessions corresponding to the provided species (and optionally keywords) (run by default; optional)
- Download data
  - Download Expression Atlas data (run by default; optional)
  - Download NCBI GEO data (run by default; optional)
- ID Mapping
  - Map gene IDs to NCBI Entrez Gene IDs (or Ensembl IDs) for standardisation among datasets using g:Profiler (run by default; optional)
- Data normalisation
  - Normalise RNA-seq raw data using TPM (requires downloading the corresponding genome and computing transcript lengths) or CPM
  - Perform quantile normalisation on each dataset separately using scikit-learn
- Merge all data
- Compute base statistics for each gene, across all platforms and for each platform (RNA-seq and microarray)
- Compute stability scoring
  - Get list of candidate genes based on base statistics
  - Run optimised, scalable version of NormFinder
  - Run optimised, scalable version of geNorm (NOT run by default; optional)
  - Compute stability scores for each candidate gene
- Aggregate results
- Prepare Dash Plotly app for further investigation of gene / sample counts
- Make MultiQC report
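To make the difference between the CPM and TPM normalisations named in the overview concrete, here is a minimal sketch using the textbook definitions. It is not the pipeline's own code: the function names, the toy counts and the transcript lengths are all hypothetical.

```python
def cpm(counts):
    """Counts per million: scale one sample's raw counts so they sum to 1e6."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: count / total * 1e6 for gene, count in counts.items()}


def tpm(counts, lengths_kb):
    """Transcripts per million: divide by transcript length, then scale to 1e6."""
    rates = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total_rate = sum(rates.values())
    if total_rate == 0:
        raise ValueError("sample has no reads")
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}


# Hypothetical toy sample: raw counts and transcript lengths in kilobases.
raw_counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_kb = {"geneA": 1.0, "geneB": 3.0, "geneC": 2.0}

cpm_values = cpm(raw_counts)
tpm_values = tpm(raw_counts, lengths_kb)
```

In this toy sample geneB has three times the raw count of geneA but is also three times longer, so both end up with the same TPM value while their CPM values differ. This is why TPM needs transcript lengths and CPM does not.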
Output files
MultiQC
This report is located at multiqc/multiqc_report.html and can be opened in a browser.
Output files
- multiqc/
  - MultiQC report file: multiqc_report.html.
  - MultiQC data dir: multiqc_data.
  - Plots created by MultiQC: multiqc_plots.
MultiQC (http://multiqc.info) is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools, e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
Dash Plotly app
dash_app/: folder containing the Dash Plotly app
To launch the app, you must first create and activate the appropriate conda environment:
conda env create -n nf-core-stableexpression-dash -f <OUTDIR>/dash_app/spec-file.txt
conda activate nf-core-stableexpression-dash
then:
cd dash_app
python app.py
and open your browser at http://localhost:8080
The app tries port 8080 by default. If that port is already in use, it tries 8081, 8082 and so on. Check the logs to see which port it is using.
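The port fallback described above can be pictured with the following sketch. It only illustrates the general "try the next port" idea and is not the app's actual implementation. The function name `first_free_port` and its parameters are hypothetical.

```python
import socket


def first_free_port(start=8080, max_tries=20, host="127.0.0.1"):
    """Return the first port >= start that can be bound on host."""
    for port in range(start, start + max_tries):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port already in use, try the next one
            return port
    raise RuntimeError(f"no free port in {start}-{start + max_tries - 1}")
```

Calling `first_free_port()` before launching can tell you in advance which port is likely to be picked on your machine, although the app's logs remain the authoritative source.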
Expression Atlas
Output files
- public_data/expression_atlas/accessions/: accessions found when querying Expression Atlas
- public_data/expression_atlas/datasets/: count datasets (normalised: *.normalised.csv / raw: *.raw.csv) and experimental designs (*.design.csv) downloaded from Expression Atlas.
GEO
Output files
- public_data/geo/accessions/: accessions found when querying GEO
- public_data/geo/datasets/: count datasets (normalised: *.normalised.csv / raw: *.raw.csv) and experimental designs (*.design.csv) downloaded from GEO.
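Because the downloaded Expression Atlas and GEO files are distinguished only by their suffix, a small helper can group them by kind. The sketch below relies on the three suffixes listed above. The function name `classify_datasets` and the `KINDS` table are hypothetical.

```python
from pathlib import Path

# Suffixes taken from the output description above.
KINDS = {
    ".normalised.csv": "normalised",
    ".raw.csv": "raw",
    ".design.csv": "design",
}


def classify_datasets(datasets_dir):
    """Group the CSV files in a datasets directory by kind."""
    groups = {kind: [] for kind in KINDS.values()}
    for path in sorted(Path(datasets_dir).glob("*.csv")):
        for suffix, kind in KINDS.items():
            if path.name.endswith(suffix):
                groups[kind].append(path.name)
                break
    return groups
```

For example, `classify_datasets("<OUTDIR>/public_data/geo/datasets")` would return a dictionary with the keys `normalised`, `raw` and `design`, each holding a sorted list of file names.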
IDMapping (g:Profiler)
Output files
- idmapping/
  - Count datasets whose gene IDs have been mapped: *.renamed.csv.
  - Table associating original gene IDs and mapped gene IDs: *.mapping.csv.
  - Gene metadata (name and description): *.metadata.csv.
Normalisation
Output files
- normalised/: Newly normalised datasets
  - normalised/deseq2/ for DESeq2
  - normalised/edger/ for edgeR
- quantile_normalised: Quantile normalised datasets
Gene base statistics
Output files
- merged_datasets/: Merged count datasets (sample-wide)
  - merged_datasets/all/: all datasets together
  - merged_datasets/rnaseq/: only RNA-seq datasets
  - merged_datasets/microarray/: only microarray datasets
Merged counts
The file containing all normalised counts is bundled as a Parquet file with the Dash Plotly app.
Output files
dash_app/data/all_counts.parquet: Merged count datasets (sample-wide)
Summary of gene statistics and scores
The summary of gene statistics is also bundled with the Dash Plotly app.
Output files
dash_app/data/all_genes_summary.csv: file containing all gene statistics and scores, ranked by stability score
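Since the summary file is described as already ranked by stability score, the best candidates can be read off the top without re-sorting. The sketch below uses only the Python standard library and assumes a comma-separated file with a header row. The function name `top_ranked_genes` is hypothetical, and no column names are assumed.

```python
import csv


def top_ranked_genes(summary_csv, n=10):
    """Return the first n rows of the summary, which is already ranked."""
    with open(summary_csv, newline="") as handle:
        reader = csv.DictReader(handle)
        # zip stops after n rows, so large files are not read in full.
        return [row for _, row in zip(range(n), reader)]
```

For example, `top_ranked_genes("<OUTDIR>/dash_app/data/all_genes_summary.csv", n=5)` returns the five top-ranked rows as dictionaries keyed by whatever column names the file's header provides.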
Overall experimental design
Output files
dash_app/data/whole_design.csv: file containing all experimental design information
Pipeline information
Output files
- pipeline_info/
  - Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot / pipeline_dag.svg.
  - Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
  - Parameters used by the pipeline run: params.json.
Nextflow generates several reports on the running and execution of the pipeline. These help you troubleshoot errors and also provide other information such as launch commands, run times and resource usage.
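As one way to use these reports for troubleshooting, the sketch below scans execution_trace.txt for tasks that did not finish cleanly. It assumes the trace file is tab-separated with a header row containing the `name`, `status` and `exit` columns, as in Nextflow's default trace fields. A run configured with custom trace fields may not match. The function name `failed_tasks` is hypothetical.

```python
import csv


def failed_tasks(trace_path):
    """Return (name, status, exit) for tasks that failed or were aborted."""
    with open(trace_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [
            (row["name"], row["status"], row["exit"])
            for row in reader
            if row["status"] in {"FAILED", "ABORTED"}
        ]
```

An empty list from `failed_tasks("<OUTDIR>/pipeline_info/execution_trace.txt")` means every recorded task either completed or was restored from cache.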