Software - Helmholtz-Centre for Environmental Research

Tools

At COMPBC, we develop and provide tools for robust, consistent, and reproducible data analysis, including AI and other data science software aimed at predicting chemical-effect associations and protein-protein interactions at a universe level. Our programs also identify enriched pathways in multi-omics data, and we provide computational methods to find, curate and systematically retrieve suitable datasets for our research.

uap – Robust, Consistent, and Reproducible Data Analysis

Authors

Christoph Kämpf, Michael Specht, Sven-Holger Puppel, Alexander Scholz, Gero Doose, Kristin Reiche, Jana Schor, Jörg Hackermüller

Summary

uap executes, controls, and keeps track of the analysis of large data sets. It enables users to perform robust, consistent, and reproducible data analysis. uap encapsulates the usage of (bioinformatic) tools and handles data flow and processing during an analysis. Users can combine predefined or self-made analysis steps to create custom analyses. Analysis steps encapsulate best-practice usage of bioinformatic software tools. uap focuses on the analysis of high-throughput sequencing (HTS) data, but its plugin architecture allows users to add functionality so that it can be used for any kind of large-scale data analysis.

uap is a command-line tool, implemented in Python. It requires a user-defined configuration file, which describes the analysis, as input.
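
The idea of an analysis described as a set of interdependent steps can be illustrated with a minimal Python sketch. The step names and the dictionary layout below are hypothetical and do not reproduce uap's actual configuration format (the uap documentation describes that); the sketch only shows how declared dependencies determine a valid execution order.

```python
from graphlib import TopologicalSorter

# Hypothetical analysis description: each step lists the steps it depends on.
# These names are illustrative only, not real uap step identifiers.
analysis = {
    "raw_reads": [],
    "quality_control": ["raw_reads"],
    "adapter_trimming": ["raw_reads"],
    "alignment": ["adapter_trimming"],
    "read_counting": ["alignment"],
}


def execution_order(steps):
    """Return the steps in an order that respects all declared dependencies."""
    return list(TopologicalSorter(steps).static_order())


order = execution_order(analysis)
print(order)
```

A step only appears in the returned order after every step it depends on, which is the property a workflow tool needs before it can dispatch work locally or to a cluster.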

uap supports grid engines such as SLURM and UGE for connecting to HPC clusters.

Important Links

Software download	https://github.com/yigbt/uap
Documentation	https://uap.readthedocs.io/en/master/index.html
Docker build's context	https://github.com/yigbt/uap-docker
Travis CI	https://travis-ci.org/yigbt/uap
Singularity Container	https://cloud.sylabs.io/library/bioinf_ufz/uap/uap.sif

deepFPlearn - AI for predicting chemical-effect associations at the universe level

Authors

Jana Schor, Patrick Scheibe, Matthias Bernt, Wibke Busch, Chih Lai, Jörg Hackermüller

Summary

deepFPlearn is an AI tool that predicts associations between chemicals and gene targets. Based on their molecular structure, chemicals often interfere with biomolecules, leading to adverse effects in the respective organism. deepFPlearn is a ready-to-use deep learning (DL) tool that combines feature reduction with a deep autoencoder and subsequent classification with a deep feed-forward neural network. We decreased the discrepancy between the large descriptor size (molecular structure of a chemical) and the limited amount of labeled training data by i) using a simple representation of the chemical's structure, the binary fingerprint, and ii) applying feature compression prior to the classification to an effect. We provide trained models for endocrine disruption (ED), i.e., for chemicals that mimic or interfere with the body's hormones. However, the tool is highly flexible and can be trained with other datasets.
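
The two-stage idea described above (a binary fingerprint that is compressed before classification) can be sketched in plain Python. Everything here is a toy stand-in: the n-gram hashing fingerprint, the layer sizes, and the random, untrained weights are assumptions for illustration, not deepFPlearn's actual fingerprints, architecture, or API.

```python
import math
import random
import zlib

FP_BITS = 64     # toy fingerprint length (assumed; real fingerprints are longer)
LATENT_DIM = 8   # toy size of the compressed representation (assumed)


def toy_fingerprint(smiles, n_bits=FP_BITS, ngram=3):
    """Hash character n-grams of a SMILES string into a binary vector."""
    bits = [0] * n_bits
    for i in range(max(1, len(smiles) - ngram + 1)):
        fragment = smiles[i:i + ngram]
        bits[zlib.crc32(fragment.encode("utf-8")) % n_bits] = 1
    return bits


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, activation):
    """Apply one fully connected layer without bias: activation(W . x)."""
    return [activation(sum(w * x for w, x in zip(row, inputs))) for row in weights]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Untrained random weights; a real model would learn these from labeled data.
encoder_w = [[rng.uniform(-1, 1) for _ in range(FP_BITS)] for _ in range(LATENT_DIM)]
classifier_w = [[rng.uniform(-1, 1) for _ in range(LATENT_DIM)]]

fingerprint = toy_fingerprint("CC(C)(c1ccc(O)cc1)c1ccc(O)cc1")  # bisphenol A
latent = dense(fingerprint, encoder_w, math.tanh)   # feature compression step
score = dense(latent, classifier_w, sigmoid)[0]     # classification step, value in (0, 1)
print(len(fingerprint), len(latent), score)
```

Because the weights are random, the printed score carries no chemical meaning; the sketch only shows how a long, sparse bit vector is reduced to a short dense vector before the classifier sees it.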

Availability

Code repository	https://github.com/yigbt/deepFPlearn
Preprint	Jana Schor, Patrick Scheibe, Matthias Bernt, Wibke Busch, Chih Lai, Jörg Hackermüller. AI for predicting chemical-effect associations at the universe level - deepFPlearn. bioRxiv 2021.06.24.449697; doi: https://doi.org/10.1101/2021.06.24.449697

multiGSEA: A GSEA-based pathway enrichment analysis for multi-omics data

Authors

Sebastian Canzler, Jörg Hackermüller

Summary

Gaining biological insights into molecular responses to treatments or diseases from omics data can be accomplished by gene set or pathway enrichment methods. A plethora of different tools and algorithms have been developed so far. Among those, the gene set enrichment analysis (GSEA) proved to control both type I and II errors well.

In recent years, the call for a combined analysis of multiple omics layers has become prominent, giving rise to a few multi-omics enrichment tools, each of which has its own drawbacks and restrictions regarding its universal application.

Here, we present the multiGSEA package, which calculates a combined GSEA-based pathway enrichment across multiple omics layers. The package queries 8 different pathway databases and relies on the robust GSEA algorithm for the single-omics enrichment analysis of each layer. In a final step, the resulting scores are combined into a robust composite multi-omics pathway enrichment measure. multiGSEA supports 11 different organisms and includes a comprehensive mapping of transcript, protein, and metabolite IDs.
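
How per-layer results might be merged into one composite value can be illustrated with Stouffer's Z-method, a standard technique for combining p-values. This is a minimal Python sketch with made-up p-values; whether it matches the combination method and defaults used inside multiGSEA should be checked against the package documentation.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def stouffer_combine(p_values):
    """Combine one-sided p-values with the unweighted Stouffer Z-method."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    # Clamp to the open interval (0, 1) so the inverse CDF stays finite.
    eps = 1e-15
    clamped = [min(max(p, eps), 1.0 - eps) for p in p_values]
    z_scores = [_STD_NORMAL.inv_cdf(1.0 - p) for p in clamped]
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return 1.0 - _STD_NORMAL.cdf(combined_z)


# Made-up p-values for one pathway in three omics layers (illustrative only).
layer_p = {"transcriptome": 0.04, "proteome": 0.03, "metabolome": 0.20}
combined = stouffer_combine(list(layer_p.values()))
print(f"combined p-value: {combined:.4f}")
```

In this example, two moderately small p-values and one unremarkable p-value yield a combined value below each of the inputs, which is the kind of cross-layer reinforcement a composite measure is meant to capture.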

Important links

Software download	GitHub
Documentation	Bioconductor Vignette
Bioconductor	Bioconductor Package
Citation	Sebastian Canzler, Jörg Hackermüller. multiGSEA: A GSEA-based pathway enrichment analysis for multi-omics data. BMC Bioinformatics 21, 561 (2020). https://doi.org/10.1186/s12859-020-03910-x

ProteinPrompt: predicting protein-protein interactions

Authors

Sebastian Canzler, Markus Fischer, David Ulbricht, Nikola Ristic, Peter W. Hildebrand, René Staritzbichler

Summary

ProteinPrompt is a webserver and stand-alone tool that uses machine-learning algorithms to calculate specific, currently unknown protein-protein interactions by means of the amino acid sequence alone. It's designed to quickly and reliably predict contacts based on an input sequence in order to scan large sequence libraries for potential binding partners, with the goal to accelerate and assure the quality of the laborious process of drug target identification.

Availability

Webserver	https://proteinformatics.uni-leipzig.de/protein_prompt/
Gitlab	https://gitlab.hzdr.de/proteinprompt/ProteinPrompt
Docker Container	GitLab Registry
Citation	Sebastian Canzler, Markus Fischer, David Ulbricht, Nikola Ristic, Peter W Hildebrand, René Staritzbichler, ProteinPrompt: a webserver for predicting protein–protein interactions, Bioinformatics Advances, Volume 2, Issue 1, 2022, vbac059, https://doi.org/10.1093/bioadv/vbac059

MOD-Finder - Search compound-related multi-omics data sets in various sources.

Authors

Sebastian Canzler, Jörg Hackermüller, Jana Schor

Summary

Collecting omics data sets from different molecular levels, such as transcriptome, proteome, and metabolome, for use in a multi-omics data analysis is a highly tedious task. This is mainly due to the large number of potential databases to search and their non-unified query systems, which result in a fairly large amount of manual work.

To surmount these obstacles, we developed the Multi-Omics Data set Finder (MOD-Finder), an R Shiny application, as part of the CEFIC LRI-C5 XomeTox project to efficiently search for compound-related omics data sets in an automated manner. To this end, several publicly available databases are automatically queried for data sets related to a user-specified compound or toxicant. The results are presented in a plain data table. Additionally, compound-related information such as distinct IDs, synonyms, and descriptions, as well as visualizations of chemical-gene interactions and KEGG pathway enrichments, is provided.

Important Links

Source code	https://github.com/yigbt/MOD-Finder
Citation	Canzler, S, Hackermüller, J, Schor, J (2019): MOD-Finder: Identify multi-omics data sets related to defined chemical exposure; arxiv.org (preprint); https://doi.org/10.48550/arXiv.1907.06346

Containers for Reproducible Research

Container for Transcriptome Analysis

Description

We built a Docker container specifically designed for transcriptomics data analysis. We use the rocker/verse container and extend it with several R packages from CRAN and Bioconductor to ensure a reproducible working environment.

Within the container, an RStudio Server instance is running, which enables remote access through a web browser.

Availability

Download the docker container from DockerHub: https://hub.docker.com/r/boll3/rocker_transcriptomics

Author

Sebastian Canzler

rocker/verse

The rocker project offers version-stable rocker images with rstudio server. The particular rocker/verse images are extended by tidyverse packages as well as tex and publishing-related packages.

Current rocker/verse version: 4.1.0

Usage

How to use the docker container is nicely described in the rocker manual.
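
As a rough sketch of what starting the container could look like, the following Python snippet assembles a `docker run` command. It assumes the image keeps the rocker conventions (RStudio Server on port 8787, login password set through the `PASSWORD` environment variable) and that an untagged pull resolves to a usable image; the password value is a placeholder.

```python
import shlex

IMAGE = "boll3/rocker_transcriptomics"  # image name taken from the DockerHub URL


def rstudio_run_command(image, password, host_port=8787):
    """Build a `docker run` command for a rocker-style RStudio Server image."""
    return [
        "docker", "run", "--rm",
        "-p", f"{host_port}:8787",     # rocker images serve RStudio on port 8787
        "-e", f"PASSWORD={password}",  # rocker images read the login password here
        image,
    ]


command = rstudio_run_command(IMAGE, password="change-me")
print(shlex.join(command))
# To actually start the container (requires a local Docker installation):
# import subprocess; subprocess.run(command, check=True)
```

If these assumptions hold, RStudio Server would then be reachable in a web browser at http://localhost:8787 with the user name `rstudio` and the chosen password.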

Additional Packages

In order to be able to run transcriptomics analysis, we extended the rocker/verse container by several R packages from CRAN and Bioconductor.

Plotting and visuals

EnhancedVolcano
karyoploteR
enrichplot

Differential gene expression analysis

DESeq2
IHW
sva
RUVSeq

Functional characterization

fgsea
multiGSEA
clusterProfiler
EGSEA

Annotation

org.Rn.eg.db
org.Hs.eg.db
org.Mm.eg.db
org.Dr.eg.db
biomaRt
AnnotationHub
metaboliteIDmapping
BSgenome.Rnorvegicus.UCSC.rn6

CRAN packages

Rcpp
BiocParallel
hexbin
apeglm
ashr
glmpca
pheatmap
eulerr
PoiClaClu
msigdbr
gtools
DT
proj4
WGCNA
bookdown
gridExtra
xtable
ggnewscale
ggupset
ggridges

Container for Multi-Omics Data Analysis

Description

Here, we publish a Docker container specifically designed for multi-omics data analysis. We use the rocker/verse container and extend it with several R packages from CRAN and Bioconductor to ensure a reproducible working environment.

Availability

Download the docker container from DockerHub: https://hub.docker.com/r/boll3/rocker_multiomics

Author

Sebastian Canzler

rocker/verse

The rocker project offers version-stable rocker images with rstudio server. The particular rocker/verse images are extended by tidyverse packages as well as tex and publishing-related packages.

Current rocker/verse version: 4.1.2

Usage

How to use the docker container is nicely described in the rocker manual.

Additional Packages

In order to be able to run multi-omics analysis, we extended the rocker/verse container by several R packages from CRAN and Bioconductor.

Multi-omics analysis

MOFA2
mixOmics

Plotting and visuals

EnhancedVolcano
enrichplot

Tools for single-omics analysis and data preparation

DESeq2
limma
DEP

Functional characterization

fgsea
multiGSEA
clusterProfiler
EGSEA

Annotation

biomaRt
AnnotationDbi
AnnotationHub
org.Rn.eg.db
org.Hs.eg.db
org.Mm.eg.db
org.Dr.eg.db
metaboliteIDmapping
BSgenome.Rnorvegicus.UCSC.rn6

CRAN packages

Rcpp
BiocParallel
hexbin
eulerr
pheatmap
msigdbr
gtools
DT
proj4
bookdown
gridExtra
xtable
ggnewscale
ggupset
ggridges
reticulate

Packages

metaboliteIDmapping R package

Description

The R package 'metaboliteIDmapping' provides a comprehensive mapping table of nine different metabolite ID formats and their common names. The data has been collected and merged from four publicly available sources: HMDB, Comptox Dashboard, ChEBI, and the graphite Bioconductor R package.

Availability

Bioconductor package	https://bioconductor.org/packages/metaboliteIDmapping/
Documentation	http://bioconductor.org/packages/release/data/annotation/vignettes/metaboliteIDmapping/inst/doc/metaboliteIDmapping.html
Github	https://github.com/yigbt/metaboliteIDmapping

Author

Sebastian Canzler

Toolbox Toxicokinetic Modeling

Toolbox