sparse2big

From sparse to big data: Imputation and Fusion for Massive Sparse Data

Large data sets with many variables frequently contain unobserved, missing or noisy entries. Dealing with these missing values is crucial for any later step of analysis. Solutions have been developed in various fields, from general sample imputation to modeling observation processes or making downstream analyses robust against missing values. Only by properly handling these sparse data sets, including combining multiple sparse observations of the same entity, can we hope to achieve meaningful big data and insightful in-depth analyses. Hence developing, evaluating and sharing methods for data imputation and integration will be an enabler for many research areas, with potential use cases ranging from patient data in medicine to remote sensing in geography or sample noise in imaging.

In this proposal, we aim to bring together researchers across eight Helmholtz centers to develop and study such innovative methods and techniques. To allow for detailed evaluation, to gain international visibility, and to show the relevance of such methodological research, we propose to initially focus on one major use case of strategic relevance to all Health Research Centers, namely single cell genomics. Technological advances are rapid and allow the profiling of genomes, transcriptomes and epigenomes at an unprecedented level of resolution, but at the cost of the quality of each single observation and with a considerable number of missing values. Despite these drawbacks, these techniques are currently revolutionizing biology and medicine, combining the advantages of modern bulk sequencing techniques and microscopic analyses of single cells, and providing a new type of “molecular microscope”. We will follow up this initial study with additional, smaller-scale perspective projects, for example in the context of remote sensing, and have a clear plan for translating the developed ideas to other areas within Helmholtz and beyond.

Milestone WP2-2:

As part of the second work package, the "Young Investigators Group Bioinformatics and Transcriptomics" will evaluate whether imputation approaches can alleviate the sparseness of single cell RNA sequencing (scRNA-seq) data sets so that network inference methods can be applied. Furthermore, we aim to investigate how the proposed missing-value invariant correlation estimation approach assists in identifying gene networks in scRNA-seq data.
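
To give an intuition for correlation estimation that tolerates missing values, the following minimal sketch computes a Pearson correlation between two genes using only the cells in which both values are observed (often called pairwise-complete correlation). This is an illustrative assumption and not the approach proposed in the project: the function name `pairwise_complete_corr`, the `min_obs` threshold, the toy data, and the choice to encode missing values as NaN are all hypothetical.

```python
import math


def pairwise_complete_corr(x, y, min_obs=3):
    """Pearson correlation over positions where both x and y are observed.

    Missing values are encoded as NaN (an assumption of this sketch).
    Returns NaN if fewer than `min_obs` joint observations exist or if
    either variable is constant on the jointly observed positions.
    """
    # Keep only the pairs in which neither value is missing.
    pairs = [(a, b) for a, b in zip(x, y)
             if not (math.isnan(a) or math.isnan(b))]
    n = len(pairs)
    if n < min_obs:
        return math.nan

    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)

    if sxx == 0 or syy == 0:
        return math.nan
    return sxy / math.sqrt(sxx * syy)


# Toy expression values for two hypothetical genes across six cells;
# NaN marks an entry treated as missing (for example a suspected dropout).
gene_a = [1.0, 2.0, math.nan, 4.0, 5.0, math.nan]
gene_b = [2.0, 4.0, 1.0, 8.0, math.nan, 3.0]

r = pairwise_complete_corr(gene_a, gene_b)
print(r)
```

Only the three cells in which both genes are observed contribute to the estimate. In real scRNA-seq data, deciding which zeros count as missing is itself a modeling question that this sketch does not address.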