Rahul Satija

and 2 more

Aligning Biological Manifolds for Integrated Analysis of Single Cell Data

1. Summary

The ability to integrate information across multiple datasets represents an enormous opportunity for the Human Cell Atlas (HCA), yet presents substantial computational challenges. Even for individual tissues, the HCA will be constructed from datasets produced across multiple technologies, with samples taken from multiple individuals. To comprehensively define a set of human cellular phenotypes, the community cannot build a separate atlas for each technology, but must instead be capable of unifying findings from diverse experiments. Here we aim to address a fundamental question for the HCA: how can we effectively integrate a diverse community effort to study a complex human system in order to construct a coherent atlas? We propose that powerful machine-learning techniques based on 'joint manifold learning', often used in the 'alignment' of massive imaging datasets to recognize shared high-dimensional features, can be used to recognize shared cellular phenotypes across single cell datasets. This proposal will establish a new collaboration between the Satija Lab at New York Genome Center and the Marioni and Stegle Labs at EMBL/EBI, all of which have leading expertise in computational integration for scRNA-seq but have not previously worked together. We will collaboratively develop a set of methods and best practices for data integration, alongside novel metrics and benchmarks of significant value to the community. We will apply these approaches to fully integrate seven human neuronal scRNA-seq datasets, combining data from shallow, deep, cytoplasmic, and nuclear scRNA-seq technologies. Finally, we will extend these approaches to integrate datasets generated across different developmental timepoints, species, and spatial technologies.
Our work will generate not only benchmarks and metrics, but also a clear roadmap for how the HCA can assemble an integrated atlas from diverse data types from virtually any tissue.

2. Project Aims, and how they address program goals