Aligning Biological Manifolds for Integrated Analysis of Single Cell Data1. SummaryThe ability to integrate information across multiple datasets represents an enormous opportunity for the human cell atlas (HCA), yet presents substantial computational challenges. Even for individual tissues, the HCA will be constructed from datasets produced across multiple technologies, with samples taken from multiple individuals. However, in order to comprehensively define a set of human cellular phenotypes, the community cannot build a separate atlas for each technology, but instead must be capable of unifying together findings from diverse experiments. Here we aim to address a fundamental question for the human cell atlas: how can we integrate a diverse community effort to study a complex human system effectively in order to construct a coherent atlas? We propose that powerful machine-learning techniques based on 'joint manifold learning', often used in the 'alignment' of massive imaging datasets to recognized shared high-dimensional features, can be used to recognize shared cellular phenotypes across single cell datasets. This proposal will establish a new collaboration between the Satija Lab at New York Genome Center, and the Marioni and Stegle Labs at EMBL/EBI, who all have leading expertise in computational integration for scRNA-seq, but have not previously worked together. We will collaboratively develop a set of methods and best practices for data integration, alongside novel metrics and benchmarks that are of significant importance and value to the community. We will apply these approaches to fully integrate seven human neuronal scRNA-seq datasets, combining data from shallow, deep, cytoplasmic, and nuclear scRNA-seq technologies. Finally, we will extend these approaches to integrate datasets generated across different developmental timepoints, species, and spatial technologies. Our work will generate not only benchmarks and metrics, but also a clear roadmap for how the HCA can assemble an integrated atlas from diverse data types from virtually any tissue.2. Project Aims, and how they address program goals
We thank the reviewers for their careful reading and constructive feedback on our manuscript. We appreciated that each reviewer offered support and enthusiasm for our work, but raised substantial concerns. We have therefore added substantial new experimental, analytical, and functional datasets to our manuscript. These new data are in agreement with our original conclusions, but we believe fully address the reviewer concerns and significantly strengthen the manuscript. We provide a point-by-point response to each reviewer concern below, but first summarize our new data and analyses, emphasizing points that were raised by multiple reviewers.Technical limitations of scRNA-seq data. Reviewers #2 and #3 both highlighted that the sequencing depth in our original Drop-seq data could limit the interpretation of our dataset, particularly our finding that mitotic progenitors do not show transcriptomic evidence of fate specification. To address this, we first more than doubled the sequencing depth of our Drop-seq dataset. In parallel, we used a version of the SMART-Seq2 protocol to obtain deep sequencing data (4-7,000 genes/cell) from 400 mitotic progenitors. Both new datasets strongly validate our original findings, producing nearly identical developmental trajectories, and exhibiting no evidence of distinct subpopulations of fate-restricted cells in the subventricular zone. As requested by Reviewer #2, we performed the SMART-Seq2 experiments with the FlashTag system, providing an independent validation that our inferred maturation trajectory is consistent with biological time. Finally, we added data produced using the 10X genomics system for 7,000 postmitotic progenitors with significantly increased sequencing depth (1,500-11,000 UMI/cell, 1,000-4,000 genes/cell), and observed identical branching patterns as in our original Drop-seq dataset. Taken together, these new data demonstrate that our original findings and interpretations were not skewed by technical limitations, and could be fully reproduced from complementary scRNA-seq techniques.Identifying the emergence of distinct subtypes. While our initial manuscript focused on our unexpected finding that committed progenitor subsets could not be identified in mitotic progenitors, all reviewers requested further analysis of when distinct interneuron subsets first emerge during development, and how heterogeneity in the eminences could be connected to later timepoints. Leveraging a Dlx6a; Cre ;Ai9 line that uniformly marks interneuron progenitors, We therefore collected a set of 15,522 single cells across a developmental timecourse (E13.5, E18, P10), obtaining 3,432 additional P56 cells from a public resource hosted by the Allen Brain Atlas. We next applied our recently developed 'alignment' tool for scRNA-seq datasets (Butler et al., 2017), to identify shared sources of variation that exist between developmental and adult stages. This allowed us to link the emergence of interneuron subtypes across embryonic, postnatal, and adult timepoints, to identify the earliest emergence of non-overlapping populations that we hypothesize are committed to Vip, Sst, Id2, and Pvalb interneuron fates, and the initial transcriptomic markers that mark these groups across all developmental stages. Remarkably, functional perturbation of one of the strongest discriminating genes between early Pvalb and Sst progenitors, the autism-associated transcription factor Mef2c, led to a complete yet specific ablation of Pvalb interneurons, validating our computational predictions. We believe that these aligned datasets will serve as a valuable resource, in particular by revealing rich lists of genes that likely play similarly important functional roles in the establishment and maintenance of interneuron fates.Biological validation of key computational predictions. All reviewers requested that complementary methods to scRNA-seq be used to broaden support for our key conclusions, and Reviewer #2 specifically requested that specific computational predictions from the sequencing data should be validated and functionally characterized. [NB from Rahul : This paragraph is repetitive, but meant to highlight how new experiments are in the revision in one place, most specificially for Reviewer 2. We could add more here, but may want to highlight techniques and data that are not more scRNA-seq, though we could add Lhx6 discussion here as well]. As discussed above, we generated data using the tagging system from the Jabaudon lab (FlashTag system) as a complementary technique to measure a cell's true biological birth date, and validate our inferred maturation trajectory. We leveraged trancriptomic markers of our three post-mitotic 'branches' to perform genetic fate mapping of MGE branch 3 cells based on specific expression of Lhx8, and find that Lhx8+ progenitors do not give rise to cortical interneurons as predicted, but instead of give rise to X. Finally, we validate a functional role for a key transcription factor, Mef2c, that we predict discriminates between early progenitors of MGE-derived interneuron subtypes, by performing immunohistochemistry on a conditional knockout mouse. These experiments demonstrate the high quality of our computational predictions from Drop-seq data, and more importantly, provide complementary support and validation for our key conclusions.Conceptual insights that advance our understanding of interneuron development. All reviewers expressed enthusiasm for our dataset and analysis, but requested greater clarification of our new insights and advances. In the revised manuscript, we have worked to clearly articulate the following advances : First, we discover that mitotic progenitors exhibit tightly conserved maturation programs across eminences, with downstream differences in fate seeded by only a few genes that have largely been previously described. We see no evidence for detailed fate restriction or non-overlapping subpopulations for individual cells within an eminence, an important insight that weighs heavily on outstanding questions in the field. Second, we discover that upon cell-cycle exit, cells branch into one of three distinct precursor states, representing fate-restricted populations of interneurons and projection neurons. Surprisingly, the markers defining the branches exhibit conservation across the MGE, CGE, LGE, although the proportion of cells in each state does vary across eminences, along with the exact expression programs activated along each branch. Lastly, we trace the emergence of interneuron heterogeneity across a developmental time series, and identify subtle heterogeneity in the postmitotic MGE that separates SST and Pvalb-committed progenitors. We validate that Mef2c, a transcriptomic marker that is enriched in PV -mapping progenitors at multiple time points, is functionally required for the production of Pvalb but not Sst interneurons. Similarly, we propose a list of early markers for early progenitors of Sst, Vip, and Id2 interneurons which are conserved throughout development, and may play similar functional roles.
SUMMARYHematopoietic stem cells (HSC) give rise to diverse cell types in the blood system, yet our molecular understanding of the early trajectories that generate this enormous diversity in humans remains incomplete. Here we leverage Drop-seq, a massively parallel single cell sequencing approach, to individually profile more than 20,000 progenitors cells from human cord blood, without prior enrichment or depletion for individual lineages based on surface markers. Our data reveal a transcriptional ‘atlas’ of progenitor states in human cord blood, representing four committed lineages downstream from HSC, alongside the transcriptional dynamics underlying commitment. We identify intermediate stages that simultaneously co-express 'primed' programs for multiple downstream lineages, and also observe striking heterogeneity in the early molecular transitions between myeloid subsets. Integrating our data with an scRNA-seq dataset from human bone marrow, we illustrate the molecular similarity between these two commonly used systems, and further explore the chromatin dynamics of 'primed' transcriptional programs based on ATAC-seq. Finally, we demonstrate and functionally validate that Drop-seq data can be utilized to identify new heterogeneous surface markers of cell state. Our findings shed significant new light on early fate transitions in human hematopoiesis, and highlight the exciting potential for unbiased single cell approaches to reconstruct complex developmental processes.INTRODUCTIONHematopoiesis is the dynamic process by which a single hematopoietic stem cell (HSC) can give rise to the breathtaking cellular diversity present in blood, potentially representing tens to hundreds of distinct cell types which can be loosely grouped into erythroid, myeloid, and lymphoid lineages (Becker et al., 1963; Orkin, 2000; Orkin and Zon, 2008; Seita and Weissman, 2010) . However, despite enormous biological and clinical relevance, the molecular trajectories that cells traverse during lineage commitment remain poorly understood. Seminal experimental work in the mouse has suggested a model where individual HSCs undergo a gradual loss of pluripotency and pass through distinct intermediate progenitors represented by a series of binary branchings, with the first lineage decision representing either myelo-erythroid or lymphoid specification (Akashi et al., 2000; Kondo et al., 1997).Recent studies, however, have proposed both minor and major alterations to the structure of the traditional model, for example positing a direct path from HSC to erythroid and megakaryocytic lineages (Adolfsson et al., 2005) or demonstrating diverse lineage origins for myeloid cells (Drissen et al., 2016; Franco et al., 2010; Görgens et al., 2013), all highlighting a lack of consensus regarding the molecular nature of early fate transitions in human hematopoiesis. The evidence for each of these models is based primarily on the enrichment of putative progenitor cell populations from fluorescence-activated cell sorting (FACS). Even slight differences in the surface markers utilized, gating strategy for enrichment, or downstream assay conditions can skew the output and interpretation of these experiments (Etzrodt et al., 2014; Paul et al., 2015). Moreover, the protocols used to identify intermediate progenitor types can vary widely between different labs, necessitating unbiased approaches to define transition states at the single cell level (Levine et al., 2015; Nestorowa et al., 2016). This is particularly true in human hematopoiesis, as well-characterized markers in mouse (i.e. Sca-1) do not directly translate to human systems (Doulatov et al., 2012).By contrast, single cell RNA-seq can provide a detailed molecular characterization of single cells that is highly complementary to traditional differentiation or FACS-based phenotyping approaches (Chattopadhyay et al., 2014). Massively parallel approaches that barcode cells in early stages of library preparation have enabled the routine profiling of thousands of single cells in developmental systems (Jaitin et al., 2014; Klein et al., 2015; Macosko et al., 2015). For example, a massively parallel scRNA-seq study of thousands of myeloid-restricted cells from the mouse bone marrow, poignantly demonstrated that individual cells in this pool (which was depleted for the early progenitors expressing stem cell marker Sca-1) had largely committed to individual lineages (Paul et al., 2015). Additionally, a recent pioneering study of human bone marrow CD34+ cells combining single-cell transcriptional and functional analysis (Velten et al., 2017) highlighted the continuous nature of early hematopoietic differentiation, and concluded that lineage commitment was not characterized by distinct branching during early transitions. In particular, the authors focused on the presence of a cloud-like population of early progenitors forming a fully interconnected hierarchy, which then gave rise to distinct lineage-committed populations.These complementary datasets and findings suggest that oligopotent progenitors may play a reduced role compared to initially proposed hierarchical models, but raise a pressing question for human hematopoiesis : can all progenitors be stratified into uncommitted and unipotent populations? Is there molecular evidence for multi-lineage priming in early progenitors, perhaps evidenced by co-expression of multiple genes in early cells that later become restricted to a single lineage? A deeper exploration of these questions could help bridge the insights derived from scRNA-seq and complementary techniques, including in-vivo barcoding assays and both in-vivo and in-vitro differentiation experiments, all of which reveal evidence for oligopotent states, albeit with non-uniform lineage outputs. In particular, recent work suggested that human umbilical cord blood could be a particularly attractive model to study, as this system contains a greater percentage of cells in oligopotent intermediate states compared to adult progenitors (Notta et al., 2016) .In this study, we applied Drop-seq to individually sequence 21,306 stem and progenitor cells from umbilical cord blood, spreading across five human donors. Our data revealed the presence of single cell clusters whose expression profiles were in tight agreement with previously defined progenitor populations, in addition to the presence of distinct myeloid progenitor subtypes. Importantly, we also observed a continuum of transitioning states that links these progenitor groups. We exploited these transitions to computationally reconstruct molecular trajectories connecting HSCs to four hematopoietic lineages. Our scRNA-seq data revealed strong evidence for recently proposed models that human myeloid cells can arise from distinct trajectories, with a subset of granulocytes sharing early molecular transitions with erythroid progenitors. We also observed that early progenitors co-expressed 'primed' programs for multiple downstream lineages, indicating multi-lineage priming within intermediate stages. By integrating scRNA-seq datasets from cord blood and bone marrow, we further demonstrated strong molecular conservation in both systems, and identify epigenetic trends that correlate with transcriptomic dynamics. Finally, we show that by coupling scRNA-seq with immunophenotyping measurements, Drop-seq data can be used to suggest new heterogeneous markers of cell state and lineage potential, which we validate through in vitro differentiation assays. Our results shed significant new light on the molecular nature of early fate transitions in human hematopoiesis, and highlight the exciting potential for high-throughput single cell analysis to de-convolve developmental systems.RESULTSUnbiased identification of cellular diversity in cord blood CD34+ cellsTraditionally, hematopoietic single cell analysis begins with FACS-mediated separation for specific cellular populations based on known surface marker expression, which enables the enrichment of extremely rare cell types (citation). In contrast, a relatively unbiased sampling strategy is preferred given the dramatic increase in scale enabled by Drop-seq, a recently developed massively parallel single cell library preparation technique (citation). The cord blood CD34+ pool has been widely accepted as a rich source of hematopoietic stem and progenitor cells (citation), and we therefore sought to minimize our sample preparation steps prior to single cell profiling, using enriched CD34+ cells from five umbilical cord blood units using density centrifugation and magnetic separation (Methods). In Drop-seq, single cells were co-encapsulated with barcoded beads in oil-based droplets, followed by pooled library preparation, sequencing, read alignment and gene quantification based on unique molecular identifiers (UMIs), as previously described (citation). Overall, our dataset contains 21,306 single cells across five biological replicates after initial filtering based on technical metrics (Methods), with an average of 8,975 mapped reads/cell. In total, 30,730 genes were detected across all cells, and 2,154 UMIs were assigned to each cell on average, which compare favorably to the recently published Drop-seq datasets (citation). After regression-based correction of cell cycle effects and technical covariates in our single cell data, we sought to identify the transcriptional subtypes of various abundances within the profiled CD34+ pool, by extending our previously developed clustering strategy from Drop-seq data (citation). Briefly, we performed linear dimensional reduction to reduce the dataset to 24 independent components, and then identified single cell clusters using a community detection method which has recently been applied to CyTOF data (citation). We identified a few rare clusters that likely represent CD34low/- cells that passed through the CD34 selection column, including CD3D+ T cells, KLDB1+ NK cells, MS4A1+ B cells and C5AR1+ myeloid cells (Figure S1), and we excluded these cells from all downstream analysis. Together, 10 clusters were retained, and we further identified the top transcripts most enriched within each cluster, displaying the 10 clusters alongside their top transcriptional markers in a heat map shown in Figure 1B. We also ensured that our clusters were reproducible across all five biological replicates. Briefly, the percentage of cells assigned to each cluster was highly consistent, and there is strong inter-individual reproducibility when we examined the expression of genes averaged within each replicate per cluster (Figure S1C). We therefore concluded that the cells from our five independent cord blood units spanned the same range of gene expression states, and at similar densities.We then sought to understand if our 10 CD34+ single cell clusters contained subtypes consistent with well-described hematopoietic progenitor populations identified from functional studies. A recently published microarray dataset was taken as reference, which contained the bulk expression profiles for sorted populations, including hematopoietic stem cells (HSCs), multipotent progenitors (MPPs), megakaryocyte-erythroid progenitors (MEPs), granulocyte-macrophage progenitors (GMPs) and multi-lymphoid progenitors (MLPs) (citation). We compared the profiles of these sorted progenitors with our identified clusters based on top marker expression, and observed strikingly that near-perfect overlap was seen for many of our clusters, enabling us to assign putative biological identities (Figure 1C). For example, C3 was characterized by erythroid fate-regulators and markers GATA1, CD36, and KLF1 (van Schravendijk et al., 1992; Pevny et al., 1991) and the entire set of markers we identified in C3 were highly enriched for expression in the megakaryocyte-erythroid progenitor (MEP) reference microarray profiles (Figure 1C). Similarly, we assigned C6 to an HSC/MPP identity, C9 to a granulocyte/macrophage progenitor (GMP), and C10 to the multi-lymphoid progenitor (MLP). We also identified a second cluster similar to HSC (C5), expressing similar markers as well as ZFP36 and a set of small RNAs, potentially representing HSC in a different metabolic state (Cheung and Rando, 2013). We did not, however, discover a cluster whose gene expression patterns were consistent with common myeloid progenitors (CMPs). This observation is consistent with the possibility that the currently defined human CMP represents a heterogeneous mixture of erythroid and myeloid-primed progenitors, as has recently been demonstrated in single cell analyses of murine bone marrow (Paul et al., 2015; Perié et al., 2015).We also identified distinct groups of cells that were not identified in previous reference datasets. For example, gene expression of C2 was similar to our MEP subgroup, but contained unique up regulation of genes whose expression has been previously associated with the development of mast cells (TPSAB1, a mast cell specific protease), basophils (LMO4; Paul et al., 2015), and eosinophils (MS4A2, FCER1A; Eon Kuek et al., 2016). Notably, in contrast to our GMP-like cluster (C9), C2 cells did not express neutrophil or monocytic genes, such as MPO, ELANE, LYZ, or CSF3R (Lau et al., 2005; Rotival et al., 2011). We therefore suspected these cells represented Basophil/Eosinophil/Mast cell progenitors (Ba/Eo/Ma), potentially similar to a FCϵRI+ population that was recently identified and validated in human bone marrow as an early precursor of mast cells (Dahlin et al., 2016). Cluster 1 was uniquely marked by high expression of clear megakaryocytic (Mk) markers including PF4, ITGA2B, and CLEC1A, suggesting a putative Mk progenitor identity for C1 cells.Additionally, our CD34+ subsets also consisted of transitioning populations that are abundant in human hematopoiesis, but lack well-characterized surface markers. For example, C4 cells lacked the expression of mature erythroblast markers, but expressed high levels of GATA2 and KIT, likely representing a transitioning population along the early stages of erythroid commitment. Similarly, clusters 7 and 8 were marked by elevated expression of FLT3 and CSF3R with gradually reduced levels of stem cell markers, likely representing populations similar to the lymphoid-primed multipotent progenitor (LMPP) that have previously been described in mouse (Adolfsson et al., 2005). These populations also exhibited high expression of L-Selectin (SELL; CD62L), consistent with a recent study which identified CD62L as an early marker of lympho-myeloid fate commitment in human bone marrow (Kohn et al., 2012). Taken together, we conclude that our broad sampling strategy enabled us to capture the continuous spectrum of blood differentiation, including both metastable progenitor states, as well as a spectrum of transitioning cells in between.Computational reconstruction of the early hematopoietic fate transitionsWhile clustering analyses categorizes cellular diversity and reveals the continuous nature of hematopoietic development, they impose a discrete framework and do not explicitly address the global topology of the progenitor states. Given the continuum identified from our clustering analyses, we reasoned that developmental trajectories could be reconstructed from single-cell 'snapshot' data to reveal the global structure of various progenitor states that followed a pseudo-temporal progression, by building a computational strategy for unbiased reconstruction (Figure 2A). Our strategy was tailored towards the reconstruction of cell-state hierarchies in single cell RNA-seq data, but shared an approach with pioneering methods that have been developed for recovering branched trajectories in CyTOF data (Qiu et al., 2011). Briefly, we began by ‘micro-clustering’ our dataset, subdividing our clusters into bins of 20 cells based on their relative developmental progression (STAR Methods). These micro-clusters therefore evenly spanned the hematopoietic continuum, yet gene expression variability within an individual micro-cluster was consistent with stochastic Poisson noise generated from a homogeneous population (Figure S2A, B). We therefore represent each micro-cluster by the average expression of all its cells, dramatically reducing the noise in single cell RNA-seq driven by sparse sampling. We then identified the hierarchy of the micro-clusters using minimum spanning trees, a commonly used approach for reconstructing topological relationships by learning the most parsimonious set of paths connecting all data points (Qiu et al., 2011; Trapnell et al., 2014). Notably, we used a bootstrapped approach to assess the reproducibility this final step, repeatedly running the spanning tree estimation on subsamples of our dataset, and ensuring that our results represent a single robust and highly branched hierarchy. Though the spanning tree is unrooted, we oriented the tree so that the HSC sits atop the hierarchy (Figures 2B, S2C). For the hierarchical relationships relating HSC to Erythroid, Basophil/Eosinophil/Mast, Neutrophil/Monocyte, and Lymphoid fates, we obtained identical results in 100% of the 500 bootstraps. However, our bootstrapping revealed inconsistencies in the placement of Mk-committed progenitors (Figure S2D), likely related to their rarity (0.7%), and we therefore focused on the remaining lineages for which we could draw robust conclusions, shown in Figure 2C, along with the dynamic expression levels of illustrative key developmental regulators and markers (Figure 2D).Our computationally built hierarchy uncovers four committed fates for cord blood CD34+ progenitors: Basophil/Eosinophil/Mast cells (Ba/Eo/Ma), Erythroblasts (Er), Neutrophils/Monocytes (Neu/Mo) and Lymphoid cells (Lym). The initial transition out of multi-potency is marked by a reduction in HSC-markers, and concomitant up-regulation of CDK6, which has been shown to control the exit from quiescence in cord blood hematopoiesis (Laurenti et al., 2015). We next observed two intermediate states connecting HSC/MPP to the committed progenitors. As shown in Figure 2B, the intermediate states did not separate into clear binary fate choices, echoing recent findings that early progenitors exist in a low-primed and continuous state (the 'CLOUD') Velten). However, as cells progressed and continued to down-regulate HSC markers, we observed segregation of intermediate progenitors into more distinct fate choices (Figure 2B), even prior to the commitment to a single lineage. One path was characterized by the up regulation of transcription factors and cell surface proteins that are commonly associated with the LMPP in both mouse and human, namely, FLT3, CSF3R, and SELL. This early fate transition drives differential expression of hundreds of downstream genes (Figure 3A), and is in tight agreement with revised hematopoietic models in both human and mouse that feature a lympho-myeloid competent progenitor cell lacking erythroid potential (Adolfsson et al., 2005; Kohn et al., 2012). LMPP cells subsequently subdivided into MPO-expressing neutrophil/monocyte progenitors, or MME (CD10)-expressing lymphoid progenitors (Galy et al., 1995). These results recapitulate well-established human hematopoietic models (Doulatov et al., 2012; Notta et al., 2016).Distinct transcriptomic trajectories give rise to distinct myeloid subsetsAlternately, a separate path was marked by sharp up-regulation of GATA2, alongside a robust gene set whose expression was widely shared between the Ba/Eo/Ma and Er lineages, but absent from LMPP and Neu/Mo progenitors. These included both well-studied (GATA1/2, IKZF2, STAT5A), and putative (AFF2, ZBTB16, BMP2K, CTNNBL1) regulators (Figure S3A). We therefore conclude that the sharp-upregulation of GATA2 coincides with entry into an intermediate state resembling an 'erythro-myeloid' progenitor (EMP), a recently described oligopotent progenitor state in mouse bone marrow that is largely driven by the activity of GATA factors. Notably, genes that were induced during the transition from HSC to LMPP - FLT3, CSF3R, SELL, CD99, were absent from the GATA2-dependent trajectory, including the Ba/Eo/Ma progenitors, but maintained high levels of expression in Neu/Mo progenitors.Taken together, our data strongly suggest that different myeloid subpopulations are generated via strikingly distinct trajectories. In particular, the early stages of a Neu/Mo stages are characterized by molecular transitions that are shared with early transitions accompanying lymphoid commitment, passing through an LMPP-like state as previously described in well-established models. By contrast, a smooth path from HSC to Ba/Eo/Ma begins with the induction of GATA2, alongside a module of genes that are often associated with the early stages of erythroid development, representing an EMP-like intermediate stage.Whileour reconstructed models represent computational hypotheses derived 'snapshot' data, they therefore provide complementary thatevidence that a recently proposed revised hematopoietic model in mouse, where distinct myeloid subsets are generated through GATA-dependent and independent pathways, also is relevant for human hematopoiesis. Indeed, previous functional data also supported a similar model in humans (Gorgens), but was dependent on surface marker enrichment and is complemented by our unsupervised molecular characterization. As the Ba/Eo/Ma progenitor is missing from standard human hematopoietic models, we thereforeintersected markers of this population with the Laurenti reference dataset, and observed that our Ba/Eo/Ma markers (e.g. TPSAB1, HDC) were most highly expressed in the ‘MEP’ gate, indicating that these progenitors express an immunophenotype that mixes them into a traditional MEP gate. As a phenotypic validation, we used flow cytometry to examine the surface phenotype of CD34+ CD117+ FceRI+ cells, representing a recently identified mast cell progenitor, and found that > 95% of cells in this gate expressed high levels for the transferrin receptor (TFRC/CD71), a marker commonly used to enrich for erythroid-committed cells (Figure 5B, citation for CD71). Taken together, these data suggest that the Ba/Eo/Ma progenitor is transcriptomically and immunophenotypically similar to erythroid-committed cells, consistent with a model where thesethe celldifferentiadifferentiation populationstrajectories shareof earlythese lineages share upstream molecular transitions. Characterizing 'primed' and 'de novo' transcriptional dynamicsOur reconstructed hierarchy provides a rich scaffold to understand the molecular transitions underlying fate commitment from HSC to downstream lineages. To globally identify genes involved in fate transitions, we designed an unsupervised strategy to identify genes with dynamic expression across the hierarchy. Briefly, we assigned pseudo-temporal time points to each micro-cluster according to their positions on the hierarchy, and individually examined the transition from stem cells to each committed lineage to look for genes differentially expressed during individual key fate decisions (Methods). We note that the strategy for gene calling was based on the identification of 'branch-points' within the topology, and this represents a simplification of our data, which show gradual commitment in early progenitors as opposed to distinct and clear branching patterns. However, this simplification was necessary for performing differential expression. More importantly, when identifying dynamic genes, we placed more weight on micro-clusters that were further from the initial 'branch point', consistent with the increasing degree of segregation that we observe in our data. Together, we identified a total of 517 genes having mRNA levels correlated with temporal progression of cell fates. We next performed constrained k-means clustering (Fritz et al., 2012) to group together genes with similar transcriptional dynamics across the entire hematopoietic hierarchy, identifying nine distinct gene modules (Figure 3A, B; Table S1).We first performed GO enrichment analysis (Kuleshov et al., 2016) for each module, and found that functional enrichments were in broad agreement with expression dynamics, including ‘hemopoiesis’ and ‘erythrocyte differentiation’ for Er-related programs, defense response for Neu/Mo modules, and lymphocyte proliferationand signaling pathways for lymphoid markers (Figure 3C). We also searched the 20 kilobases regions around transcription start sites (TSSs) of genes within each module for over-represented motifs (Herrmann et al., 2012), and found a striking separation between gene sets whose expression varied across the first hematopoietic fate transition (left vs. right side of Figure 3A). All gene sets up-regulated in the LMPP and downstream cells were strongly enriched for ETS-motifs (Figure 3D), which can be bound by PU.1, ETS1, ELK4, and other transcription factor families (Sharrocks, 2001). These motif enrichments mirrored the expression patterns of the transcription factors themselves (Figures 3E and S3A), as regulators with ETS-binding domains exhibited higher expression in LMPP compared to EMP. In contrast, genes up-regulated in the EMP exhibited strong enrichments for GATA motifs, again mirroring the expression dynamics of GATA factors. Taken together, these results indicate the regulatory roles of ETS and GATA factors in driving early fate transitions, where one route is driven by a wave of binding to GATA-motifs, and the other route is driven through ETS-motif activation by multiple transcriptional regulators.A closer look into our gene modules reveals that these gene sets exhibit complex patterns that extend beyond the concept of cluster markers. Whereas cluster markers suggest the enrichment of transcripts shared within a group of cells, our transcriptional programs demonstrate the expression dynamics specifically correlated with the progression of cell fates from early to late pseudo-temporal time points. By carefully examining the dynamic patterns, we could categorize these modules into 'primed' and 'de novo' programs based on their expression patterns upstream. For example, MME (CD10) is a canonical marker of lymphoid committed cells (Galy et al., 1995). While it is not expressed in early progenitors, it is activated ‘de novo’ upon lymphoid commitment. Similarly, genes with the same pattern as MME are together named as the 'de novo' lymphoid program. In contrast, the module containing the lymphoid-malignancy marker CD52 (Rodig et al., 2006) was detected as early as the HSC, extended through LMPP, and maintained its expression throughout lymphoid differentiation, but was down-regulated during fate commitment for all other lineages, hence named as the 'primed' lymphoid program. We also named other modules as 'primed'/'de novo' Neu/Mo, 'primed'/'de novo' EMP, 'de novo' Er and 'de novo' Ba/Eo/Ma programs according to the dynamic pseudo-temporal patterns (Table S3). With the identification of lineage-specific programs, we wondered if we could identify cells that co-express genes from programs of different lineages. Particularly, it would be interesting to see if such phenomenon exist in early progenitors, for co-expression could represent intermediate and multi-potent stages. Indeed, as in Figure 3A, LMPPs are co-expressing genes from both the 'primed' Lymphoid and the 'primed' Neu/Mo programs, but have yet to activate the ‘de novo’ programs for either lineage. This observation confirms the multi-potency of LMPPs as a transcriptomically intermediate stage inferred from single cell clustering and topology reconstruction. Similarly, the HSC populations co-express primed programs for all downstream lineages simultaneously. In addition, multi-lineage priming was observed in the context of 'primed' programs only, whereas 'de novo' genes were activated in cells upon commitment following the expression of 'primed' genes. This is different from most existing statements in the field, where multi-lineage priming was described for genes in the 'de novo' modules (Velten). Our analyses also reveal the dynamics of gene expression in fate commitment. Specifically, 'primed' programs for other lineages were down regulated, together with the activation of a corresponding 'de novo' gene set, which signifies the commitment toa distinct cellular fate. Early lineage priming is conserved between human bone marrow and cord bloodWhile our developmental hierarchy and transcriptional dynamics were based on CD34+ cells collected from umbilical cord blood, we wondered whether our conclusions were unique to this system, or were consistent with CD34+ cells in human bone marrow. Specifically, we wished to ask 1) whether similar transcriptional states and markers could be used to define progenitor populations in both tissues, 2) If so, whether Neu/Mo and Ba/Eo/Ma myeloid subsets in bone marrow exhibited distinct expression of 'primed' programs as we observed in cord blood, and 3) if present, do 'LMPP' cells in bone marrow co-express 'primed lymphoid' and forco-'primed Neu/Mo' modules as we see in cord blood. To facilitate this comparison, we applied our recently developed single cell integration approach to our Drop-seq micro-clusters,and the Velten et al. dataset profiled with Quartz-Seq.This approach applies canonical correlation analysis and non-linear warping techniques to 'align' subpopulations that exist in both datasets, based on shared sources of variation. We have previously demonstrated that this approach can successfully align two murine datasets of murine bone marrow hematopoietic progenitors produced with two different technologies, and therefore asked whether we could detect shared subpopulations between two human hematopoietic tissues.Indeed, integrated analysis successfully aligned both datasets (Figure 4A), with bone marrow CD34+ cells mapped to each of our Drop-seq progenitor states: HSC/MPP, LMPP, EMP, Er, Ba/Eo/Ma, Neu/Mo and Lym progenitors (Figure 4B). Strikingly, markers defining these states were extremely well conserved in both datasets, even though they were measured in different tissues and differenttechnologies (Figure 4C), and tSNE-visualization was strongly consistent with the same developmental hierarchy as we observed in umbilical cord blood. Cell type proportions were also quite similar betweentwo tissues, though we did observe a slight proportional shift towards more committed progenitors in bone marrow, consistent with the results from Notta etal. Taken together, our integrative analysis strongly suggests that the transcriptional states defining CD34+ progenitors in cord blood and bone marrow are tightly conserved. We next examined the expression of 'primed' programs in the early progenitor subsets. Here we applied a scoring method developed by Tirosh et al, to measure the expression level of a gene module in each cell while controlling for sampling-induced sparsity and dropout. As observed in cord blood, expression scores for 'primed' EMP and 'primed' LMPP programs strongly segregated Neu/Mo and Ba/Eo/Ma progenitors in cord blood, (supplementary figures). Similarly, cells annotated as the GATA2+ lineage from bone marrow had a higher score for the 'primed' EMP program, and a lowered LMPP score than the bone marrow FLT3+ cells (Figure 4D). In particular, we identified the Ba/Eo/Ma progenitor profile in both systems, and found that Ba/Eo/Ma lineage up regulated the 'primed' EMP program, and was relatively depleted of the 'primed' LMPP program compared with Neu/Mo progenitors in bone marrow as well. A closer look into bone marrow LMPPs also revealed the same multi-lineage priming phenomenon as seen in cord blood LMPPs, where both 'primed' lymphoid and 'primed' Neu/Mo modules were both enriched compared with EMP (Figure 4E?). Taken together, we concluded that the trajectories for early hematopoietic fate transitions were conserved between human bone marrow and umbilical cord blood.Hematopoietic fate transitions were reflected on chromatin accessibility
As the sheer volume of bioinformatic sequence data increases, the only way to take advantage of this content is to more completely automate robust analysis workflows. Analysis bottlenecks are often mundane and overlooked processing steps. Idiosyncrasies in reading and/or writing bioinformatics file formats can halt or impair analysis workflows by interfering with the transfer of data from one informatics tools to another. Fasta-O-Matic automates handling of common but minor format issues that otherwise may halt pipelines. The need for automation must be balanced by the need for manual confirmation that any formatting error is actually minor rather than indicative of a corrupt data file. To that end Fasta-O-Matic reports any issues detected to the user with optionally color coded and quiet or verbose logs. Fasta-O-Matic can be used as a general pre-processing tool in bioinformatics workflows (e.g. to automatically wrap FASTA files so that they can be read by BioPerl). It was also developed as a sanity check for bioinformatic core facilities that tend to repeat common analysis steps on FASTA files received from disparate sources. Fasta-O-Matic can be set with format requirements specific to downstream tools as a first step in a larger analysis workflow. Fasta-O-Matic is available free of charge to academic and non-profit institutions at https://github.com/i5K-KINBRE-script-share/read-cleaning-format-conversion/tree/master/KSU_bioinfo_lab/fasta-o-matic.
BACKGROUND: Genome assembly remains an unsolved problem. Assembly projects face a range of hurdles that confound assembly. Thus a variety of tools and approaches are needed to improve draft genomes. RESULTS: We used a custom assembly workflow to optimize consensus genome map assembly, resulting in an assembly equal to the estimated length of the _Tribolium castaneum_ genome and with an N50 of more than 1 Mb. We used this map for super scaffolding the _T. castaneum_ sequence assembly, more than tripling its N50 with the program Stitch. CONCLUSIONS: In this article we present software that leverages consensus genome maps assembled from extremely long single molecule maps to increase the contiguity of sequence assemblies. We report the results of applying these tools to validate and improve a 7x Sanger draft of the _T. castaneum_ genome. KEYWORDS: Genome map; BioNano; Genome scaffolding; Genome validation; Genome finishing