Assembly and super scaffolding with multiple genera.
We examined experiments from 16 different genera to determine whether the results seen for the Tribolium castaneum genome are typical of other genomes as well. The T. castaneum genome map N50 was found to be at the high end of the probability density distribution (Additional file 6; Figure 1). The same is true for the Tcas5.0 draft sequence assembly N50 and for the percent N50 improvement after super scaffolding, compared to the other 17 of 19 total projects that had draft sequence genomes (Additional file 6; Figure 1). However, in no case was the T. castaneum value the highest value recorded, suggesting that a wide range of output quality is possible, including values both better and worse than the output for T. castaneum.
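For readers unfamiliar with the metric, the following is a minimal sketch of how an N50 and a percent N50 improvement could be computed. The function names and the example lengths are hypothetical, and the improvement formula (relative change from the pre-scaffolding N50) is an assumption, since the exact definition used for the comparison is not restated here.

```python
def n50(lengths):
    """Return the N50 of a set of scaffold (or genome map) lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


def percent_improvement(n50_before, n50_after):
    """Assumed definition: relative change in N50, expressed in percent."""
    return 100.0 * (n50_after - n50_before) / n50_before


# Hypothetical scaffold lengths (arbitrary units), before and after
# super scaffolding joins the two largest scaffolds into one.
before = [80, 70, 50, 40, 30, 20]
after = [150, 50, 40, 30, 20]

print(n50(before), n50(after))
print(percent_improvement(n50(before), n50(after)))
```

Because N50 depends only on the length distribution, joining a few large scaffolds can raise it substantially even when most scaffolds are unchanged.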
We checked for evidence of correlations between a range of genomic metrics and map assembly, alignment or FASTA super scaffolding results. Because many of the genomic metrics had very broad ranges, with variance that often increased at higher values, the metrics were log transformed to compress the upper tails and stretch the lower tails of the distributions.
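The effect of the log transformation can be illustrated with a small sketch. The base (10) and the example values are assumptions chosen for illustration only; the base used in the analysis is not specified here.

```python
import math


def log10_transform(values):
    """Return log10 of each value; all values must be strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform requires strictly positive values")
    return [math.log10(v) for v in values]


# Hypothetical N50 values (bp) spanning several orders of magnitude.
raw_n50 = [20_000, 50_000, 1_000_000, 40_000_000]
log_n50 = log10_transform(raw_n50)

# Gaps between neighbouring values before and after the transform.
raw_gaps = [b - a for a, b in zip(raw_n50, raw_n50[1:])]
log_gaps = [b - a for a, b in zip(log_n50, log_n50[1:])]

print(raw_gaps)
print([round(g, 3) for g in log_gaps])
```

On the raw scale the largest gap is more than a thousand times the smallest, so the few largest values dominate any fit. On the log scale the gaps are of comparable size.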
Overall, we found little correlation between sequence FASTA N50, molecule map coverage or molecule map label density and final genome map N50. We did, however, find correlations between finished map and sequence assembly metrics and alignment and super scaffolding quality. There is a positive correlation between high value sequence assembly metrics and in silico map-to-genome map alignment metrics (Additional file 6; Figures 3-5) as well as post super scaffolding N50 improvement (Additional file 6; Figures 3,5). There is also a positive correlation between high value genome map assembly metrics and post super scaffolding N50 improvement (Additional file 6; Figures 3,5). However, no direct correlation was found between sequence assembly N50 and genome map N50 (Additional file 6; Figures 4-5). Taken together, the analysis suggests that different factors may determine sequence assembly and genome map assembly quality. Although sequence assembly N50 may not be useful to predict genome map N50, if both independent assemblies have high N50s, then more of the map lengths may align and super scaffolding may be more productive.
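As an illustration of the kind of pairwise comparison described above, the sketch below computes a Pearson correlation coefficient on log-transformed metrics. The choice of Pearson's r, the log base, the function names and the data are all assumptions made for the example; they are not taken from the analysis itself.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def log_pearson_r(xs, ys):
    """Pearson r after a log10 transform of both (strictly positive) samples."""
    return pearson_r([math.log10(x) for x in xs],
                     [math.log10(y) for y in ys])


# Hypothetical per-project metrics: sequence assembly N50 (bp) and
# percent N50 improvement after super scaffolding.
sequence_n50 = [40_000, 150_000, 900_000, 4_000_000]
n50_improvement = [5.0, 20.0, 45.0, 180.0]

print(round(log_pearson_r(sequence_n50, n50_improvement), 3))
```

A coefficient near zero for one pair of metrics (for example sequence N50 against genome map N50) alongside a clearly positive coefficient for another is the pattern the text describes.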
The low degree of correlation found between genome map N50 and sequence N50 may stem from steps unique to the molecule map imaging process. It might be expected that a genome with sequence that assembles well would have qualities that also favor molecule map assembly (e.g. low repeat content, low ploidy, inbred lines, etc.). However, molecule map assembly is also influenced by unique factors such as the frequency of fragile sites (two labels occurring on opposite strands in close proximity), labeling efficiency and the ability to extract high molecular weight DNA, all of which vary among organisms.
Principal component analysis suggests a negative correlation between labels per 100 kb and molecule coverage (Additional file 6; Figures 2-3). The correlation between labels per 100 kb and molecule coverage was weakly significant in individual regression (Additional file 6; Figures 4-5). Labels per 100 kb are monitored as molecules are being imaged. Lower-than-expected label density can occasionally lead to further labeling reactions or other adjustments to data collection, and therefore to greater depth of coverage.
Overall, comparison of the results for the T. castaneum genome and 19 additional genome projects suggests that results may vary widely from project to project. Many factors may contribute to this effect, including the quality of the sequence assembly, the degree of divergence between the organism or organisms used to extract DNA, the success of extraction and labeling of high molecular weight DNA, genome size and genome complexity. In fact, the tendency for assemblies from the same genera or species to cluster together on the PCA plots suggests that organism-specific qualities may influence assembly, alignment or super scaffolding results (Additional file 6; Figures 2-3), although analysis of more projects is needed to determine whether these similarities are meaningful predictors of output quality.
JMS, MCC, NH, NL, and SJB declare that they have no competing interests. ETL, PS and TA are employees at BioNano Genomics and hold stock options.
Matthias Weissensteiner & Jochen Wolf, Uppsala University. Stephen Schaeffer from The Pennsylvania State University and Stephen Richards from the Baylor College of Medicine Human Genome Sequencing Center for the use of the D. pseudoobscura data. Mike Kanost from Kansas State University. Jeff Maughan from Brigham Young University for the use of the Amaranth data. The Udall Lab from Brigham Young University and Cotton Inc. for the use of the cotton data. Grant (NSF 1237993) for use of the Medicago data. Christopher Cunningham, University of Georgia for the use of Nicrophorus data. Catherine Peichel from the Fred Hutchinson Cancer Research Center and Michael White from the University of Georgia for the Gasterosteus data. Mirkó Palla, Ph.D., Wyss Institute Postdoctoral Fellow, Church Laboratory, Department of Genetics, Harvard Medical School and George Church, Ph.D., Wyss Institute Core Faculty Member, Robert Winthrop Professor of Genetics at Harvard Medical School, Professor of Health Sciences and Technology at Harvard and MIT, and Senior Associate Member at the Broad Institute of Harvard and MIT for the Escherichia coli data.