Additional file 6

Jennifer Shelton,
Cassondra Coleman,
Nic Herndon,
Nanyan Lu,
Ernest Lam,
Thomas Anantharaman,
Palak Sheth,
Sue Brown

Abstract

Assembly and super scaffolding with multiple genera.

We examined experiments from 16 different genera to determine whether the results seen for the Tribolium castaneum genome are typical of other genomes. The T. castaneum genome map N50 was at the high end of the probability density distribution (Additional file 6; Figure 1). The same is true of the Tcas5.0 draft sequence assembly N50 and of the percent N50 improvement after super scaffolding, compared to the other 17 of 19 total projects that had draft sequence genomes (Additional file 6; Figure 1). However, in no case was the T. castaneum value the highest recorded, suggesting that a wide range of output quality is possible, including values both better and worse than those for T. castaneum.
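As a reminder of what the two sequence metrics above measure, the following is a minimal sketch of an N50 calculation and of percent N50 improvement. The function names and the scaffold lengths are hypothetical and chosen only for illustration; this is not the pipeline used to produce the figures.

```python
def n50(lengths):
    """Return the N50 of a set of contig, scaffold, or map lengths.

    The N50 is the length L such that pieces of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


def percent_n50_improvement(n50_before, n50_after):
    """Percent change in N50 after super scaffolding, relative to the draft."""
    return 100.0 * (n50_after - n50_before) / n50_before


# Hypothetical scaffold lengths (arbitrary units), before and after
# two scaffolds are joined into one super scaffold.
draft = [40, 30, 20, 10]
super_scaffolded = [70, 20, 10]

draft_n50 = n50(draft)
final_n50 = n50(super_scaffolded)
improvement = percent_n50_improvement(draft_n50, final_n50)
```

Because N50 depends on the whole length distribution, joining only a few large scaffolds can change it substantially, which is one reason percent improvement can vary so widely between projects.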

We checked for evidence of correlations between a range of genomic metrics and map assembly, alignment, or FASTA super scaffolding results. Because many of the genomic metrics had very broad ranges, with variance that often increased for higher values, the metrics were log transformed to compress the upper tails and stretch the lower tails of the distributions.
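The effect of the log transformation can be illustrated with a short sketch. The base of the logarithm is not stated here, so base 10 is an assumption, and the N50 values are hypothetical.

```python
import math


def log10_transform(values):
    """Log-transform strictly positive metric values (base 10 is an assumption)."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform requires strictly positive values")
    return [math.log10(v) for v in values]


# Hypothetical N50 values (bp) spanning three orders of magnitude.
raw_n50 = [1_000, 10_000, 100_000, 1_000_000]
logged = log10_transform(raw_n50)

# On the raw scale the gaps between neighbours grow from 9,000 to 900,000;
# on the log scale every gap is the same size.
raw_gaps = [b - a for a, b in zip(raw_n50, raw_n50[1:])]
log_gaps = [b - a for a, b in zip(logged, logged[1:])]
```

After transformation, a tenfold difference occupies the same distance anywhere on the axis, so a few very large values no longer dominate a regression or a PCA.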

Overall, we found little correlation between final genome map N50 and any of sequence FASTA N50, molecule map coverage, or molecule map label density. We did, however, find correlations between finished map and sequence assembly metrics and the quality of alignment and super scaffolding. There is a positive correlation between high-value sequence assembly metrics and in silico map-to-genome map alignment metrics (Additional file 6; Figures 3-5) as well as post super scaffolding N50 improvement (Additional file 6; Figures 3,5). There is also a positive correlation between high-value genome map assembly metrics and post super scaffolding N50 improvement (Additional file 6; Figures 3,5). However, no direct correlation was found between sequence assembly N50 and genome map N50 (Additional file 6; Figures 4-5). Taken together, the analysis suggests that different factors may determine sequence assembly quality and genome map assembly quality. Although sequence assembly N50 may not be useful for predicting genome map N50, if both independent assemblies have high N50s, then more of the map lengths may align and super scaffolding may be more productive.
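One simple way to quantify pairwise relationships like those described above is a Pearson correlation coefficient on the log-transformed metrics. The sketch below is illustrative only: the choice of Pearson's r is an assumption rather than a statement of the method behind the figures, and all values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-project metrics, already log10 transformed.
log_sequence_n50 = [4.0, 4.5, 5.0, 5.5, 6.0]
log_percent_improvement = [0.8, 1.1, 1.3, 1.8, 2.0]

r = pearson_r(log_sequence_n50, log_percent_improvement)
```

A value of r near 1 or -1 indicates a strong linear relationship on the log scale, and a value near 0 indicates little linear relationship, as reported above for sequence assembly N50 versus genome map N50.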

The low degree of correlation found between genome map N50 and sequence N50 may stem from steps unique to the molecule map imaging process. A genome whose sequence assembles well might be expected to have qualities that also favor molecule map assembly (e.g. low repeat content, low ploidy, inbred lines, etc.). However, molecule map assembly is also influenced by unique factors such as the frequency of fragile sites (two labels occurring on opposite strands in close proximity), labeling efficiency, and the ability to extract high molecular weight DNA, all of which vary among organisms.

Principal component analysis suggests a negative correlation between labels per 100 kb and molecule coverage (Additional file 6; Figures 2-3). The correlation between labels per 100 kb and molecule coverage was weakly significant in individual regression (Additional file 6; Figures 4-5). Labels per 100 kb are monitored as molecules are being imaged. Lower-than-expected label density can occasionally lead to further labeling reactions or other adjustments to data collection, and therefore to greater depth of coverage.

Overall, comparison of the results for the T. castaneum genome and 19 additional genome projects suggests that results may vary widely from project to project. Many factors may contribute to this effect, including the quality of the sequence assembly, the degree of divergence between the organism or organisms used to extract DNA, the success of extraction and labeling of high molecular weight DNA, genome size, and genome complexity. In fact, the tendency for assemblies from the same genera or species to cluster together on the PCA plots suggests that organism-specific qualities may influence assembly, alignment, or super scaffolding results (Additional file 6; Figures 2-3). However, analysis of more projects is needed to determine whether these similarities are meaningful predictors of output quality.