Results and Discussion

Dataset generation

High molecular weight DNA was isolated from young Tribolium castaneum pupae from the GA2 line that was inbred 20 generations. The GA2 line was also used for the genome sequence-based assembly.

Using Knickers (BioNano Genomics), an in silico label density calculator, we estimated that the Tribolium castaneum genome had 8.66 and 5.51 labels per 100 kb for the nt.BspQI and nt.BbvCI enzymes (New England BioLabs) respectively. The ideal number of labels per 100 kb is between 10 and 15 therefore we nicked with both enzymes. DNA was nicked, labeled with fluorescent nucleotides, and repaired according to BioNano protocol; and 93 BNX files were produced from the Irys genome mapping system (Figure 4 and Additional file 2). Four corrupted files (cumulative length = 0) were excluded from this analysis. All T. castaneum BNX files have been deposited to labarchives (doi:10.6070/H4V69GK3).

The high number of BNX files produced is due to several factors. Typically one BNX file is produced per flowcell; however, in certain cases after the initial number of scans, additional scans were run producing an additional BNX file. Another reason for the large number of files is that data was originally generated using the IrysChip V1 while the Irys System was under beta testing. Over time, maximum cumulative length per BNX file increased (Figure 4 and Additional file 2). After the upgrade to the IrysChip V2, an increase was also observed in maximum data per BNX file (Figure 4 and Additional file 2).

Molecule map quality metrics were calculated using bnx_stats (version 1.0). We generated \(\sim\)250x coverage of the T. castaneum genome for single molecule maps \(>\)150 kb, the default minimum molecule map length. The 239,558 single molecule maps with lengths \(>\)150kb had an N50 of 202.63 kb and a cumulative length of 50.6 Gb (Table 1). Histograms of per-molecule quality metrics for maps after applying a minimum length filter of either 100 kb, 150 kb and 180kb are reported in Additional file 3.

To generate the in silico maps, we used the sequence-based assembly scaffolds from version 5.0 of the Tribolium castaneum genome (Tcas5.0). Version 5.0 (Table 2) of the T. castaneum genome is an updated version of the sequence assembly that was created by adding 1.03 Mb of sequence to version 3.0 \cite{Beetle2008}. Two hundred and twenty-three of the 2240 scaffolds within Tcas5.0 were longer than 20 kb with more than 5 labels, the minimum requirements for an in silico map. These longer sequence scaffolds represent the bulk of the sequence assembly, 152.53 of the 160.74 Mb (Table 2).

Assembly: Selecting the Optimal BioNano Assembly

Single molecule maps were assembled de novo into five distinct BioNano genome maps for T. castaneum.

Molecule maps were prepared for assembly (Figure 2C) and noise parameters were estimated (Figure 2D-E) using AssembleIrysCluster (version 1.0). Specifically, AssembleIrysCluster calculated three -T parameters (5e-10, 5e-09 and 5e-08) based on the estimated genome size, and adjusted molecule map stretch and estimated molecule map noise parameters based on alignment to in silico maps. Assemblies with these p-value thresholds are named \(Relaxed\text{-}T\), \(Default\text{-}T\) and \(Strict\text{-}T\), respectively.

In the first round of selection, the \(Strict\text{-}T\) assembly was the best of these three assemblies because it has a cumulative size close to 200 Mb (Table 2 and Figure 5), the estimated size of the T. castaneum genome, and a small difference between non-redundant aligned length or breadth of alignment, and total aligned length (Figure 5, Table 3 and Additional file 4). Thus in the second round of selection, \(Strict\text{-}T\) parameter was used for two further assemblies that had relaxed minimum molecule length (\(Relaxed\text{-}Minlen\)) of 100 kb rather than the 150 kb default or a strict minimum molecule length (\(Strict\text{-}Minlen\)) of 180 kb. The \(Strict\text{-}TAndStrict\text{-}Minlen\) assembly improved alignments by reducing redundancy slightly. However the cumulative length of the assembly was 21.45 Mb smaller than the estimated genome size. The \(Strict\text{-}TAndRelaxed\text{-}Minlen\) assembly had a worse cumulative length and alignment redundancy than the \(Strict\text{-}T\) assembly. Because neither of the assemblies using 100 or 180 kb as the minimum molecule length improved both assembly metrics when compared to the \(Strict\text{-}T\) assembly, generated in the first round of selection with the default \(Minlen\) of 150 kb, the \(Strict\text{-}T\) assembly will be referred to as the T. castaneum consensus genome maps in further analysis.

The T. castaneum consensus genome maps have an N50 of 1.35 Mb, a cumulative length of 200.47 Mb, and 216 genome maps (Table 2 and Figure 5). T. castaneum consensus genome maps aligned to 124.04 Mb of the in silico maps created from the Tcas5.0 assembly validating 81% of the draft sequence assembly (Table 3 and Figure 5). Assembly metrics were calculated using the BNGCompare script (version 1.0). More detailed assembly metrics for all five assembled consensus genome maps are available in Additional file 4. Assembled T. castaneum genome maps have been deposited to labarchives (doi: 10.6070/h42f7kf2).

Stitch: Automated and Manually Edited Assemblies

Tcas5.1 is the output of Stitch (version 1.4.4) run for five iterations with two sets of alignment filters. To select quality alignments from regions of high and low label density, the first minimum confidence was 13 and the \(PAT\) was 30 and the second minimum confidence was 8 and the \(PAT\) was 90. The resulting super scaffolds showed a greater than three-fold increase in N50 from 1.16 to 3.85 Mb (Table 2).

The Tcas5.1 super scaffolds captured an additional 92 gaps between Tcas5.0 sequence scaffolds. Sixty-six gaps were estimated to have positive gap lengths and were represented in the sequence assembly with their estimated size (Figure 6). Twenty-six gaps were estimated to have negative lengths and were represented with spacers of 100 N’s (Figure 6). Extremely small negative gap lengths (\(< -20\) kb) were flagged for further evaluation at a sequence level. In some cases, extremely small negative gaps lengths suggest that a chimeric sequence scaffold may need to be broken at the sequence level and its fragments incorporated into different ChLGs. For example, half of scaffold 81 from Tcas5.0 aligns between scaffolds 80 and 82 on ChLG5 (Figure 7A) while the other half aligns between scaffolds 102 and 103 from ChLG7 (Figure 7B). Scaffold 81 from Tcas5.0 was placed in ChLG5 in the T. castaneum genetic map. The arrangement supported by the genetic map was selected for cases like this where manual editing was required. The manually curated assembly is referred to as Tcas5.2. In Tcas5.2, joins were also manually accepted if they agreed with the genetic map but the alignment quality was low. Ultimately, Tcas5.2 had a higher N50 than the automated Stitch output, 4.46 Mb (Table 2), with 66 gaps with positive estimated lengths and 24 negative length gaps.

Nearly every ChLG was less fragmented in the Tcas5.2 assembly than in the Tcas5.0 assembly. The number of scaffolds was reduced for each ChLG (Table 4) except for the 26 relatively short (N50 = 0.05 Mb) and unlocalized scaffolds from ChLGY. For example, ChLGX was reduced from 13 scaffolds (Figure 8B) to 2 in the final super scaffold or chromosome build (Figure 8A). Five scaffolds were reoriented in ChLGX based on alignment to the consensus genome maps (Figure 8A-B). Scaffolds were also reordered based on alignment to consensus genome maps. For example, scaffold 2 from ChLGX aligned between scaffold 36 and 37 of ChLG 3 and was therefore moved in Tcas5.2 (Additional file 5). Also, 19 previously unplaced scaffolds were anchored within a ChLG (Table 4). Improvements in Tcas5.2 over Tcas5.0 are shown in alignments of the respective in silico maps to consensus genome maps for all ChLGs in Additional file 5.

The Tcas5.2 Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession AAJJ00000000. The version described in this paper is version [DDBJ:AAJJ02000000,EMBL:AAJJ02000000,GenBank:AAJJ02000000]. Two scaffolds were removed from the genome assembly because they were identified as contaminants after they blasted to the Bos frontalis genome.

Assembly: Putative Haplotypes

Evidence of putative haplotypes was found during visual inspection of alignments. Although overall alignment redundancy was rare for the \(Strict\text{-}T\) assembly; when observed, it usually consisted of only two consensus genome maps aligning to the same in silico map (Figure 9). This redundancy might indicate collapsed repeats in the sequence-based assembly. On the other hand it could also indicate segmental duplications, assembly of alternative haplotypes, or mis-assembly producing redundant consensus genome maps. In Figure 9A, two consensus genome maps aligned to the same in silico map. One consensus genome map aligned across most of the in silico map while only a small region of the other consensus genome maps aligned to the the in silico map (Figure 9A). Molecule map coverage decreased for each consensus genome map in the region where only one of them aligned to the in silico map (Figure 9A-C). Taken together, the region of lower coverage and the number of consensus genome maps aligning (two) may indicate the assembly of two haplotypes.

Stitch: Comparison to Other Software

We also ran Hybrid Scaffold (BioNano version 3076) to improve the T. castaneum sequence assembly using the T. castaneum consensus genome maps and the in silico maps from Tcas5.0. Although Hybrid Scaffold improved the scaffold N50 of the sequence-based assembly from 1.16 Mb to 1.83 Mb, the increase in N50 was not as great as the increase after running Stitch (N50 = 3.85 Mb) (Table 2). The total number of scaffolds in the sequence assembly decreased by 30 after running Hybrid Scaffold; in comparison, Stitch reduced the number of scaffolds by 92. Stitch increased the length of the assembly by 4.98 Mb while Hybrid Scaffold increased assembly length by 14.80 Mb. The additional increase in length from Hybrid Scaffold is likely due to extension of sequence-based scaffolds with end gaps introduced from genome map contigs that overlap the start or end of a sequence-based scaffold.

Overall, we found Hybrid Scaffold to be more conservative than Stitch. For example in Figure 8B, the alignment of 13 in silico maps to three consensus genome maps was input into both Stitch and Hybrid Scaffold. In this alignment, the order of 11 of these in silico maps agreed with the order suggested by the genetic map (after reorienting three in silico maps). The in silico maps 12 and 13 aligned with a negative gap length between them, suggesting they may be mis-assembled or that local assembly may collapse the assembly in this region. There are several possible approaches when considering this kind of conflicting evidence. The approach of Stitch was to record that we have confirmed the relative position of these scaffolds in the larger context of the genome by creating a new super scaffold containing 100 bp spacer gaps to indicate that exact overlap or gap length is unclear (Figure 8A). Stitch also reports the negative gap length to indicate the need for further sequence level evaluation at a later date. Alternatively, Hybrid Scaffold only automates genome improvements that are unambiguously supported by all lines of evidence (sequence-based and genome map-based) and leaves any ambiguous decisions to a human curator. Therefore, Hybrid Scaffold would only report the conflicting alignment. This is why scaffold 13, for example, is not included in the hybrid map produced by Hybrid Scaffold (Figure 8C).

For highly refined genome assemblies, this lack of tolerance for noisy alignments has not hindered improvement of genome projects. However, for less refined genome assemblies this may be too stringent. The Hybrid Scaffold software, for example, was developed to scaffold the human genome and has been found to work well for this application. Genome projects at earlier stages would benefit from staged release of updates (e.g., immediate release of general improvement in scaffold order and orientation followed by further refinement at the sequence level). For projects such as T. castaneum, a more aggressive algorithm such as Stitch may be preferred in order to release the bulk of the new information about the higher order arrangement of the genome to the community of T. castaneum researchers quickly. Further refinement at a sequence level can be released in subsequent genome updates as they are completed.

Stitch: Assembly and Super Scaffolding With Multiple Genera

We examined experiments from 16 different genera to determine if the results seen for the Tribolium castaneum genome are typical for other genomes as well. The T. castaneum genome map N50 was found to be in the high end of the probability density distribution (Additional file 6; Section 1). The same is true for the Tcas5.0 draft sequence assembly N50 and percent of N50 improvement after super scaffolding compared to the other 17 of 19 total projects that had draft sequence genomes (Additional file 6; Section 1). However, in no case was the T. castaneum value the highest value recorded, suggesting that a wide range of output quality is possible including values better and worse than the output for T. castaneum.

We checked for evidence of correlations between a range of genomic metrics and map assembly, alignment or FASTA super scaffolding results. Because many of the genomic metrics had very broad ranges with variance that increased often for higher values the genomic metrics were log transformed to compress the upper tails and stretch the lower tails of the distributions.

Overall we found little correlation between either sequence FASTA N50, molecule map coverage or molecule map label density and final genome map N50. We did, however, find correlations between finished map and sequence assembly metrics and alignment and super scaffolding quality. There is a positive correlation between high value sequence assembly metrics and in silico map-to-genome map alignment metrics (Additional file 6; Sections 3-5) as well as post super scaffolding N50 improvement (Additional file 6; Sections 3,5). There is also a positive correlation between high value genome map assembly metrics and post super scaffolding N50 improvement (Additional file 6; Sections 3,5). However, no direct correlation was found between sequence assembly N50 and genome map N50 (Additional file 6; Sections 4-5). Taken together the analysis suggests different factors may determine sequence assembly and genome map assembly quality. Although sequence assembly N50 may not be useful to predict genome map N50, if both independent assemblies have high N50’s than more of the map lengths may align and super scaffolding may be more productive.

The low degree of correlation found between genome map N50 and sequence N50 may stem from steps unique to the molecule map imaging process. It might be expected that a genome with sequence that assembles well may have qualities that would also favor molecule map assembly (e.g. low repeat content, low ploidy, inbreed lines, etc.). However molecule map assembly is also influenced by unique factors like frequency of fragile sites (two labels occurring on opposite strands in close proximity), labeling efficiency and ability to extract high molecular weight DNA all of which vary for different organisms.

Principal component analysis suggests a negative correlation between labels per 100 kb and molecule coverage (Additional file 6: Sections 2-3). The correlation between labels per 100 kb and molecule coverage was weakly significant in individual regression (Additional file 6; Sections 4-5). Labels per 100 kb are monitored as molecules are being imaged. Lower than expected label density can occasionally lead to further labeling reactions or other adjustments to data collection and therefore greater depth of coverage.

Overall, comparison of the results for the T. castaneum genome and 19 additional genome projects suggest that results may vary widely from project to project. Many factors may contribute to this effect including the quality of the sequence assembly, degree of divergence between the organism or organisms used to extract DNA, success of extraction and labeling of high molecular weight DNA, genome size and genome complexity. In fact, the tendency for assemblies from the same genera or species to cluster together on the PCA plots suggests that organism-specific qualities may influence assembly, alignment or super scaffolding results (Additional file 6: Sections 2-3). Although analysis of more projects is needed to determine if these similarities are meaningful predictors of output quality.