ROUGH DRAFT authorea.com/39331

# Tools and pipelines for BioNano data: molecule assembly pipeline and FASTA super scaffolding tool

Abstract

Background: Genome assembly remains an unsolved problem. Assembly projects face a range of hurdles, so a variety of tools and approaches are needed to improve draft genomes.

Results: We used a custom assembly workflow to optimize consensus genome map assembly, resulting in an assembly equal to the estimated length of the Tribolium castaneum genome and with an N50 of more than 1 Mb. We used this map for super scaffolding the T. castaneum sequence assembly, more than tripling its N50 with the program Stitch.

Conclusions: In this article we present software that leverages consensus genome maps assembled from extremely long single molecule maps to increase the contiguity of sequence assemblies. We report the results of applying these tools to validate and improve a 7x Sanger draft of the T. castaneum genome.

Keywords: Genome map; BioNano; Genome scaffolding; Genome validation; Genome finishing

# Background

The quality and contiguity of genome assemblies, which impact downstream analysis, vary greatly (Salzberg 2012, Vezzi 2012, Bradnam 2013). Initial assembly drafts, whether based on lower coverage Sanger or higher coverage NGS reads, are often highly fragmented. Physical maps of BAC clones can be used to validate and scaffold sequence assemblies, but the molecular, human, and computational resources required to significantly improve a draft genome are often not available to researchers working on non-model organisms. The BioNano Irys System linearizes and images nicked and fluorescently labeled long DNA strands to generate single molecule physical maps. The Irys System provides affordable, high throughput physical maps of significantly higher contiguity with which to validate draft assemblies and extend scaffolds (Das 2010).

Genome assembly and scaffolding algorithms are inherently limited by the length of the DNA molecules used as starting material to generate data. Specifically, if repetitive, polymorphic or low complexity regions are longer than the single molecules used to generate data, then they cannot be resolved by bioinformatics tools with certainty. The specifications for PacBio P6-C4 chemistry (PacBio P6-C4 chemistr...) indicate that PacBio reads have an N50 of 14 kb with a maximum length of 40 kb. Illumina Long Distance Jump Libraries can also span 40 kb (Illumina Long Distanc...). MinION nanopore sequence reads have an average read length of $$< 7$$ kb (Quick 2014). Illumina TruSeq synthetic long-reads can span up to 18.5 kb; however, they fail to assemble if the sequence has problematic regions longer than the component reads used to assemble the synthetic reads (e.g. in the heterochromatin) (McCoy 2014). The OpGen Argus (Teague 2012) platform produces optical maps that have a length of 150 kb to 2 Mb from up to 13 Gb of data collected per run (Opgen Argus MapCard s...). The Irys System from BioNano Genomics produces single molecule maps that have an average length of 225 kb from up to 96 Gb of data collected per run after filtering out molecules $$<$$ 150 kb (BioNano IrysChips spe...). Genomic repeats can be much longer than the 5-40 kb that many technologies can span with a single molecule. In fact, a recent study used consensus genome maps (assembled from single molecule maps $$>$$ 150 kb) to identify repeats that are hundreds of kb long in the human genome (Cao 2014).
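The N50 statistic quoted above for read lengths, and used elsewhere in this article for assemblies, can be computed with a few lines of code. The following is a minimal sketch in Python; the function name `n50` and the example lengths are illustrative assumptions, not part of any tool described here.

```python
def n50(lengths):
    """Return the N50 of a collection of lengths.

    The N50 is the largest length L such that the items of length >= L
    together account for at least half of the summed length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest item down until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical lengths in kb (illustrative values only, not real data).
example_lengths_kb = [40, 20, 14, 14, 10, 2]
print(n50(example_lengths_kb))
```

Because the N50 weights items by their length, a handful of very long molecules or scaffolds can raise it substantially even when most items are short.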

Sequence-based assembly methods are fraught with platform-specific error profiles (e.g. resolving homopolymer repeats or read-position effects on base quality) (Ross 2013). Map-based approaches offer an orthogonal genomic resource that complements sequence-based approaches without sharing their error profiles. For example, map-based error models tend to consider errors in estimated molecule or fragment length and errors associated with restriction sites that are too close together, neither of which influences sequence-based approaches (Mendelowitz 2014, Cao 2014). Both the BioNano Irys System and the OpGen Argus platform provide single molecule maps from genomic DNA. OpGen may provide higher resolution maps by using enzymes with a six rather than a seven base pair recognition site, but BioNano’s single molecule maps still deliver a more efficient and affordable method for generating whole genome maps.

## Data formats

The tools described make use of three file formats developed by BioNano. The Irys System images extremely long molecules of genomic DNA that are nick-labeled at seven bp motifs using one or more nicking endonucleases and fluorescently labeled nucleotides. Molecules captured in TIFF images are converted to BNX format text files that describe the detected label positions for each molecule (Figure 1(1-2)). The individual molecules described in BNX files are referred to as single molecule maps. Consensus Map (CMAP) files include the map lengths and label positions for long genomic regions that are either inferred by assembling raw single molecule maps (Figure 1(7-8)) or generated in silico from sequence scaffolds (Figure 1(3-4)). Individual maps in these two types of CMAP files are referred to as consensus genome maps or in silico maps, respectively. The alignment between two CMAP files is stored as an XMAP file that includes alignment coordinates and an alignment confidence score (Figure 1(10)).
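To illustrate how an in silico map can be derived from a sequence scaffold, the sketch below scans a sequence for a seven bp motif on both strands and reports where it occurs. This is a simplified illustration under stated assumptions: the motif `GCTCTTC` is only an example recognition site, the function names are hypothetical, positions are reported as 0-based motif starts rather than exact nick sites, and no CMAP file is written.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (unknown bases become N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement.get(base, "N") for base in reversed(seq.upper()))


def find_label_positions(sequence, motif):
    """Return sorted 0-based start positions of the motif on either strand."""
    sequence = sequence.upper()
    motif = motif.upper()
    # A nicking enzyme can recognize its site on either strand, so search
    # for the motif and for its reverse complement.
    targets = {motif, reverse_complement(motif)}
    positions = set()
    for target in targets:
        start = sequence.find(target)
        while start != -1:
            positions.add(start)
            start = sequence.find(target, start + 1)
    return sorted(positions)


# Hypothetical toy scaffold and example motif (assumptions for illustration).
toy_scaffold = "AAGCTCTTCAAAAGAAGAGCTT"
print(find_label_positions(toy_scaffold, "GCTCTTC"))
```

The resulting list of positions, together with the scaffold length, is the kind of information that a CMAP file records for each in silico map.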

## Other software tools for scaffolding with BioNano data

BioNano Genomics developed the Hybrid Scaffold tool to create more contiguous consensus genome maps using information from both sequence and BioNano genome map data. These more contiguous maps can then be used to create more contiguous sequence assemblies. The Hybrid Scaffold software first creates hybrid in silico/consensus genome map contigs based on an alignment between the two. The output genome maps are called hybrid scaffolds and are aligned to the original in silico maps. This alignment is used to output a FASTA file of sequence super scaffolds. These sequences include seven base pair ambiguous-base motifs to indicate where labels occur within gaps. Because the super scaffolds extend into regions with consensus genome maps but without sequence data, they may begin or end with gaps. The Hybrid Scaffold program only generates hybrid in silico/consensus genome map contigs, and therefore super scaffolds, if no conflicts (e.g. negative gap lengths or otherwise conflicting alignments) are indicated in the alignment of in silico and consensus genome maps. In this conservative approach, all conflicting alignments are excluded from the hybrid scaffold genome map and flagged for further evaluation at the sequence level.
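The notion of a negative gap length can be made concrete with a small example. The sketch below is not the Hybrid Scaffold implementation; it is a minimal illustration, assuming forward-oriented alignments only, of how gap lengths between neighboring scaffolds could be estimated from alignment coordinates on a shared consensus genome map. The `Alignment` class, its field names, and the coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical record of one scaffold aligned to a consensus genome map."""
    scaffold: str
    scaffold_length: int  # length of the in silico map (bp)
    qry_start: int        # alignment start on the in silico map
    qry_end: int          # alignment end on the in silico map
    ref_start: int        # alignment start on the consensus genome map
    ref_end: int          # alignment end on the consensus genome map


def projected_span(aln):
    """Project the full scaffold, including unaligned ends, onto the map."""
    start = aln.ref_start - aln.qry_start
    end = aln.ref_end + (aln.scaffold_length - aln.qry_end)
    return start, end


def gaps_between_neighbors(alignments):
    """Return (left, right, gap) for each pair of neighboring scaffolds.

    A negative gap means the projected scaffolds overlap, which would
    need further evaluation at the sequence level.
    """
    ordered = sorted(alignments, key=lambda a: projected_span(a)[0])
    gaps = []
    for left, right in zip(ordered, ordered[1:]):
        gap = projected_span(right)[0] - projected_span(left)[1]
        gaps.append((left.scaffold, right.scaffold, gap))
    return gaps


# Hypothetical coordinates for three scaffolds on one consensus genome map.
example = [
    Alignment("A", 100000, 10000, 90000, 50000, 130000),
    Alignment("B", 80000, 5000, 80000, 160000, 235000),
    Alignment("C", 50000, 0, 50000, 230000, 280000),
]
for left, right, gap in gaps_between_neighbors(example):
    status = "conflict" if gap < 0 else "ok"
    print(left, right, gap, status)
```

In this toy case scaffolds A and B are separated by a positive gap, while B and C project onto overlapping regions of the map and therefore yield a negative gap.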

## Motivation

We designed tools and workflows to optimize the use of single molecule maps in the construction of whole genome maps and then use the best resultant consensus genome maps to improve contiguity of draft genome sequence assemblies. Single molecule maps were assembled into BioNano genome maps de novo using software tools developed at BioNano (Cao 2014). As with sequence-based assembly algorithms, it was noted that testing a range of assembly parameters can improve final assembly quality for the BioNano Assembler. Additionally, applying error correction to molecule map stretch was found to improve assembly quality. Therefore, we created AssembleIrysCluster to normalize molecule map stretch and automate the writing of assembly scripts that use various parameters. We created the Stitch tool to super scaffold sequence-based assemblies using alignments to the optimal BioNano genome map. The Hybrid Scaffold and Stitch tools for genome finishing both take alignments from the BioNano RefAligner as input. The two tools were developed simultaneously but were ultimately found to be useful for distinct applications. We validated AssembleIrysCluster and Stitch using the Tribolium castaneum genome (Richards 2008) because this project has genetic map resources (Lorenzen 2005) that offer independent corroboration. Genetic maps were not used as input for Stitch. Instead, the super scaffolds created by Stitch were compared to the order of scaffolds within chromosome linkage groups (ChLGs) predicted by the genetic map.