1 | INTRODUCTION
Advances in DNA sequencing during the past decade have led to the genomes of diverse organisms being successfully sequenced and assembled (de Man et al., 2016; Iorizzo et al., 2016; Jarvis et al., 2017; Lien et al., 2016). High-quality genome assembly requires high levels of contiguity, which enable new insights into the evolution of genome structure and increase the gene space completeness of the assembly (Berlin et al., 2015; Gordon et al., 2016; Koren et al., 2013; Loman, Quick, & Simpson, 2015). However, the presence of repetitive regions in a genome poses a major challenge to the assembly of highly contiguous genomes.
Mate-pair sequencing involves the generation of long-insert paired-end DNA libraries whose inserts can span long repeat regions of several kilobase pairs. This is useful for many sequencing applications, including de novo sequencing, genome finishing, structural variant detection, and identification of complex genomic rearrangements (Maretty et al., 2017; Smadbeck et al., 2018; Tan, Tan, & Cheng, 2020; van Heesch et al., 2013; Wetzel, Kingsford, & Pop, 2011). During mate-pair library preparation, DNA is fragmented, allowing DNA of a desired length to be isolated. The ends of the DNA fragments are then biotinylated, and the fragments are circularized. Next, the circularized DNA is sheared into smaller fragments (400-600 bp). Biotinylated fragments are enriched via the biotin tag, and adapters are ligated. The resulting fragments are then ready for cluster generation and sequencing. Although this technology does not produce long reads, it is able to span repeat regions if the insert size is sufficiently large. Combining data generated from mate-pair library sequencing with those from short-insert paired-end reads provides a powerful combination of read lengths for maximal sequencing coverage across the genome, leading to a dramatic improvement in the assembly of large genomes.
Mate pairs with small, medium, and large insert sizes are usually used to scaffold contigs in order to improve genome assemblies (Pop, Phillippy, Delcher, & Salzberg, 2004).
Third-generation long-read sequencing technologies, such as PacBio (Rhoads & Au, 2015) and Nanopore (Jain, Olsen, Paten, & Akeson, 2016), increase read lengths to overcome the main challenge of sequencing repetitive regions: reads must be long enough to anchor in nonrepetitive sequences and span the repeats. A repeat can be spanned, and the region subsequently assembled, if the read length is substantially longer than the repeat region (Bongartz, 2019). Third-generation long reads are also used for scaffolding during genome assembly (Boetzer & Pirovano, 2014).
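The spanning condition described above can be illustrated with a minimal sketch. The function name, the half-open coordinate convention, and the `min_anchor` threshold are assumptions made for illustration and are not taken from any of the cited tools.

```python
def read_spans_repeat(read_start, read_end, repeat_start, repeat_end, min_anchor=50):
    """Return True if a read covers a repeat and anchors on both sides.

    Coordinates are half-open [start, end) on the same sequence.
    min_anchor is a hypothetical threshold: the number of nonrepetitive
    bases the read must cover on each flank of the repeat.
    """
    left_anchor = repeat_start - read_start   # unique bases before the repeat
    right_anchor = read_end - repeat_end      # unique bases after the repeat
    return left_anchor >= min_anchor and right_anchor >= min_anchor


# A 10 kb read over a 6 kb repeat, with 2 kb of unique sequence on each flank.
spanning = read_spans_repeat(0, 10_000, 2_000, 8_000, min_anchor=500)

# A read that starts only 100 bp upstream of the same repeat.
poorly_anchored = read_spans_repeat(1_900, 9_000, 2_000, 8_000, min_anchor=500)
```

Under these assumptions, the same check could be applied to a mate-pair insert by passing the outer coordinates of the two reads instead of a single read.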
High-quality DNA, which is crucial for mate-pair sequencing, can only be obtained from material that is both fresh and abundant. Similarly, high-molecular-weight DNA (>50 kb) is needed to realize the full potential of third-generation sequencing. The lack of suitable starting material limits the choice of sequencing technology and affects the quality of the obtained data. For example, in a comparative genomics study of ruminants, the genomes of several species, such as mountain nyala, common eland, bongo, and oribi, could only be assembled at the contig level because the DNA samples were degraded and not suitable for constructing mate-pair libraries (Chen et al., 2019). Another example of poor-quality DNA is ancient DNA (aDNA) (Stoneking & Krause, 2011), which mostly consists of very short fragments of between 44 and 172 bp (Sawyer, Krause, Guschanski, Savolainen, & Paabo, 2012).
Although mate-pair and third-generation sequencing cannot be applied to degraded or ancient samples, Grau, Hackl, Koepfli, and Hofreiter (2018) invented a method that generates in silico mate-pair libraries using a reference genome from a closely related species, thereby helping to assemble genomes at the scaffold level. To improve genome contiguity, they developed cross-species scaffolding, a new pipeline that imports long-range distance information directly into a de novo assembly process by constructing mate-pair libraries in silico. After processing, cleaned reads of the target species are mapped to the repeat-masked reference genome, and a consensus sequence is computed. Next, read pairs of mate-pair libraries are generated from the consensus. Finally, the cleaned reads and in silico mate pairs are used to assemble the genome with SOAPdenovo2 (Luo et al., 2012). Application of this in silico mate-pair method resulted in a dramatic improvement in contiguity and accuracy, as demonstrated by the assembly of two primate genomes based on just ∼30x coverage of shotgun sequencing data (Grau et al., 2018). A drawback of this approach is the introduction of assembly chimeras (Grau et al., 2018). Furthermore, the phylogenetic distance, quality, and completeness of the reference genome, as well as its overall synteny and transposable element content, influence the final number of misassemblies. Methods for reducing misassemblies and for choosing the best references from which to generate in silico mate pairs have yet to be tested.
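The step in which read pairs are generated from the consensus can be sketched as follows. This is a simplified illustration and not the implementation of Grau et al. (2018): the function names, the fixed sampling step, the forward-reverse orientation of the two reads, and the filter on uncalled bases (`N`) are all assumptions.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N stays N)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def in_silico_mate_pairs(consensus, insert_size, read_len, step, max_n_frac=0.0):
    """Yield (start, read1, read2) tuples sampled from a consensus sequence.

    read1 is taken from the left end of each insert; read2 is the reverse
    complement of the right end, so the reads face each other (an assumed
    orientation). Pairs in which the fraction of uncalled bases exceeds
    max_n_frac are skipped (an assumed quality filter).
    """
    if 2 * read_len > insert_size:
        raise ValueError("insert_size must be at least twice read_len")
    for start in range(0, len(consensus) - insert_size + 1, step):
        end = start + insert_size
        left = consensus[start:start + read_len]
        right = consensus[end - read_len:end]
        n_frac = (left + right).upper().count("N") / (2 * read_len)
        if n_frac > max_n_frac:
            continue
        yield start, left, reverse_complement(right)


# Toy consensus; real inserts would be kilobases long.
pairs = list(in_silico_mate_pairs("AAAACCCCGGGGTTTT", insert_size=12, read_len=4, step=4))
```

Repeating the call with several values of `insert_size` would give libraries with small, medium, and large inserts, which an assembler could then use for scaffolding alongside the original shotgun reads.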
In addition to the in silico mate-pair method, similarity between the target and reference species can also be exploited in what is referred to as the reference-guided approach, which often leads to more complete and improved genome assemblies (Bao, Jiang, & Girke, 2014; Pop et al., 2004; Schneeberger et al., 2011). In contrast to the in silico method, which generates mate pairs prior to genome assembly, reference-guided approaches, such as Chromosomer (Tamazian et al., 2016), Ragout (Kolmogorov, Raney, Paten, & Pham, 2014), and RaGOO (Alonge et al., 2019), use a single reference to order, orient, and join contigs and long reads. The in silico mate-pair method is therefore more flexible than the reference-guided approach. For example, high-quality, conserved mate pairs can be selected by comparing two or more reference genomes to reduce misassemblies in the target genome assembly.
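One way to picture the selection of conserved mate pairs is a filter that keeps a pair only if every reference supports it. The sketch below is an assumption about what such a criterion could look like and is not a description of the method used in this study: the `Placement` record, the orientation flag, and the 10% tolerance on insert size are hypothetical.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    """Where one mate pair maps on one reference (hypothetical record)."""
    seq_id: str               # chromosome or scaffold holding both reads
    start: int                # outer start coordinate of the pair
    end: int                  # outer end coordinate of the pair
    proper_orientation: bool  # True if the reads face each other as expected


def is_conserved(placements, expected_insert, tolerance=0.1):
    """Return True if all references agree with the expected insert size.

    placements holds one Placement per reference genome, or None where the
    pair could not be placed on a single sequence. The pair is kept only if
    it is placed on every reference, in the expected orientation, and with
    an observed span within tolerance * expected_insert of the expectation.
    """
    if len(placements) < 2:
        raise ValueError("at least two references are needed for a comparison")
    for placement in placements:
        if placement is None or not placement.proper_orientation:
            return False
        observed = placement.end - placement.start
        if abs(observed - expected_insert) > tolerance * expected_insert:
            return False
    return True


# A 5 kb pair that keeps its spacing on a second reference.
ref_a = Placement("chr1", 1_000, 6_000, True)
ref_b = Placement("chrA", 52_000, 57_100, True)
conserved = is_conserved([ref_a, ref_b], expected_insert=5_000)
```

A stricter or looser `tolerance` would trade the number of retained pairs against the risk of importing rearrangements that are specific to one reference.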
In this study, we attempted to optimize the use of the in silico method. First, we investigated how the phylogenetic distance between a reference and a target affects the quality of genome assembly. We then tested whether generating conserved mate pairs by comparing multiple reference genomes improves the quality of genome assembly. Finally, we tested the effect of the optimized in silico mate-pair strategy on degraded samples and simulated ancient DNA data.