Genome assembly and annotation
The original draft genomes were assembled from paired-end and mate-pair Illumina libraries (Keeling et al., 2013c). We substantially improved these assemblies through proximity ligation-based scaffolding with HiRise; linkage-map/ALLMAPS-informed corrections and scaffolding; further improvements with LINKS, RAILS, and ABySS-Sealer; and PhylOligo-based removal of contaminant scaffolds (Fig. 1, Tab. 1). All of these tools were developed after the original draft assemblies were prepared. A comparison of scaffold sizes between draft and final genome assemblies is shown in Supp. Fig. 2. The final female and male genome assembly sizes were 223.7 and 224.8 Mb, with N50s/L50s of 16.6 Mb/4 and 16.4 Mb/4, respectively. Gregory et al. (2013) used flow cytometry to estimate a 208 Mb genome size. The non-N portions of the genome assemblies were close to this value: 214.0 Mb for the female assembly and 210.5 Mb for the male assembly. Compared to the draft assemblies, N50 values increased by 26- and 36-fold, and the number of scaffolds decreased by 67 and 75 percent, respectively. Ninety percent of each assembly was contained in the largest 12 (female) and 11 (male) scaffolds. Based on linkage mapping information, these 12 largest scaffolds in the female assembly represent the karyotype of this species (11 AA + neo-XX). The male assembly did not contain a large scaffold representing the neo-Y chromosome.
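For readers unfamiliar with the contiguity statistics reported above, the following minimal Python sketch shows how N50/L50 (and, with a different fraction, the number of scaffolds holding 90% of an assembly) are conventionally computed from a list of scaffold lengths. The function name `nx_lx` and the toy lengths are illustrative assumptions; they are not the scripts or data used in this study.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of scaffold lengths.

    Scaffolds are sorted longest first and summed until the running total
    reaches `fraction` of the assembly size. Nx is the length of the scaffold
    that crosses the threshold; Lx is the number of scaffolds needed.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    # Only reachable through floating-point rounding when fraction is 1.
    return ordered[-1], len(ordered)


# Toy scaffold lengths in Mb, invented for illustration only.
toy_lengths = [30, 25, 20, 10, 8, 4, 2, 1]

n50, l50 = nx_lx(toy_lengths)        # half of the assembly
n90, l90 = nx_lx(toy_lengths, 0.9)   # ninety percent of the assembly
print(f"N50 = {n50} Mb, L50 = {l50}; N90 = {n90} Mb, L90 = {l90}")
```

In these terms, the statement that ninety percent of each assembly lies in the largest 12 (female) and 11 (male) scaffolds corresponds to L90 values of 12 and 11.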
Each step in the assembly process contributed to the improved assemblies, and incremental assembly statistics at each step are shown in Supp. Tab. 1. Chicago HiRise scaffolding dramatically increased contiguity, reducing the number of scaffolds by 56-66%. Hi-C HiRise scaffolding reduced the number of scaffolds by an additional 21%. The linkage map information allowed us to correct misjoins in the HiRise assemblies and join additional scaffolds. Visualizing the linkage map information with ALLMAPS revealed several instances where scaffolds from the Chicago HiRise step had been flipped and/or placed out of order relative to adjacent scaffolds during the Hi-C HiRise step. These errors were apparent in comparison with both the linkage map and the assembly from the other sex, even though both assemblies were based upon the same scaffolding information. An example is shown in Fig. 2. In total, nine of the twelve largest scaffolds were modified (Supp. Fig. 3).
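The idea behind detecting a flipped scaffold from linkage data can be illustrated with a simple concordance score: if physical marker order along a scaffold runs opposite to genetic map order, most marker pairs will be discordant. The sketch below is an assumption-laden illustration of that idea only; it is not the ALLMAPS algorithm, and the function name `orientation_score` and the marker coordinates are hypothetical.

```python
from itertools import combinations


def orientation_score(markers):
    """Score agreement between physical and genetic marker order.

    `markers` is a list of (physical_position_bp, genetic_position_cM) tuples
    for markers on one scaffold. The score is (concordant - discordant) /
    informative pairs, so +1 means fully collinear with the map, -1 means
    fully inverted, and values in between suggest local rearrangements.
    Pairs tied in either coordinate are ignored.
    """
    concordant = discordant = 0
    for (p1, g1), (p2, g2) in combinations(markers, 2):
        product = (p1 - p2) * (g1 - g2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    informative = concordant + discordant
    if informative == 0:
        return 0.0
    return (concordant - discordant) / informative


# Hypothetical markers: (position on scaffold in bp, position on map in cM).
collinear = [(100, 0.0), (500, 1.2), (900, 3.4), (1500, 5.0)]
inverted = [(100, 5.0), (500, 3.4), (900, 1.2), (1500, 0.0)]

print(orientation_score(collinear))  # all pairs agree with the map
print(orientation_score(inverted))   # all pairs disagree: candidate flip
```

A strongly negative score would flag a scaffold as a candidate for reorientation, which would then need to be confirmed by inspecting the marker plot, as was done here with ALLMAPS.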
In one case only, a scaffold from the draft male assembly was flipped and misplaced during the earlier Chicago HiRise step. Based on linkage map information, ALLMAPS joined three scaffolds to make the neo-X in the female assembly, and four scaffolds to make the neo-X in the male assembly. This made the neo-X scaffold the largest scaffold in both final assemblies. The LINKS scaffolding step made only two and six joins, the RAILS step made eight and eleven joins while also filling in 18% and 9% of the existing gaps within scaffolds, and ABySS-Sealer filled in 38% and 47% of the remaining gaps of the female and male genomes, respectively. We then identified and removed contaminant scaffolds with PhylOligo. The contaminant scaffolds from the female and male assemblies were most similar to Serratia spp. and Acinetobacter spp., respectively. Both of these genera in the Gammaproteobacteria have been found in the bark beetle gut bacteriome (Hernández-García et al., 2017). The final assemblies showed good consistency between sexes in both shared synteny and chromosomal arrangement (Supp. Fig. 4), and also contained 95% of the 1367 Insecta orthologous gene set (Insecta_odb10, Creation date: 2020-09-10, Supp. Fig. 5).
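Gap-filling percentages such as those reported above for RAILS and ABySS-Sealer, as well as the non-N assembly sizes, can be derived by counting runs of N in the scaffold sequences before and after a step. The following is a minimal sketch under the assumption that every run of one or more N characters counts as one gap; the helper names and the toy sequences are hypothetical and do not reproduce how either tool reports its own statistics.

```python
import re

# One gap is assumed to be any uninterrupted run of N (either case).
GAP_PATTERN = re.compile(r"[Nn]+")


def gap_stats(sequence):
    """Return (gap_count, gap_bases, non_n_bases) for one sequence."""
    gap_lengths = [m.end() - m.start() for m in GAP_PATTERN.finditer(sequence)]
    gap_bases = sum(gap_lengths)
    return len(gap_lengths), gap_bases, len(sequence) - gap_bases


def percent_gaps_closed(gaps_before, gaps_after):
    """Percentage of the gaps present before a step that are gone after it."""
    if gaps_before == 0:
        raise ValueError("no gaps before the step")
    return 100.0 * (gaps_before - gaps_after) / gaps_before


# Toy scaffold before and after a gap-filling step (invented sequences).
before = "ACGTNNNNACGTNNACGTNNNACGT"
after = "ACGTACGAACGTNNACGTNNNACGT"

gaps_b, gap_bases_b, non_n_b = gap_stats(before)
gaps_a, gap_bases_a, non_n_a = gap_stats(after)
print(f"gaps: {gaps_b} -> {gaps_a}, non-N bases: {non_n_b} -> {non_n_a}")
print(f"{percent_gaps_closed(gaps_b, gaps_a):.1f}% of gaps closed")
```

For a whole assembly, the same counts would be summed over all scaffolds, for example while streaming records from a FASTA file.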
To annotate the genome, we ran three rounds of Maker3, combining evidence from coleopteran proteins and Dendroctonus spp. transcripts with ab initio gene prediction. We identified 13 393 and 13 601 gene models in the female and male genomes, respectively. This represents approximately a 4% increase from the original draft genome annotations. These gene models contained 91% of the Insecta orthologous gene set (Supp. Fig. 5) and 74% shared significant homology to proteins in the UniProtKB/Swiss-Prot 2020_01 database. Repetitive elements occupied approximately 23% and 20% of the female and male genome assemblies, respectively (Supp. Tab. 2).