2.3 Hi-C scaffolding
The contigs generated by the preliminary genome assembly still required gap filling and anchoring to putative chromosomes. The initial contigs were passed into the Hi-C assembly workflow, in which chromatin interaction signals were used to construct chromosomes. In brief, the Hi-C read pairs were aligned to the assembly using BWA-MEM (version 0.7.10-r789) (Li & Durbin, 2009), and uniquely mapped read pairs were used to identify the putative Hi-C junctions. Read pairs that mapped uniquely to the assembly were defined as valid interaction pairs and were used for Hi-C scaffolding. Invalid read pairs, including self-ligation, non-ligation, and dangling-end products, were filtered out using HiC-Pro (version 2.10.0) (Servant et al., 2015).
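The read-pair filtering described above can be illustrated with a minimal sketch. This is not the HiC-Pro implementation. It is a simplified classification that assumes each mate has already been assigned to a restriction fragment, and the `Mate` fields and category labels are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mate:
    """One mate of a Hi-C read pair (hypothetical, simplified record)."""
    contig: str
    pos: int        # leftmost mapped position on the contig
    strand: str     # "+" or "-"
    fragment: int   # index of the restriction fragment on that contig
    unique: bool    # True if the mate maps to a single location


def classify_pair(a: Mate, b: Mate) -> str:
    """Assign a read pair to a simplified Hi-C category."""
    # Pairs with a non-unique mate cannot be used as interaction evidence.
    if not (a.unique and b.unique):
        return "not_unique"
    # Mates on different fragments are treated as valid interactions.
    if a.contig != b.contig or a.fragment != b.fragment:
        return "valid"
    # Both mates lie on the same fragment: inspect their orientation.
    left, right = sorted((a, b), key=lambda m: m.pos)
    if left.strand == "+" and right.strand == "-":
        return "dangling_end"    # mates face each other (no ligation)
    if left.strand == "-" and right.strand == "+":
        return "self_ligation"   # mates face away (fragment circularised)
    return "other_invalid"


def keep_valid(pairs):
    """Return only the pairs classified as valid interactions."""
    return [(a, b) for a, b in pairs if classify_pair(a, b) == "valid"]
```

Only the pairs returned by `keep_valid` would be passed on to scaffolding in this simplified picture. A real pipeline also applies insert-size and duplicate filters, which are omitted here.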
During the Hi-C reassembly, the contigs were broken into 50 kb fragments, and regions that were inconsistent with the initial assembly or could not be restored were listed as candidate error regions. The genome then underwent a final round of error correction, during which gaps were filled. The reassembled and corrected contigs were grouped, ordered, oriented, and anchored by LACHESIS (Burton et al., 2013) with the parameters CLUSTER_MIN_RE_SITES = 33, CLUSTER_MAX_LINK_DENSITY = 2, CLUSTER_NONINFORMATIVE_RATIO = 2, ORDER_MIN_N_RES_IN_TRUNK = 29, and ORDER_MIN_N_RES_IN_SHREDS = 29, yielding the putative chromosomes. Gaps generated during the Hi-C assembly were filled using LR_Gapcloser (version 1.1) (Xu et al., 2019).
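The 50 kb fragmentation step mentioned above can be sketched as follows. This is an illustrative helper only: the function name `split_contig` and the half-open coordinate convention are assumptions, and the sketch does not reproduce how the actual workflow handled the final partial fragment.

```python
def split_contig(name, length, size=50_000):
    """Split a contig into consecutive windows of at most `size` bp.

    Returns (contig name, start, end) tuples with 0-based, half-open
    coordinates. The last window may be shorter than `size`.
    """
    if length < 0 or size <= 0:
        raise ValueError("length must be >= 0 and size must be > 0")
    return [
        (name, start, min(start + size, length))
        for start in range(0, length, size)
    ]
```

For example, a hypothetical 120 kb contig yields two full 50 kb fragments followed by one 20 kb fragment.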
2.4 Genome quality evaluation and repeats analysis
The genome of C. fluminea was assessed with BUSCO (version 3.0) (Simao, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015) against the actinopterygii database (odb9), which comprises 978 conserved core genes. The conserved eukaryotic genes in the database were searched for in the clam assembly to evaluate the completeness of the genome. The CEGMA database, which comprises 458 conserved core eukaryotic genes, was searched in the same way using CEGMA (version 2.5) (Parra, Bradnam, & Korf, 2007). As a further evaluation, the Illumina short reads were mapped to the assembled clam genome using BWA-MEM (version 0.7.10-r789) (Li & Durbin, 2009).
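Completeness assessments of this kind are usually summarised as percentages of the core gene set. A minimal sketch of that arithmetic is given below. The function name and dictionary keys are hypothetical, and the counts in the usage note are placeholders rather than results from this study.

```python
def completeness_summary(single, duplicated, fragmented, missing):
    """Convert core-gene category counts into percentages.

    "complete" is the sum of single-copy and duplicated genes.
    All percentages are rounded to one decimal place.
    """
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("at least one gene must be counted")

    def pct(n):
        return round(100.0 * n / total, 1)

    return {
        "complete": pct(single + duplicated),
        "single_copy": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
        "n_genes": total,
    }
```

With placeholder counts of 80 single-copy, 10 duplicated, 5 fragmented, and 5 missing genes, `completeness_summary(80, 10, 5, 5)` reports 90.0% complete genes out of 100.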
Repeats fall into two main types: retrotransposons (Class I in our analysis) and transposons (Class II in our analysis). We constructed a specific repeat database for repeat prediction using LTR_FINDER (version 1.05) (Xu & Wang, 2007) and RepeatScout (version 1.0.5) (Price, Jones, & Pevzner, 2005), and the predicted repeats were then identified and classified with PASTEClassifier (version 1.0) (Hoede et al., 2014). The species-specific repeat library for the clam genome was generated by combining our predictions with Repbase (19.06) (Bao, Kojima, & Kohany, 2015). The LTR characteristics of the clam genome were then analysed using RepeatMasker (version 4.0.6) (Tarailo-Graovac & Chen, 2009).
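The aggregation of de novo predictions with a known repeat database can be sketched as a merge of two FASTA libraries. The text does not describe how the merge was performed, so the rule used here (skip exact duplicate sequences and repeated names) is an assumption, and `parse_fasta` and `merge_libraries` are hypothetical helper names.

```python
def parse_fasta(text):
    """Parse FASTA text into an ordered {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record name.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def merge_libraries(*libraries):
    """Merge repeat libraries, keeping the first copy of each sequence.

    Assumed rule: a record is skipped if its name or its exact sequence
    has already been added from an earlier library.
    """
    merged = {}
    seen_sequences = set()
    for library in libraries:
        for name, seq in library.items():
            if name in merged or seq in seen_sequences:
                continue
            seen_sequences.add(seq)
            merged[name] = seq
    return merged
```

Passing the de novo library first gives its records priority over identical entries in the reference library. The merged dictionary could then be written back to FASTA for use as a custom library.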