2.3 Hi-C scaffolding
The contigs generated by the preliminary genome assembly required gap
filling and anchoring onto putative chromosomes. The initial contigs were
passed to the Hi-C assembly workflow, which used chromatin interaction
signals to construct chromosomes. In brief, the Hi-C read pairs were
aligned to the assembly using BWA-MEM (version 0.7.10-r789) (Li & Durbin,
2009) to identify putative Hi-C junctions. Read pairs in which both reads
mapped uniquely to the assembly were defined as valid interaction pairs
and were used for the Hi-C scaffolding. Invalid read pairs, including
self-ligation, non-ligation, and dangling-end products, were filtered out
using HiC-Pro (version 2.10.0) (Servant et al., 2015).
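The distinction between valid and invalid read pairs can be illustrated with a minimal sketch. It is not the HiC-Pro implementation; it assumes a simplified rule in which pairs spanning different contigs or different restriction fragments are valid, and pairs within one fragment are classified by read orientation. The names `classify_pair` and `cut_sites`, and the category labels, are hypothetical.

```python
from bisect import bisect_right


def fragment_index(sites, pos):
    """Index of the restriction fragment containing pos (sites must be sorted)."""
    return bisect_right(sites, pos)


def classify_pair(r1, r2, cut_sites):
    """Classify one uniquely mapped read pair (simplified, assumed rules).

    r1 and r2 are (contig, position, strand) tuples with strand "+" or "-".
    cut_sites maps each contig name to a sorted list of restriction-site
    coordinates.
    """
    (c1, p1, s1), (c2, p2, s2) = r1, r2
    # Reads on different contigs always span different fragments.
    if c1 != c2:
        return "valid"
    if fragment_index(cut_sites[c1], p1) != fragment_index(cut_sites[c1], p2):
        return "valid"
    # Same fragment: order the two reads by position before checking strands.
    if p1 > p2:
        s1, s2 = s2, s1
    if s1 == "+" and s2 == "-":
        return "dangling_end"   # inward-facing reads, no ligation junction
    if s1 == "-" and s2 == "+":
        return "self_ligation"  # outward-facing reads, fragment circularized
    return "other_invalid"      # same-strand pair within one fragment
```

Only pairs labelled `valid` would be carried forward to scaffolding under this simplified scheme.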
During the Hi-C reassembly, the contigs were broken into 50 kb fragments,
and regions that did not match the initial assembly or could not be
restored were flagged as candidate error regions. The genome then
underwent a final round of error correction, during which the gaps were
filled. The reassembled and corrected contigs were divided into
ordered, oriented, and anchored groups by LACHESIS (Burton et al., 2013)
with the parameters CLUSTER_MIN_RE_SITES = 33,
CLUSTER_MAX_LINK_DENSITY = 2, CLUSTER_NONINFORMATIVE_RATIO = 2,
ORDER_MIN_N_RES_IN_TRUNK = 29, and ORDER_MIN_N_RES_IN_SHREDS = 29,
yielding putative chromosomes. The gaps generated during the Hi-C
assembly were filled using LR_Gapcloser (version 1.1)
(Xu et al., 2019).
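The 50 kb fragmentation step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `split_contig` is hypothetical, coordinates are 0-based and half-open, and a final fragment shorter than 50 kb is kept as is, a detail the workflow description does not specify.

```python
def split_contig(name, length, size=50_000):
    """Split a contig of the given length into consecutive fixed-size windows.

    Returns a list of (name, start, end) tuples with 0-based, half-open
    coordinates. The last window is shorter when length is not a multiple
    of size (an assumption of this sketch).
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [
        (name, start, min(start + size, length))
        for start in range(0, length, size)
    ]
```

Each window can then be placed independently with the Hi-C signal, and windows that cannot be restored to their original position mark candidate error regions.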
2.4 Genome quality evaluation and repeats analysis
The genome of C. fluminea was searched against the actinopterygii
database (odb9), comprising 978 conserved core genes, using BUSCO (version
3.0) (Simao, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015) to
evaluate the completeness of the genome. The CEGMA database, comprising
458 conserved core eukaryotic genes, was searched in the same way
using CEGMA (version 2.5) (Parra, Bradnam, & Korf, 2007). Additionally,
as a further evaluation, the Illumina short reads were mapped to the
assembled genome of the clam using BWA-MEM (version
0.7.10-r789) (Li & Durbin, 2009).
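Completeness under such an evaluation is usually reported as the share of core genes recovered. The sketch below shows that arithmetic only; the function names `percent` and `completeness_summary` are hypothetical, the default total of 978 is the database size stated above, and no counts from this study are assumed.

```python
def percent(part, total):
    """Return part as a percentage of total, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * part / total, 2)


def completeness_summary(complete, fragmented, total=978):
    """Summarize core-gene recovery as percentages.

    complete and fragmented are counts of core genes found in full or in
    part; genes in neither category are counted as missing.
    """
    missing = total - complete - fragmented
    if complete < 0 or fragmented < 0 or missing < 0:
        raise ValueError("counts are inconsistent with total")
    return {
        "complete_pct": percent(complete, total),
        "fragmented_pct": percent(fragmented, total),
        "missing_pct": percent(missing, total),
    }
```

The same `percent` helper applies to the short-read evaluation, where the mapping rate is the number of mapped reads divided by the total number of reads.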
Repeats were grouped into two main types: retrotransposons (Class I in
our analysis) and DNA transposons (Class II in our analysis). We
constructed a de novo repeat database using LTR_FINDER
(version 1.05) (Xu & Wang, 2007) and RepeatScout (version 1.0.5)
(Price, Jones, & Pevzner, 2005), and then identified and classified the
repeats with PASTEClassifier (version 1.0) (Hoede et
al., 2014). The species-specific repeat library for the clam genome was
generated by merging our predictions with Repbase (19.06)
(Bao, Kojima, & Kohany, 2015). The LTR features of the clam genome were
then characterized using RepeatMasker (version 4.0.6) (Tarailo-Graovac & Chen,
2009).
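Aggregating the predicted repeats with Repbase amounts to combining FASTA libraries. A minimal sketch is given below; the function names `read_fasta` and `merge_libraries` are hypothetical, and dropping exact duplicate sequences is an assumption of this sketch rather than a documented step of the pipeline, which may simply concatenate the two libraries.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_libraries(*libraries):
    """Combine repeat libraries, keeping the first copy of each sequence.

    Each library is an iterable of (header, sequence) pairs. Sequences are
    compared case-insensitively (assumed deduplication rule).
    """
    seen = set()
    merged = []
    for records in libraries:
        for header, seq in records:
            key = seq.upper()
            if key in seen:
                continue
            seen.add(key)
            merged.append((header, seq))
    return merged
```

The merged records could then be written back out as a single FASTA file and supplied to RepeatMasker as a custom library.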