3.2 K-mer analysis and genome assembly
Before k-mer analysis, Illumina reads from the genome survey were
aligned against the Nucleotide Sequence Database (NT); 86.38% of reads
matched successfully, indicating that the data were reliable for further
analysis. After abnormal k-mers were filtered out, 187,447,882,456
k-mers were retained. The main peak of the k-mer depth distribution was
at a depth of 115 (Supporting Information Figure S2), and the genome
size was estimated as ~1.64 Gb according to the k-mer formula.
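
As a rough check on the arithmetic, the sketch below applies the basic k-mer relation (genome size = k-mer number / main-peak depth) to the two values reported above. The function name is hypothetical, and the text does not spell out its exact formula, so this is an assumption rather than the authors' calculation. The plain ratio gives about 1.63 Gb, slightly below the reported ~1.64 Gb; the difference may come from a correction or rounding step not described here.

```python
# Values reported in the text.
TOTAL_KMERS = 187_447_882_456  # k-mers retained after filtering
MAIN_PEAK_DEPTH = 115          # depth of the main peak in the k-mer plot


def estimate_genome_size(kmer_count: int, peak_depth: int) -> float:
    """Basic k-mer estimate (assumed form): k-mer count / main-peak depth."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return kmer_count / peak_depth


size_bp = estimate_genome_size(TOTAL_KMERS, MAIN_PEAK_DEPTH)
print(f"Estimated genome size: {size_bp / 1e9:.2f} Gb")
```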
The k-mer depths of 58 and 230, about half and twice the main-peak
depth, marked the starting points for estimating heterozygous and
repetitive sequences, respectively. The clam genome was estimated to
have a heterozygosity rate of 2.41% and a repeat ratio of 64.55%. It was
therefore deemed a large, complex genome with high heterozygosity and a
high repeat content.
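
The relation between the main peak (115) and the two landmark depths (58 and 230) can be sketched as follows. The helper names and the toy histogram are hypothetical. The `repeat_fraction` definition, the share of k-mer occurrences at or above the repeat starting depth, is an assumed convention for illustration and not necessarily the calculation used in this study.

```python
def landmark_depths(main_peak: int) -> tuple[int, int]:
    """Return (heterozygous_start, repeat_start) as half and twice the main peak."""
    het_start = (main_peak + 1) // 2  # half the peak, rounded up: 115 -> 58
    repeat_start = 2 * main_peak      # twice the peak: 115 -> 230
    return het_start, repeat_start


def repeat_fraction(histogram: dict[int, int], repeat_start: int) -> float:
    """Share of k-mer occurrences at depth >= repeat_start (assumed definition).

    histogram maps depth -> number of distinct k-mers observed at that depth.
    """
    total = sum(depth * count for depth, count in histogram.items())
    if total == 0:
        raise ValueError("histogram contains no k-mers")
    repeated = sum(d * c for d, c in histogram.items() if d >= repeat_start)
    return repeated / total


het_start, repeat_start = landmark_depths(115)
print(het_start, repeat_start)  # 58 230

# Toy histogram for illustration only; these are not data from the study.
toy_histogram = {58: 10, 115: 100, 230: 20, 460: 5}
toy_fraction = repeat_fraction(toy_histogram, repeat_start)
```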
The filtered PacBio subreads were error-corrected with Canu, yielding
15,031,088 corrected subreads for subsequent assembly. Assembly with
Canu and SMARTdenovo, followed by polishing with Racon and Pilon,
produced 4,347 contigs. The final Asian clam genome assembly was 1.52 Gb
with a contig N50 of 603.64 kb. The PacBio assembly was slightly smaller
than the size estimated by k-mer analysis (1.64 Gb), as is commonly
observed. This indicates that we captured and assembled most of the
Asian clam genome sequence. The accuracy of the assembly remains to be
verified with Hi-C technology.
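
For readers unfamiliar with the two summary statistics used here, the sketch below shows the standard definition of contig N50 and compares the assembly size with the k-mer estimate. The function name and the toy contig lengths are hypothetical; only the 1.52 Gb and 1.64 Gb figures come from the text.

```python
def n50(lengths: list[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the assembly."""
    if not lengths:
        raise ValueError("no contig lengths given")
    half_total = sum(lengths) / 2
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy contig lengths for illustration only; these are not data from the study.
toy_n50 = n50([10, 8, 6, 4, 2])  # total 30, half 15; 10 + 8 = 18 >= 15
print(toy_n50)  # 8

# Share of the k-mer estimate recovered by the assembly (values from the text).
ASSEMBLY_GB = 1.52
KMER_ESTIMATE_GB = 1.64
assembled_fraction = ASSEMBLY_GB / KMER_ESTIMATE_GB
print(f"{assembled_fraction:.1%}")  # 92.7%
```

On these figures, the assembly covers roughly 93% of the k-mer-based size estimate.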