3.2 K-mer analysis and genome assembly
Before k-mer analysis, Illumina reads from the genome survey were mapped to the Nucleotide Sequence Database (NT); 86.38% of reads were successfully matched, indicating that the data were reliable enough for further analysis. After abnormal k-mers were filtered out, 187,447,882,456 k-mers remained. The main peak in the k-mer distribution plot was at a depth of 115 (Supporting Information Figure S2), and the genome size was calculated as ~1.64 Gb according to the k-mer formula. K-mer depths of 58 and 230, approximately half and twice the main peak depth, were used as the starting points for estimating the heterozygous and repetitive sequence content, respectively. The clam genome was estimated to have a heterozygosity rate of 2.41% and a repeat ratio of 64.55%. It was therefore considered a large, complex genome with high heterozygosity and a high proportion of repetitive sequence.
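The k-mer formula is not written out in the text. A minimal sketch, assuming the commonly used form (genome size = total k-mer count / main peak depth), reproduces the estimate from the reported values. The function and variable names are illustrative and not taken from the original analysis.

```python
# Reported values from the k-mer analysis above.
TOTAL_KMERS = 187_447_882_456  # k-mers retained after filtering
PEAK_DEPTH = 115               # depth of the main peak in the k-mer plot


def estimate_genome_size(total_kmers: int, peak_depth: float) -> float:
    """Return the estimated genome size in bp.

    Assumes the simple formula: size = total k-mer count / main peak depth.
    """
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return total_kmers / peak_depth


genome_size_bp = estimate_genome_size(TOTAL_KMERS, PEAK_DEPTH)

# Heterozygous k-mers are expected near half the main peak depth,
# duplicated (repetitive) k-mers near twice the main peak depth.
het_depth = PEAK_DEPTH / 2   # 57.5, reported as 58
rep_depth = PEAK_DEPTH * 2   # 230, as reported

print(f"Estimated genome size: {genome_size_bp / 1e9:.2f} Gb")
```

Under this assumption the result is about 1.63 Gb, close to the reported ~1.64 Gb; the small gap may stem from rounding of the peak depth or from details of the formula that are not given in the text.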
The filtered PacBio subreads were error-corrected with Canu, yielding 15,031,088 subreads for subsequent assembly. Assembly with Canu and SMARTdenovo, followed by polishing with Racon and Pilon, produced 4,347 contigs. The final Asian clam genome assembly was 1.52 Gb with a contig N50 of 603.64 Kb. The PacBio assembly was slightly smaller than the genome size estimated by k-mer analysis (1.64 Gb), which is consistent with the typical relationship between assembled and estimated genome sizes. The assembly corresponds to approximately 93% of the estimated size, indicating that most of the Asian clam genome sequence was captured and assembled. The accuracy of the assembly still needs to be verified using Hi-C technology.
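For readers unfamiliar with the contig N50 statistic, the following sketch shows how it is conventionally computed and how the assembled fraction follows from the reported sizes. The contig lengths are a hypothetical toy example, not data from this assembly, and the function name is illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical toy contig lengths (bp), for illustration only.
toy_contigs = [800, 600, 400, 200, 100]
toy_n50 = n50(toy_contigs)

# Fraction of the k-mer-based estimate covered by the assembly,
# using the sizes reported in the text (in Gb).
assembled_gb = 1.52
estimated_gb = 1.64
assembled_fraction = assembled_gb / estimated_gb

print(f"Toy N50: {toy_n50} bp")
print(f"Assembled fraction: {assembled_fraction:.1%}")
```

In the toy example the total is 2,100 bp, and the two longest contigs (800 + 600 bp) already exceed half of it, so the N50 is 600 bp.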