Haplotype inference to illustrate genomic divergence
The portion of the genome unaffected by gene flow increases as
speciation proceeds (Feder, Egan, & Nosil, 2012; Feder, Flaxman, Egan,
Comeault, & Nosil, 2013; Nadeau et al., 2013; Wu, 2001; Wu & Ting,
2004). Because subspecies lie somewhere along the speciation continuum, we
asked how differentiation is distributed across the genome. This pattern can be
visualized by inferring haplotypes of loci and comparing the haplotype
networks. The method developed by He et al. (2019) was used to infer
haplotypes. This method uses SNP linkage information in each short-read
pair to infer haplotypes and the frequency of each haplotype in the
population, using an expectation-maximization (EM) algorithm (Bilmes,
1998; Dempster, Laird, & Rubin, 1977). If two adjacent SNPs were not
jointly covered by any read pair, we split the gene into segments, with
the midpoint between the two SNPs defined as the breakpoint between
consecutive segments. The accuracy of this method in inferring
haplotypes has been validated by sequencing individuals using the Sanger
method (He et al., 2019). We selected eight populations representing
different subspecies and different regions for inferring haplotypes: two
eucalyptifolia (CA and DW), two australasica (AK and BS), and four
marina populations (BB, LS, TN, and SY). Genes were
split into 454 linked segments and haplotypes were inferred for each
segment (Table S2). Before constructing haplotype networks, we filtered
out segments shorter than 100 bp or with missing data. For
each of the 231 retained segments, we computed a haplotype network using
the NETWORK software (Polzin & Daneshmand, 2003). For some segments,
the sequences were searched with BLAST against the National Center for
Biotechnology Information (NCBI) database for functional annotation.
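
The segmentation, length filter, and EM frequency estimate described above can be illustrated with a minimal Python sketch. It is not the implementation of He et al. (2019): the function names, the half-open coordinate convention, the integer midpoint, the representation of each read pair as a mapping from SNP index to observed allele, and the fixed list of candidate haplotypes are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Segment = Tuple[int, int]  # half-open interval [start, end); an assumed convention


def split_into_segments(snp_positions: List[int],
                        linked_pairs: Set[Tuple[int, int]],
                        gene_start: int, gene_end: int) -> List[Segment]:
    """Break a gene at the midpoint of every adjacent SNP pair that no read pair links.

    snp_positions must be sorted; linked_pairs holds (left, right) positions of
    adjacent SNPs that are jointly covered by at least one read pair.
    """
    segments = []
    start = gene_start
    for left, right in zip(snp_positions, snp_positions[1:]):
        if (left, right) not in linked_pairs:
            midpoint = (left + right) // 2  # integer midpoint (assumption)
            segments.append((start, midpoint))
            start = midpoint
    segments.append((start, gene_end))
    return segments


def filter_segments(segments: List[Segment], min_length: int = 100) -> List[Segment]:
    """Keep only segments at least min_length bp long."""
    return [(s, e) for s, e in segments if e - s >= min_length]


def em_haplotype_frequencies(candidates: List[str],
                             reads: List[Dict[int, str]],
                             n_iter: int = 50) -> Dict[str, float]:
    """Generic EM estimate of haplotype frequencies from partially informative reads.

    Each read maps SNP index -> observed allele; a read is compatible with a
    candidate haplotype if the alleles agree at every SNP the read covers.
    """
    freqs = {h: 1.0 / len(candidates) for h in candidates}
    for _ in range(n_iter):
        expected = {h: 0.0 for h in candidates}
        for read in reads:
            compatible = [h for h in candidates
                          if all(h[i] == allele for i, allele in read.items())]
            total = sum(freqs[h] for h in compatible)
            if total == 0:
                continue  # read matches no candidate with non-zero frequency
            # E-step: share the read among compatible haplotypes
            for h in compatible:
                expected[h] += freqs[h] / total
        norm = sum(expected.values())
        if norm == 0:
            break
        # M-step: renormalize expected counts into frequencies
        freqs = {h: expected[h] / norm for h in candidates}
    return freqs


# Hypothetical example: three SNPs, only the last two linked by a read pair.
segments = split_into_segments([10, 50, 300], {(50, 300)}, 0, 400)
kept = filter_segments(segments)
freqs = em_haplotype_frequencies(["AC", "GT"],
                                 [{0: "A", 1: "C"}] * 3 + [{0: "G", 1: "T"}])
```

In this toy example the gene is broken once, between the unlinked SNPs at positions 10 and 50, and the short leading segment is then discarded by the 100 bp filter. When every read covers all SNPs of a segment, the EM estimate reduces to the observed proportion of each haplotype among the reads.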