Haplotype inference and population structure mapping
Haplotypes of genes were inferred following an expectation-maximization
algorithm (Bilmes, 1998; Dempster, Laird, & Rubin, 1977). Inference was
performed with an in-house Perl script (available from the above GitHub
repository) that uses short reads to extract SNP linkage information.
If two adjacent SNPs were not covered by any read pair, we broke the
gene into segments at that point, taking the midpoint between the two
SNPs as the breakpoint between consecutive segments. Because the
inference process uses a maximum likelihood method to compare haplotype
alternatives, it tends to yield short segments when a large number of
populations are considered. Therefore, we selected
eight populations representing different varieties and different regions
for inferring haplotypes: two A. m. eucalyptifolia (euCA and
euDW), two A. m. australasica (auAK and auBS), and four A.
m. marina (maBB, maLS, maTN, and maSY). In total, genes were split into
454 linked segments, and haplotypes were inferred for each segment
(Supplementary Table 3). Before constructing haplotype networks, we
removed segments shorter than 100 bp or containing missing data. For
each of the 231 retained segments, we computed a haplotype network using
the NETWORK software (Polzin & Daneshmand, 2003).
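
The segmentation and filtering rules described above can be illustrated
with a minimal Python sketch. This is not the in-house Perl script. The
function names, the 1-based inclusive coordinates, the rounding of the
midpoint down to an integer, and the representation of read-pair support
as a set of (left SNP, right SNP) position pairs are all assumptions
made for illustration only.

```python
from typing import Iterable, List, Set, Tuple

Segment = Tuple[int, int]  # (start, end), 1-based inclusive (assumed convention)


def split_into_segments(
    gene_start: int,
    gene_end: int,
    snp_positions: Iterable[int],
    linked_pairs: Set[Tuple[int, int]],
) -> List[Segment]:
    """Break a gene wherever two adjacent SNPs share no read pair.

    linked_pairs holds (left, right) positions of adjacent SNPs that are
    covered by at least one read pair (hypothetical input format).
    """
    snps = sorted(snp_positions)
    segments: List[Segment] = []
    seg_start = gene_start
    for left, right in zip(snps, snps[1:]):
        if (left, right) not in linked_pairs:
            # No linkage information: cut at the midpoint of the two SNPs.
            cut = (left + right) // 2  # rounding down is an assumption
            segments.append((seg_start, cut))
            seg_start = cut + 1
    segments.append((seg_start, gene_end))
    return segments


def filter_segments(
    segments: Iterable[Segment],
    with_missing_data: Set[Segment] = frozenset(),
    min_length: int = 100,
) -> List[Segment]:
    """Drop segments shorter than min_length bp or flagged as having missing data."""
    return [
        (start, end)
        for start, end in segments
        if end - start + 1 >= min_length and (start, end) not in with_missing_data
    ]


if __name__ == "__main__":
    # Toy gene of 1000 bp with four SNPs; only the last adjacent pair is linked.
    segs = split_into_segments(1, 1000, [100, 150, 200, 900], {(200, 900)})
    print(segs)
    print(filter_segments(segs))
```

In this toy example the gene is cut between the first and second SNPs
and again between the second and third, and the short middle segment is
then discarded by the 100 bp length filter.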