2.6 Gene family analysis
The protein sequences of 11 dipteran species were selected for phylogenetic analysis: six species of the family Chironomidae (C. striatipennis, Chironomus riparius, Chironomus tentans, Clunio marinus, Polypedilum vanderplanki, Propsilocerus akamusi), three species of the family Culicidae (Anopheles gambiae, Anopheles sinensis, Culex quinquefasciatus), one species of the family Drosophilidae (Drosophila melanogaster) and one species of the family Muscidae (Musca domestica) (Supplementary Table 4, 7). For each gene, a script was used to extract the longest transcript, and OrthoFinder v2.2.6 was then used to identify gene family clusters (Emms & Kelly, 2015, 2019).
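The longest-transcript filtering step can be illustrated with a minimal Python sketch. This is not the script used in the study; it assumes, hypothetically, that each FASTA header carries a `gene=GENE_ID` token identifying the parent gene, and the function names (`parse_fasta`, `gene_from_header`, `longest_isoform_per_gene`) are illustrative only.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_from_header(header):
    """Return the gene ID from a header such as 't1 gene=g1'.

    The 'gene=' token is an assumed convention; if it is absent,
    the first whitespace-separated token is used instead.
    """
    for token in header.split():
        if token.startswith("gene="):
            return token[len("gene="):]
    return header.split()[0]


def longest_isoform_per_gene(records, gene_of=gene_from_header):
    """Keep the longest sequence per gene; ties keep the first one seen."""
    best = {}
    for header, seq in records:
        gene = gene_of(header)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best
```

In practice the gene-to-transcript mapping is often taken from the annotation file (GFF/GTF) rather than from FASTA headers; the `gene_of` argument allows such a lookup to be substituted.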
Single-copy gene families identified by OrthoFinder were used to infer the phylogeny of the 11 species. The protein sequences of each family were aligned with MAFFT v7.407 (Rozewicki et al., 2019). The protein alignments were converted into the corresponding coding sequence (CDS) alignments, which were then trimmed with Gblocks 0.91b (Castresana, 2000). The trimmed alignments were concatenated into a supergene and used as input to IQ-TREE v1.5.5 to construct the phylogenetic tree (Nguyen et al., 2015). Divergence times were estimated with MCMCTree in the PAML package (Yang, 1997). Reference divergence times were obtained from TimeTree. Based on the gene family clustering results and the phylogenetic tree, the expansion and contraction of gene families were inferred, and the significance of each expanded or contracted gene family was evaluated with CAFE v4.2 (De Bie et al., 2006). KEGG annotation of the gene families was performed using the same method as for gene function annotation. Homologous gene pairs were identified by BLAST (Kent, 2002), and collinear regions were identified by MCScanX (Wang et al., 2012).
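The conversion of protein alignments to CDS alignments and their concatenation into a supergene can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, each CDS is assumed to be in frame and to encode exactly the aligned protein (with an optional terminal stop codon), and the Gblocks trimming that takes place between the two steps is omitted.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def backtranslate(aligned_protein, cds):
    """Thread an unaligned CDS onto its aligned protein sequence.

    Each residue is replaced by its codon and each gap '-' by '---'.
    """
    cds = cds.upper()
    # Drop a terminal stop codon, which has no residue in the protein.
    if len(cds) >= 3 and cds[-3:] in STOP_CODONS:
        cds = cds[:-3]
    n_residues = sum(1 for c in aligned_protein if c != "-")
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length does not match the protein length")
    codons, pos = [], 0
    for c in aligned_protein:
        if c == "-":
            codons.append("---")
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


def concatenate(alignments, species):
    """Join per-family alignments into one supergene per species.

    `alignments` is a list of dicts mapping species name to aligned
    sequence; single-copy families have exactly one entry per species.
    """
    parts = {sp: [] for sp in species}
    for aln in alignments:
        if len({len(aln[sp]) for sp in species}) != 1:
            raise ValueError("sequences in one alignment differ in length")
        for sp in species:
            parts[sp].append(aln[sp])
    return {sp: "".join(chunks) for sp, chunks in parts.items()}
```

Because every family is single-copy, each species contributes exactly one sequence per alignment, so the concatenated supergene has the same length for all species.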