Supplemental materials for: Ebola virus epidemiology, transmission, and evolution during seven months in Sierra Leone

#Extended Experimental Procedures

##Ethical and Safety Approvals

This work has been approved by Institutional Review Boards in Sierra Leone (Sierra Leone Ethics and Scientific Review Committee, SLESRC) and the United States (Harvard Committee on the Use of Human Subjects, CUHS, the CDC’s Human Research Protection Office, HSPO). As part of the EVD outbreak response and surveillance efforts, residual human clinical samples were collected under a waiver of consent granted by SLESRC and CUHS, and the EBOV sequencing work has received non-human subjects research determination by CUHS and HSPO. The Sierra Leone Ministry of Health and Sanitation approved shipment of non-infectious, inactivated samples collected from EVD patients to Broad Institute and Harvard University for viral sequencing. The EBOV-related research and laboratory safety protocols are registered with the Committee of Microbiological Safety (COMS) at Harvard University, and the viral sequencing work is registered with the Institutional Biosafety Committee at Broad Institute. All work with infectious or potentially infectious material was performed at the CDC Viral Special Pathogens Branch in Atlanta, GA, under biosafety level 4 (BSL-4) conditions. Our work was not deemed to be dual-use research of concern.

##Ebola Virus Makona Genome Assembly and Analysis

The viral assembly pipeline began by depleting paired-end reads from each sample of human and other contaminants using best match tagger (BMTagger) (Kirill Rotmistrovsky, Richa Agarwala, BMTagger: Best Match Tagger for removing human reads from metagenomics datasets, 2011. ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/bmtagger/) and the nucleotide basic local alignment search tool (BLASTN) (Altschul et al., 1990). PCR duplicates were removed using a custom modification to Vicuna, M-Vicuna (a custom modification to Vicuna, Yang et al., 2012). The resulting "de-identified" metagenomic datasets were deposited in sequence read archive (SRA, BioProject IDs PRJNA257197 and PRJNA283385). Next, reads were filtered to all members of the Ebolavirus genus (all ebolaviruses including EBOV) using LASTAL (Kiełbasa et al., 2011), quality-trimmed with Trimmomatic (Bolger et al., 2014), and further de-duplicated with PRINSEQ (Schmieder et al., 2011).

The filtered and trimmed reads were subsampled to 100,000 pairs, if available, and de novo assembled using Trinity (Grabherr et al., 2011). Subsequently, reference-assisted assembly improvements (contig scaffolding, gap-filling, etc.) were performed with the Viral Finishing and Annotation Toolkit (V-FAT, http://www.broadinstitute.org/scientific-community/science/projects/viral-genomics/v-fat), which relies on MOSAIK (Lee et al., 2014) and multiple sequence comparison by log expectation (MUSCLE) (Edgar, 2004). Each sample's reads were aligned to its de novo assembly using Novoalign (http://novocraft.com/products/novoalign/), and any remaining duplicates were removed using Picard with MarkDuplicates command (http://broadinstitute.github.io/picard). Variant positions in each assembly were identified using genome analysis toolkit (GATK, McKenna et al., 2010) insertions and deletions realinger (IndelRealigner) and UnifiedGenotyper (DePristo 2011, Van der Auwera 2013) on the read alignments. The assembly was refined to represent the major allele at each variant site, and any positions supported by fewer than three reads were changed to N (4-way ambiguity). This align-call-refine cycle was iterated twice, to minimize reference bias in the assembly.

Intrahost variants (iSNVs) were called from each sample's read alignments using V-Phaser2 (Yang et al., 2013) and subjected to an initial set of filters: variant calls with fewer than five forward or reverse reads or more than a 10-fold strand bias were eliminated. iSNVs were also removed if there was more than a 5-fold difference between the strand bias of the variant call and the strand bias of the reference call. Variant calls that passed these filters were additionally subjected to a 0.5% frequency filter. The final list of iSNVs contains only variant calls that passed all filters in two separate library preparations. Annotated iSNV calls are available in variant call format (VCF) and tabular formats (Data S1). These files infer 100% allele frequencies for all samples at an iSNV position without intrahost variation within the sample, but a clear consensus call during assembly. Annotations were computed with the effect of single nucleotide polymorphisms (SnpEff) program (Cingolani et al., 2012).

Our Linux-based software pipeline is publicly available at https://github.com/broadinstitute/viral-ngs (Park et al., 2015). This pipeline includes command-line tools for each of the above steps and optional Snakemake workflows (Koster et al., 2012) to automate them either sequentially or in parallel. Most of the third-party tools used are either included or can be downloaded and installed automatically, except for GATK and Novoalign, which must be provided by the user due to licensing restrictions.

Molecular Evolution

Synonymous and nonsynonymous counts were mapped onto the molecular phylogenies using robust counting (O’Brien et al., 2009; Lemey et al., 2012) by specifying independent Hasegawa, Kishno, Yano (HKY) nucleotide substitution models (Hasegawa et al., 1985) for all 3 codon-position partitions. Substitutions in intergenic regions were modeled according to HKY with \(\Gamma_{4}\)-distributed rate heterogeneity (Hasegawa et al., 1985; Yang, 1994). A relaxed molecular clock with log-normal rate distribution categories (Drummond et al., 2006) and a non-parametric Bayesian skygrid (Gill et al., 2012) tree prior were used. A reference prior (Ferreira et al., 2008) was used on the molecular clock.

We estimated the ratio of nonsynonymous substitutions over synonymous substitutions, \(d_N/d_S\) or \(\omega\) in every gene of EBOV (NP, VP35, VP40, VP30, VP24 and L), using an implementation of the Goldman et al. (1994) codon model in BEAST (Drummond et al., 2012). We used the same sequences as the analysis above, but excluded sequences of potentially lower quality, resulting in 314 EBOV Makona genomes. GP-gene coding sequences were split into the mucin-like domain (GP1\(_{MLD}\)), which encompasses amino acid residues 313-464 (Lee et al., 2008) starting from methionine of GP1, and the rest of GP1,2 (GP\(_{\Delta MLD}\)). This split is due due to concern that the GP\(_{MLD}\) is highly disorganized (Lee et al., 2009) and thus is under little constraint at the amino acid level. To date, only linear epitopes in GP\(_{MLD}\) are known to be targeted by antibodies (Olal et al., 2012), due to its extensive O-, and N-linked glycosylation. We employed independent codon models for all 8 partitions, parameterized with independent strict molecular clocks. A reference prior (Ferreira et al., 2008) was used on the evolutionary rate. Substitutions in the ninth partition, with concatenated noncoding intergenic regions, was modeled using the HKY+\(\Gamma_{4}\) (Hasegawa et al., 1985; Yang, 1994) model. The non-parametric Bayesian skygrid was used as the tree prior (Gill et al., 2012) for both long-term and current datasets.

All BEAST analyses (inputs, outputs, scripts) are made available (Data S2).

#Supplemental Figures

Figure S1 - Phylogenetic and Temporal Context of Recent Tong, et al Samples. (A) 175 recently published Ebola virus Makona samples from Sierra Leone (Tong et al., 2015) describe lineages that fall within the genetic diversity of our current data set (MCC tree from BEAST, as in Figure 1). (B) They span a two month period (Sep 28 to Nov 11, 2014) that falls within the temporal sampling of our current data and shows a consistent evolutionary rate. Related to Figure 1.

Figure S2 - Tracing Historical Ebola Virus Makona Migrations from East to West. Related to Figure 1. (A) Nine Ebola virus (EBOV) Makona genomes (right-hand most circles) from the Freetown area with four groups of apparently ancestral EBOV genomes (middle circles)). Groups of genetically identical genomes (circles) are related to each other by simple vertical relationships (arrows). Solid circles are shown on the date of the earliest sample in the group; the circle area is proportional to the number of samples containing viruses with that genome; arrows represent a set of non-homoplasic SNPs and point from ancestral to derived alleles. Here, "SL3" and "SL4" do not refer to entire clades, but to the viruses that exactly match the canonical SL3 and SL4 genomes with no further mutations. (B) Geographic mapping of one epidemiological route that may account for four of the nine Freetown viruses shown in (A). Groups of identical viruses are shown at their first observed location.

Figure S3 - Ebola Virus Makona Intrahost Single-Nucleotide Variants (iSNVs). Related to Figure 2. (A) Distribution of the number of iSNVs per sample. Replicate sequencing and iSNV calling was completed for 150 samples, of which 65 had no iSNV calls. Mean iSNVs per sample (including samples without iSNVs) = 2.04; mean iSNVs per sample (among samples with iSNVs) = 3.6. (B) Sample coverage by date shows the temporal distribution of samples containing Ebola virus (EBOV) genomes with and without iSNV calls. As expected, samples with iSNV calls have generally higher coverage. (C) Intermediate-frequency variants can persist over time with minimal genetic drift, as demonstrated by the iSNV at position 18,911. The existence of intermediate frequency (10–30%) iSNVs in many different samples over time provides an argument against recurring mutations and may suggest a relatively wide transmission bottleneck between patients.

Figure S4 - Increased Sampling Improves Evolutionary Rate Estimates. Related to Figure 3. Rate estimates in the recent data set (Figure 3A) have much tighter credible intervals due to the significantly greater amount of time (total coalescent branch length) compared to the initial outbreak.

#Supplemental Tables

  • Table S1. Sample Metadata and Performance. Related to Experimental Procedures.
  • Table S2. Target Erosion. Assessment of current mutations on therapeutics and diagnostics. Related to Figure 4.

#Supplemental Data Files

  • Data S1. Intrahost Data. Related to Experimental Procedures.
  • Data S2. Interhost Data. Related to Experimental Procedures.

References

  1. SF Altschul, W Gish, W Miller, EW Myers, DJ Lipman. Basic local alignment search tool.. Journal of Molecular Biology 215, 403-10 (1990).

  2. Xiao Yang, Patrick Charlebois, Sante Gnerre, Matthew G Coole, Niall J Lennon, Joshua Z Levin, James Qu, Elizabeth M Ryan, Michael C Zody, Matthew R Henn. De novo assembly of highly diverse viral populations. BMC Genomics 13, 475 Springer Science \(\mathplus\) Business Media, 2012. Link

  3. SM Kiełbasa, R Wan, K Sato, P Horton, MC Frith. Adaptive seeds tame genomic sequence comparison.. Genome Research 21, 487-93 (2011).

  4. A. M. Bolger, M. Lohse, B. Usadel. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 Oxford University Press (OUP), 2014. Link

  5. R. Schmieder, R. Edwards. Quality control and preprocessing of metagenomic datasets. Bioinformatics 27, 863–864 Oxford University Press (OUP), 2011. Link

  6. Manfred G Grabherr, Brian J Haas, Moran Yassour, Joshua Z Levin, Dawn A Thompson, Ido Amit, Xian Adiconis, Lin Fan, Raktima Raychowdhury, Qiandong Zeng, Zehua Chen, Evan Mauceli, Nir Hacohen, Andreas Gnirke, Nicholas Rhind, Federica di Palma, Bruce W Birren, Chad Nusbaum, Kerstin Lindblad-Toh, Nir Friedman, Aviv Regev. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29, 644–652 Nature Publishing Group, 2011. Link

  7. Wan-Ping Lee, Michael P. Stromberg, Alistair Ward, Chip Stewart, Erik P. Garrison, Gabor T. Marth. MOSAIK: A Hash-Based Algorithm for Accurate Next-Generation Sequencing Short-Read Mapping. PLoS ONE 9, e90581 Public Library of Science (PLoS), 2014. Link

  8. RC Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput.. Nucleic Acids Res 32, 1792-7 (2004).

  9. A McKenna, M Hanna, E Banks, A Sivachenko, K Cibulskis, A Kernytsky, K Garimella, D Altshuler, S Gabriel, M Daly, MA DePristo. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.. Genome Research 20, 1297-303 (2010).

  10. MA DePristo, E Banks, R Poplin, KV Garimella, JR Maguire, C Hartl, AA Philippakis, Angel G del, MA Rivas, M Hanna, A McKenna, TJ Fennell, AM Kernytsky, AY Sivachenko, K Cibulskis, SB Gabriel, D Altshuler, MJ Daly. A framework for variation discovery and genotyping using next-generation DNA sequencing data.. Nat Genet 43, 491-8 (2011).

  11. GA Van der Auwera, MO Carneiro, C Hartl, R Poplin, Angel G Del, A Levy-Moonshine, T Jordan, K Shakir, D Roazen, J Thibault, E Banks, KV Garimella, D Altshuler, S Gabriel, MA DePristo. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline.. Curr Protoc Bioinformatics 11, 11.10.1-11.10.33 (2013).

  12. Xiao Yang, Patrick Charlebois, Alex Macalalad, Matthew R Henn, Michael C Zody. V-Phaser 2: variant inference for viral populations. BMC Genomics 14, 674 Springer Science \(\mathplus\) Business Media, 2013. Link

  13. Pablo Cingolani, Adrian Platts, Le Lily Wang, Melissa Coon, Tung Nguyen, Luan Wang, Susan J. Land, Xiangyi Lu, Douglas M. Ruden. A program for annotating and predicting the effects of single nucleotide polymorphisms SnpEff. Fly 6, 80–92 Informa UK Limited, 2012. Link

  14. Daniel Park, Irwin Jungreis, Chris Tomkins-Tinch, Mike Lin. viral-ngs: v1.0.0. (2015). Link

  15. J. Koster, S. Rahmann. Snakemake–a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 Oxford University Press (OUP), 2012. Link

  16. John D. O’Brien, Vladimir N. Minin, Marc A. Suchard. Learning to Count: Robust Estimates for Labeled Distances between Molecular Sequences. Molecular Biology and Evolution 26, 801–814 (2009). Link

  17. Philippe Lemey, Vladimir N. Minin, Filip Bielejec, Sergei L. Kosakovsky Pond, Marc A. Suchard. A counting renaissance: combining stochastic mapping and empirical Bayes to quickly detect amino acid sites under positive selection. Bioinformatics 28, 3248–3256 (2012). Link

  18. Masami Hasegawa, Hirohisa Kishino, Taka-aki Yano. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 22, 160–174 (1985). Link

  19. Ziheng Yang. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. Journal of Molecular Evolution 39, 306–314 (1994). Link

  20. Alexei J Drummond, Simon Y. W Ho, Matthew J Phillips, Andrew Rambaut. Relaxed Phylogenetics and Dating with Confidence. PLoS Biology 4, e88 (2006). Link

  21. Mandev S. Gill, Philippe Lemey, Nuno R. Faria, Andrew Rambaut, Beth Shapiro, Marc A. Suchard. Improving Bayesian Population Dynamics Inference: A Coalescent-Based Model for Multiple Loci. Molecular Biology and Evolution (2012). Link

  22. Marco A. R. Ferreira, Marc A. Suchard. Bayesian analysis of elapsed times in continuous-time Markov chains. Canadian Journal of Statistics 36, 355–368 (2008). Link

  23. N. Goldman, Z. Yang. A codon-based model of nucleotide substitution for protein-coding DNA sequences.. Molecular Biology and Evolution 11, 725–736 (1994). Link

  24. Alexei J Drummond, Marc A Suchard, Dong Xie, Andrew Rambaut. Bayesian Phylogenetics with BEAUti and the BEAST 1.7. Molecular Biology and Evolution 29 (2012). Link

  25. Jeffrey E. Lee, Marnie L. Fusco, Ann J. Hessell, Wendelien B. Oswald, Dennis R. Burton, Erica Ollmann Saphire. Structure of the Ebola virus glycoprotein bound to an antibody from a human survivor. Nature 454, 177–182 (2008). Link

  26. Jeffrey E Lee, Erica Ollmann Saphire. Neutralizing ebolavirus: structural insights into the envelope glycoprotein and antibodies targeted against it. Current Opinion in Structural Biology 19, 408–417 (2009). Link

  27. Daniel Olal, Ana I. Kuehne, Shridhar Bale, Peter Halfmann, Takao Hashiguchi, Marnie L. Fusco, Jeffrey E. Lee, Liam B. King, Yoshihiro Kawaoka, John M. Dye, Erica Ollmann Saphire. Structure of an Antibody in Complex with Its Mucin Domain Linear Epitope That Is Protective against Ebola Virus. Journal of Virology 86, 2809–2816 (2012). Link

  28. Yi-Gang Tong, Wei-Feng Shi, Di Liu, Jun Qian, Long Liang, Xiao-Chen Bo, Jun Liu, Hong-Guang Ren, Hang Fan, Ming Ni, Yang Sun, Yuan Jin, Yue Teng, Zhen Li, David Kargbo, Foday Dafae, Alex Kanu, Cheng-Chao Chen, Zhi-Heng Lan, Hui Jiang, Yang Luo, Hui-Jun Lu, Xiao-Guang Zhang, Fan Yang, Yi Hu, Yu-Xi Cao, Yong-Qiang Deng, Hao-Xiang Su, Yu Sun, Wen-Sen Liu, Zhuang Wang, Cheng-Yu Wang, Zhao-Yang Bu, Zhen-Dong Guo, Liu-Bo Zhang, Wei-Min Nie, Chang-Qing Bai, Chun-Hua Sun, Xiao-Ping An, Pei-Song Xu, Xiang-Li-Lan Zhang, Yong Huang, Zhi-Qiang Mi, Dong Yu, Hong-Wu Yao, Yong Feng, Zhi-Ping Xia, Xue-Xing Zheng, Song-Tao Yang, Bing Lu, Jia-Fu Jiang, Brima Kargbo, Fu-Chu He, George F. Gao, Wu-Chun Cao, Yi-Gang Tong, Jun Qian, Yang Sun, Hui-Jun Lu, Xiao-Guang Zhang, Fan Yang, Yi Hu, Yu-Xi Cao, Yong-Qiang Deng, Hao-Xiang Su, Yu Sun, Wen-Sen Liu, Zhuang Wang, Cheng-Yu Wang, Zhao-Yang Bu, Zhen-Dong Guo, Liu-Bo Zhang, Wei-Min Nie, Chang-Qing Bai, Chun-Hua Sun, Yong Feng, Jia-Fu Jiang, George F. Gao. Genetic diversity and evolutionary dynamics of Ebola virus in Sierra Leone. Nature Nature Publishing Group, 2015. Link

[Someone else is editing this]

You are editing this file