Supplemental materials for: Ebola virus epidemiology, transmission, and evolution during seven months in Sierra Leone
This work has been approved by Institutional Review Boards in Sierra Leone (Sierra Leone Ethics and Scientific Review Committee, SLESRC) and the United States (Harvard Committee on the Use of Human Subjects, CUHS, and the CDC’s Human Research Protection Office, HRPO). As part of the EVD outbreak response and surveillance efforts, residual human clinical samples were collected under a waiver of consent granted by SLESRC and CUHS, and the EBOV sequencing work has received non-human subjects research determination by CUHS and HRPO. The Sierra Leone Ministry of Health and Sanitation approved shipment of non-infectious, inactivated samples collected from EVD patients to Broad Institute and Harvard University for viral sequencing. The EBOV-related research and laboratory safety protocols are registered with the Committee of Microbiological Safety (COMS) at Harvard University, and the viral sequencing work is registered with the Institutional Biosafety Committee at Broad Institute. All work with infectious or potentially infectious material was performed at the CDC Viral Special Pathogens Branch in Atlanta, GA, under biosafety level 4 (BSL-4) conditions. Our work was not deemed to be dual-use research of concern.
The viral assembly pipeline began by depleting each sample's paired-end reads of human and other contaminant sequences using Best Match Tagger (BMTagger) (Kirill Rotmistrovsky, Richa Agarwala, BMTagger: Best Match Tagger for removing human reads from metagenomics datasets, 2011. ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/bmtagger/) and the nucleotide basic local alignment search tool (BLASTN) (Altschul et al., 1990). PCR duplicates were removed using M-Vicuna, a custom modification to Vicuna (Yang et al., 2012). The resulting "de-identified" metagenomic datasets were deposited in the Sequence Read Archive (SRA, BioProject IDs PRJNA257197 and PRJNA283385). Next, reads were filtered to retain those matching any member of the Ebolavirus genus (all ebolaviruses, including EBOV) using LASTAL (Kiełbasa et al., 2011), quality-trimmed with Trimmomatic (Bolger et al., 2014), and further de-duplicated with PRINSEQ (Schmieder et al., 2011).
The filtered and trimmed reads were subsampled to 100,000 pairs, if available, and de novo assembled using Trinity (Grabherr et al., 2011). Subsequently, reference-assisted assembly improvements (contig scaffolding, gap-filling, etc.) were performed with the Viral Finishing and Annotation Toolkit (V-FAT, http://www.broadinstitute.org/scientific-community/science/projects/viral-genomics/v-fat), which relies on MOSAIK (Lee et al., 2014) and multiple sequence comparison by log expectation (MUSCLE) (Edgar, 2004). Each sample's reads were aligned to its de novo assembly using Novoalign (http://novocraft.com/products/novoalign/), and any remaining duplicates were removed using the Picard MarkDuplicates command (http://broadinstitute.github.io/picard). Variant positions in each assembly were identified by running the Genome Analysis Toolkit (GATK, McKenna et al., 2010) IndelRealigner and UnifiedGenotyper tools (DePristo et al., 2011; Van der Auwera et al., 2013) on the read alignments. The assembly was refined to represent the major allele at each variant site, and any positions supported by fewer than three reads were changed to N (4-way ambiguity). This align-call-refine cycle was iterated twice to minimize reference bias in the assembly.
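The refinement rule described above (major allele at each site, N where support is too low) can be illustrated with a minimal Python sketch. This is not the pipeline's code: the input format (one string of observed bases per position), the function name, the reading of "supported by fewer than three reads" as total depth at the position, and the alphabetical tie-break are all assumptions made for illustration.

```python
from collections import Counter

MIN_DEPTH = 3  # from the text: fewer than three supporting reads becomes N


def refine_consensus(pileup_columns, min_depth=MIN_DEPTH):
    """Build a consensus from per-position base observations.

    pileup_columns is a hypothetical input: a list with one string per
    reference position, holding the bases seen in reads aligned there.
    """
    consensus = []
    for bases in pileup_columns:
        # Count only unambiguous bases; case is ignored.
        counts = Counter(b for b in bases.upper() if b in "ACGT")
        depth = sum(counts.values())
        if depth < min_depth:
            # Assumption: "support" is taken to mean total depth here.
            consensus.append("N")
        else:
            # Major allele; ties go to the alphabetically first base
            # (an arbitrary choice made only for this sketch).
            consensus.append(max(sorted(counts), key=counts.get))
    return "".join(consensus)


example = refine_consensus(["AAAA", "AC", "GGT", "ccc", ""])
```

In the workflow described above, a step like this would run after each round of alignment and variant calling, with the refined sequence serving as the alignment target for the next round.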
Intrahost single-nucleotide variants (iSNVs) were called from each sample's read alignments using V-Phaser2 (Yang et al., 2013) and subjected to an initial set of filters: variant calls with fewer than five forward or reverse reads or more than a 10-fold strand bias were eliminated. iSNVs were also removed if there was more than a 5-fold difference between the strand bias of the variant call and the strand bias of the reference call. Variant calls that passed these filters were additionally subjected to a 0.5% frequency filter. The final list of iSNVs contains only variant calls that passed all filters in two separate library preparations. Annotated iSNV calls are available in variant call format (VCF) and tabular formats (Data S1). For each position at which an iSNV was called in any sample, these files infer a 100% allele frequency for every sample that showed no intrahost variation at that position but had a clear consensus call during assembly. Annotations were computed with the SnpEff program (Cingolani et al., 2012).
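The filter cascade above can be sketched in Python as follows. The thresholds come from the text; everything else is an assumption for illustration, including the `Call` record, the definition of strand bias as the ratio of forward to reverse read counts, the rejection of calls whose reference allele has no reads on one strand, and the computation of frequency over variant plus reference reads only. It does not reproduce V-Phaser2 output or the pipeline's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """Strand-specific read counts for one variant call (hypothetical record)."""
    pos: int
    alt: str
    alt_fwd: int
    alt_rev: int
    ref_fwd: int
    ref_rev: int


def fold(a, b):
    """Fold difference between two positive numbers, always >= 1."""
    return max(a, b) / min(a, b)


def passes_filters(c, min_reads=5, max_bias=10.0, max_rel_bias=5.0,
                   min_freq=0.005):
    # At least five variant-supporting reads on each strand.
    if c.alt_fwd < min_reads or c.alt_rev < min_reads:
        return False
    # No more than 10-fold strand imbalance for the variant allele.
    if fold(c.alt_fwd, c.alt_rev) > max_bias:
        return False
    # Assumption: a reference allele seen on only one strand cannot be
    # compared, so the call is rejected.
    if c.ref_fwd == 0 or c.ref_rev == 0:
        return False
    # No more than 5-fold difference between variant and reference bias.
    alt_ratio = c.alt_fwd / c.alt_rev
    ref_ratio = c.ref_fwd / c.ref_rev
    if fold(alt_ratio, ref_ratio) > max_rel_bias:
        return False
    # Assumption: frequency is computed over variant + reference reads only.
    total = c.alt_fwd + c.alt_rev + c.ref_fwd + c.ref_rev
    return (c.alt_fwd + c.alt_rev) / total >= min_freq


def confirmed_isnvs(lib1, lib2):
    """Keep (position, allele) pairs that pass all filters in both libraries."""
    keep1 = {(c.pos, c.alt) for c in lib1 if passes_filters(c)}
    keep2 = {(c.pos, c.alt) for c in lib2 if passes_filters(c)}
    return sorted(keep1 & keep2)


lib1 = [Call(100, "T", 20, 25, 900, 950),    # passes every filter
        Call(200, "G", 3, 40, 500, 500),     # too few forward reads
        Call(300, "A", 10, 10, 5000, 5000)]  # frequency below 0.5%
lib2 = [Call(100, "T", 15, 12, 800, 820),
        Call(300, "A", 50, 50, 1000, 1000)]
confirmed = confirmed_isnvs(lib1, lib2)
```

Requiring agreement between two independent library preparations is what removes the position-300 call here: it passes in the second library but falls below the frequency threshold in the first.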
Our Linux-based software pipeline is publicly available at https://github.com/broadinstitute/viral-ngs (Park et al., 2015). This pipeline includes command-line tools for each of the above steps and optional Snakemake workflows (Koster et al., 2012) to automate them either sequentially or in parallel. Most of the third-party tools used are either included or can be downloaded and installed automatically, except for GATK and Novoalign, which must be provided by the user due to licensing restrictions.