Supplemental Material for WASP: allele-specific methods for unbiased discovery of molecular quantitative trait loci
This block failed to display. Double-click this text to correct any errors. Your changes are saved.
To detect differences in molecular phenotypes from sequencing data it is essential to remove read mapping biases, which are a major source of false positives. The WASP read mapping pipeline accomplishes this task by ensuring that the mapping of each individual read is unbiased.
In the WASP mapping pipeline, the user first maps reads to the genome using any mapper that outputs BAM or SAM format. For example, ChIP-seq reads can be mapped by BWA (Li 2009) or Bowtie 2 (Langmead 2012), and RNA-seq reads can be mapped using tophat (Trapnell 2009). WASP then identifies mapped reads that overlap with known polymorphisms. For each read that overlaps a polymorphism, all possible allelic combinations that differ from the original read are generated and re-mapped to the genome. For example, when a read overlaps two bi-allelic SNPs four allelic combinations are possible, three of which differ from the original read. The original read is discarded if any of the allelic combinations map non-uniquely or map to another location.
This simple method has the advantages that it works with almost any existing mapping pipeline and it handles reads with sequencing errors, which are a major source of biased mapping (Degner 2009).
To test the accuracy of allelic mapping using WASP, we simulated 100 bp reads from a lymphoblastoid cell line (GM12878) that has been genotyped by the 1000 Genomes and HapMap projects. We additionally imputed and phased genotypes for this cell line with IMPUTE2 (Howie 2009) using the 1000 Genomes Phase1 integrated version 3 reference panel.
For each test, we evaluated the performance of WASP compared to mapping to a personal or N-masked genome. To create an N-masked genome, we created a copy of the hg19 genome with Ns in place of known variants from the GM12878 cell line. We similarly created maternal and paternal copies of GM12878 using the phased genotypes. We mapped the simulated reads to the original, N-masked, and personal versions of the hg19 genome with BWA (Li 2009) allowing up to 2 mismatches per read (\(aureplacedverbaa \)), and excluding gapped alignments (\(aureplacedverbaaa \)). For the personal genome, we kept a read if it mapped uniquely to either genome copy. If it mapped to both genomes, we kept the location with the highest mapping quality (ties were broken randomly).
We first identified each base where a read starting at that base would overlap a heterozygous site. We generated reads from each haplotype while introducing identical sequencing errors at a predefined rate. For each mapping type, we considered the mapping of a read to be biased if the read from one haplotype mapped to the correct location but the other did not.
For each heterozygous site, we simulated 100 reads from random bases that overlap the chosen SNP. We chose the haplotype of each simulated read at random. Reads from peaks without effects came from haplotype 1 vs haplotype 2 with a 1:1 ratio. Reads from peaks with effects were simulated with ratios ranging from 1.3:1 to 2.5:1 to test a range of effect sizes.
For each effect size, we simulated sets of peaks that were composed of \(90\%\) null peaks and \(10\%\) peaks with effects. We mapped the sets with each mapping scheme and performed a binomial test for imbalance on each peak, calling significantly imbalanced loci at 10% FDR. We finally assessed the fraction of significant loci that came from the null peaks. If there is no mapping induced imbalance, this should be 10%.