
Bryce van de Geijn edited Testing against N masked.tex over 9 years ago

Commit id: c6fd0f10cc103deddfc269bc715bf9cac55a98e8


\subsection{Comparing WASP mapping to N-masked and personal genome mapping}

% TODO: verify that IMPUTE ref panel info correct
To evaluate the accuracy of allelic mapping using WASP, we simulated 100 bp reads from a lymphoblastoid cell line (GM12878) that has been genotyped by the 1000 Genomes and HapMap projects. We additionally imputed and phased genotypes for this cell line with IMPUTE2 \cite{Howie_Donnelly_Marchini_2009} using the 1000 Genomes Phase1 integrated version 3 reference panel. For each test, we compared the performance of WASP with that of mapping to a personal or N-masked genome. To create an N-masked genome, we made a copy of the hg19 genome with Ns in place of known variants from the GM12878 cell line. We similarly created maternal and paternal copies of the GM12878 genome using the phased genotypes. We mapped the simulated reads to the original, N-masked, and personal versions of the hg19 genome with BWA \cite{Li_2009}, allowing up to 2 mismatches per read (\verb|-n 2|) and excluding gapped alignments (\verb|-o 0|). For the personal genome mapping, we kept a read if it mapped uniquely to either genome copy. If the read mapped to both genomes, we kept the location with the highest mapping quality (ties were broken randomly).

\subsubsection{Quantifying fraction of reads showing imbalance}

We first identified each base where a read starting at that base would overlap a heterozygous site. We generated reads from each haplotype, introducing identical sequencing errors at a predefined rate. We then ran the reads mapped to the original genome through the WASP pipeline. For each mapping type, we considered the mapping of a read to be biased if the read from one haplotype mapped to the correct location but the read from the other haplotype did not. Finally, we calculated the rate of biased mapping for the WASP, N-masked, and personal genome mapping approaches at several different sequencing error rates.

\subsubsection{Assessing the effects of mapping bias on an allele-specific study}

To evaluate the effects of mapping bias on an allele-specific study, we again simulated 100 bp reads from GM12878. This time, however, we simulated peaks of 100 reads around heterozygous SNPs. For each heterozygous site, we simulated 100 reads starting at random bases such that every read overlaps the chosen SNP. We chose the haplotype of each simulated read at random. Reads from peaks without effects came from haplotype 1 and haplotype 2 at a 1:1 ratio. Reads from peaks with effects were simulated with ratios ranging from 1.3:1 to 2.5:1 to test a range of effect sizes. For each effect size, we simulated sets of peaks that were 90\% from the null.
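
As an illustration only, the haplotype assignment used in the peak simulation could be sketched in Python as follows. This is not the code used in the study. The function names, the choice of haplotype 1 as the favoured haplotype, the independent random draw that decides whether each peak is null, and the fixed random seed are all assumptions made for the example. A ratio of $r$:1 is interpreted here as each read coming from haplotype 1 with probability $r/(r+1)$.

```python
import random


def simulate_peak_haplotypes(ratio, n_reads=100, rng=None):
    """Assign each read in a peak to haplotype 1 or 2.

    `ratio` is the haplotype 1 : haplotype 2 ratio expressed as ratio:1,
    so ratio=1.0 gives a null peak (assumed interpretation).
    """
    rng = rng or random.Random()
    p_hap1 = ratio / (ratio + 1.0)
    return [1 if rng.random() < p_hap1 else 2 for _ in range(n_reads)]


def simulate_peak_set(n_peaks, effect_ratio, null_fraction=0.9,
                      n_reads=100, seed=0):
    """Simulate a set of peaks in which about `null_fraction` are null.

    Null status is drawn independently per peak (an assumption); the
    text only states that 90% of peaks in each set come from the null.
    """
    rng = random.Random(seed)
    peaks = []
    for _ in range(n_peaks):
        is_null = rng.random() < null_fraction
        ratio = 1.0 if is_null else effect_ratio
        haps = simulate_peak_haplotypes(ratio, n_reads, rng)
        peaks.append({
            "is_null": is_null,
            "hap1": haps.count(1),  # reads drawn from haplotype 1
            "hap2": haps.count(2),  # reads drawn from haplotype 2
        })
    return peaks


if __name__ == "__main__":
    # Hypothetical usage: 1000 peaks with a 2:1 effect size.
    peaks = simulate_peak_set(1000, effect_ratio=2.0)
    n_null = sum(p["is_null"] for p in peaks)
    print(n_null, "null peaks out of", len(peaks))
```

The sketch covers only the choice of haplotype for each read. Drawing read start positions around the SNP, adding sequencing errors, and mapping the reads are left out.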