Bryce van de Geijn edited Testing against N masked.tex over 9 years ago

Commit id: 06eab90bf8e14a7edda8541058255b8df713f74b

\subsection{Comparing WASP mapping to N-masked and personal genome mapping}

\subsubsection{Quantifying fraction of reads showing imbalance}
% TODO: verify that IMPUTE ref panel info correct
To evaluate the accuracy of allelic mapping using WASP, we simulated 100 bp reads from a lymphoblastoid cell line (GM12878) that has been genotyped by the 1000 Genomes and HapMap projects. We additionally imputed and phased genotypes for this cell line with IMPUTE2 \cite{Howie_Donnelly_Marchini_2009} using the 1000 Genomes Phase1 integrated version 3 reference panel.

We evaluated the performance of WASP compared to mapping to a personal or N-masked genome. To create an N-masked genome, we created a copy of the hg19 genome with Ns in place of known variants from the GM12878 cell line. We similarly created maternal and paternal copies of the GM12878 genome using the phased genotypes. We mapped the simulated reads to the original, N-masked, and personal versions of the hg19 genome with BWA \cite{Li_2009}, allowing up to 2 mismatches per read (\verb|-n 2|) and excluding gapped alignments (\verb|-o 0|). For the personal genome mapping, we kept a read if it mapped uniquely to either genome copy. If the read mapped to both copies, we kept the location with the higher mapping quality (ties were broken randomly). We then ran the reads mapped to the original genome through the WASP pipeline. Finally, we calculated the rate of biased mapping for the WASP, N-masked, and personal genome mapping approaches at several different sequencing error rates.

\subsubsection{Assessing the effects of mapping bias on an allele-specific study}
To evaluate the effects of mapping bias on an allele-specific study, we again simulated 100 bp reads from GM12878. However, this time we simulated peaks of 100 reads around heterozygous SNPs.
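
For illustration only, the N-masking step and the personal genome read-selection rule from the mapping comparison above could be sketched as follows. This is a minimal sketch and not the code used in this study: the function names, the \verb|Alignment| record, and the convention that \verb|None| means ``did not map uniquely to this copy'' are all assumptions introduced here.

```python
import random
from typing import Iterable, NamedTuple, Optional


class Alignment(NamedTuple):
    """Hypothetical record for one uniquely mapped read location."""
    chrom: str
    pos: int
    mapq: int


def mask_reference(sequence: str, variant_positions: Iterable[int]) -> str:
    """Return a copy of `sequence` with 'N' at each 0-based variant position."""
    bases = list(sequence)
    for pos in variant_positions:
        bases[pos] = "N"
    return "".join(bases)


def choose_alignment(
    maternal: Optional[Alignment],
    paternal: Optional[Alignment],
    rng: random.Random = random.Random(),
) -> Optional[Alignment]:
    """Select the location to keep for a read mapped to both genome copies.

    Assumption: an argument is None when the read did not map uniquely
    to that copy. The read is discarded (None) if it mapped to neither.
    """
    if maternal is None:
        return paternal
    if paternal is None:
        return maternal
    # Mapped to both copies: keep the higher mapping quality.
    if maternal.mapq != paternal.mapq:
        return maternal if maternal.mapq > paternal.mapq else paternal
    # Equal mapping quality: break the tie randomly.
    return rng.choice([maternal, paternal])


if __name__ == "__main__":
    print(mask_reference("ACGTACGT", [1, 4]))  # ANGTNCGT
    mat = Alignment("chr1", 1000, 37)
    pat = Alignment("chr1", 1000, 20)
    print(choose_alignment(mat, pat))
```

Passing a seeded \verb|random.Random| instance as \verb|rng| makes the random tie-breaking reproducible across runs.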