David Andrew Eccles added subsection_Linkage_Refinement_label_sec__.tex  almost 9 years ago

Commit id: 8d4e95079e56e8769affe781e4c2f0bf4f244d15


\subsection{Linkage Refinement}
\label{sec:meth-summ-refinement}

Linked SNPs were removed from the bootstrap-consistent SNP set in
order to reduce the redundancy of associative signal produced by the
generated SNP set. Markers were ordered based on mean rank order and
any SNPs that were linked ($r^2 > 0.1$) with a higher-ranked SNP were
removed from the set, leaving an \emph{unlinked set} of 34 SNPs.

Markers within a signature marker set should be unlinked, so it is a
good idea to calculate a linkage-associated statistic such as
$D^\prime$ or $r^2$ during the discovery phase of the analysis, and
remove the least informative marker among linked high-association
pairs. This step is carried out after the bootstrap sub-sampling
process in order to reduce the number of pairwise calculations
required for linkage analysis -- pairwise calculations for 500 markers
would require 124,750 linkage comparisons,\footnote{$124{,}750 = (500^2 -
  500) / 2$} while pairwise calculations
on 500,000 markers would require around $1.25\times10^{11}$
comparisons.
% $(x^2 - x) / 2$

\subsection{Set Size Refinement}
\label{sec:sig-thy-eval-effect-mark}

\begin{figure}
  \centering
  \includegraphics[width=0.95\textwidth]{figures/AUC_T1D_1-34_r2filtered.pdf}
  \caption[Marker Refinement Plot]{A marker refinement plot, showing
    the effectiveness score (AUC) for increasing numbers of SNPs in
    the discovery group. The highest AUC value (0.835 for 5 SNPs) is
    circled in red.}
  \label{fig:sig-thy-marker-refinement}
\end{figure}

The optimal marker set size was identified using an Area Under the
Curve (AUC) test on the Q-values generated by \textsl{structure} (10,000
bootstraps, and 100,000 total runs), finding marker sets with large
differences in mean Q value between the two groups (see
Figure~\ref{fig:sig-thy-marker-refinement}).
Increasing numbers of
markers were selected from the unlinked SNP set based on mean rank
order identified during the previous (bootstrap sub-sampling) stage.
The effectiveness of a given set of markers was evaluated using the
\textsl{structure} program, followed by an AUC calculation for
each set of markers based on Q values reported by the program.

The \textsl{structure} program outputs values that represent how
genetically similar an individual is to a particular group (Q values),
attempting to cluster pooled individuals into two
``populations''.\footnote{The \textsl{structure} program is designed
  for \emph{population} analysis, but is used here for \emph{group}
  analysis.} The Q values produced by \textsl{structure} are
continuous in the range between 0 and 1 inclusive, and are treated as
an estimate of the probability that an individual has a particular
trait.

Analysis of Q values was used to determine false positive and true
positive rates for given Q-value cutoffs (see
Figure~\ref{fig:t1d-validation-top5-ROC-analysis}). The true positive
rate was calculated as the proportion of T1D cases with Q below the
cutoff value, and the false positive rate was calculated in the same way
for NBS controls. The area under the resulting curve can be used
as an indication of the effectiveness of a quantitative test. An AUC
of 1 indicates a perfect test (no misclassification), while an AUC of
0.5 indicates a test that cannot distinguish between groups.

The greatest difference between cases and controls was observed when
the top 5 SNPs were selected, producing an AUC of 0.8449. This
\emph{signature set} of 5 SNPs was considered to be the most
appropriate T1D-informative set.

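
The rates described above can be written out explicitly. This is a
sketch only, using notation that is introduced here rather than taken
from the original analysis: let $Q_i$ ($i = 1, \dots,
n_{\mathrm{case}}$) be the Q values for T1D cases, let $Q'_j$ ($j = 1,
\dots, n_{\mathrm{ctrl}}$) be the Q values for NBS controls, let $c$ be
a Q-value cutoff, and let $\mathbf{1}[\cdot]$ be the indicator
function. Then
\[
  \mathrm{TPR}(c) = \frac{1}{n_{\mathrm{case}}}
    \sum_{i=1}^{n_{\mathrm{case}}} \mathbf{1}[Q_i < c],
  \qquad
  \mathrm{FPR}(c) = \frac{1}{n_{\mathrm{ctrl}}}
    \sum_{j=1}^{n_{\mathrm{ctrl}}} \mathbf{1}[Q'_j < c].
\]
Plotting $\mathrm{TPR}(c)$ against $\mathrm{FPR}(c)$ as $c$ is varied
gives the ROC curve. If the area under this curve is computed with the
trapezoidal rule (an assumption, as the integration method is not
stated here), it equals the proportion of case--control pairs in which
the case has the lower Q value, with ties counted as one half:
\[
  \mathrm{AUC} = \frac{1}{n_{\mathrm{case}}\, n_{\mathrm{ctrl}}}
    \sum_{i=1}^{n_{\mathrm{case}}} \sum_{j=1}^{n_{\mathrm{ctrl}}}
    \Bigl( \mathbf{1}[Q_i < Q'_j]
      + \frac{1}{2}\, \mathbf{1}[Q_i = Q'_j] \Bigr).
\]
Under this formulation, an AUC of 1 corresponds to every case having a
lower Q value than every control, and an AUC of 0.5 corresponds to
cases and controls being ordered no better than chance.
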
\subsection{Validation of Final 5 SNP Set}
\label{sec:meth-summ-validation}

The signature set of 5 SNPs (see Table~\ref{tab:top5-snps-t1d}) was
finally tested on the validation group (982 T1D cases, 729 NBS
controls) using \textsl{structure}, followed by an AUC analysis of the Q
values. There is a small overlap between some T1D cases and some NBS
controls (Figure~\ref{fig:t1d-validation-structure-top5}), but most
T1D cases cluster together, and are separate from the cluster of NBS
controls.

\begin{table}
  \centering
  \begin{tabular}{ccccc}
    \textbf{Marker} & \textbf{Chromosome} &
    \textbf{Location (bp)} & \textbf{$\chi^2$} & \textbf{Mean Rank}\\\hline
    rs9273363 & 6 & 32734250 & 485 & 1\\
    rs3957146 & 6 & 32789508 & 317 & 2.2\\
    rs3135377 & 6 & 32493377 & 264 & 4.3\\
    rs7431934 & 3 & 40268801 & 199 & 13.7\\
    rs1046089 & 6 & 31710946 & 108 & 37.9\\
    \hline
  \end{tabular}
  \caption[T1D SNP Location table]{Location
    information for the top 5 SNPs discovered in a bootstrap
    sub-sampled GWAS for T1D associations, after removing linked SNPs,
    and choosing the set with the highest AUC value. Mean rank
    reported in this table is based on the marker rank for 100
    bootstrap sub-samples. Out of the five
    markers, four are within a 2\,Mb region of chromosome 6.}
  \label{tab:top5-snps-t1d}
\end{table}
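
The extent of the chromosome 6 cluster can be checked from the
positions listed in Table~\ref{tab:top5-snps-t1d}. Assuming that the
listed locations are base-pair coordinates on a single reference
assembly (the assembly is not stated here), the distance between the
outermost chromosome 6 markers, rs1046089 and rs3957146, is
\[
  32{,}789{,}508 - 31{,}710{,}946 = 1{,}078{,}562~\mathrm{bp}
  \approx 1.08~\mathrm{Mb},
\]
so all four chromosome 6 markers fall within a span of just over 1\,Mb.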