
David Andrew Eccles added subsection_Linkage_Refinement_label_sec__.tex almost 9 years ago

Commit id: 8d4e95079e56e8769affe781e4c2f0bf4f244d15


\subsection{Linkage Refinement}
\label{sec:meth-summ-refinement}

Linked SNPs were removed from the bootstrap-consistent SNP set to reduce redundancy in the association signal carried by the set. Markers were ordered by mean rank, and any SNP that was linked ($r^2 > 0.1$) with a higher-ranked SNP was removed, leaving an \emph{unlinked set} of 34 SNPs. Markers within a signature marker set should be unlinked, so a linkage-associated statistic such as $D^\prime$ or $r^2$ should be calculated during the discovery phase of the analysis, and the least informative marker among linked high-association pairs removed. This step is carried out after the bootstrap sub-sampling process to reduce the number of pairwise calculations required for linkage analysis: pairwise calculations for 500 markers require 124,750 linkage comparisons,\footnote{$124{,}750 = (500^2 - 500) / 2$} whereas pairwise calculations on 500,000 markers would require around $1.25\times10^{11}$ comparisons.
% $(x^2 - x) / 2$

\subsection{Set Size Refinement}
\label{sec:sig-thy-eval-effect-mark}

\begin{figure}
  \centering
  \includegraphics[width=0.95\textwidth]{figures/AUC_T1D_1-34_r2filtered.pdf}
  \caption[Marker Refinement Plot]{A marker refinement plot, showing the effectiveness score (AUC) for increasing numbers of SNPs in the discovery group. The highest AUC value (0.835 for 5 SNPs) is circled in red.}
  \label{fig:sig-thy-marker-refinement}
\end{figure}

The optimal marker set size was identified using an Area Under the Curve (AUC) test on the Q values generated by \textsl{structure} (10,000 bootstraps, and 100,000 total runs), finding marker sets with large differences in mean Q value between the two groups (see Figure~\ref{fig:sig-thy-marker-refinement}). Increasing numbers of markers were selected from the unlinked SNP set, in order of the mean rank identified during the previous (bootstrap sub-sampling) stage.
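
As a compact restatement of this selection step, let $m_1, \dots, m_{34}$ denote the unlinked SNPs in order of increasing mean rank. The symbols $m_i$, $S_k$ and $k^\ast$ are introduced here purely for illustration and are not part of the original analysis. Assuming that each candidate set simply adds the next-ranked marker to the previous one, the candidate sets are nested and the chosen set size is the one with the highest AUC:
\[
  S_k = \{ m_1, \dots, m_k \},
  \qquad
  k^\ast = \mathop{\mathrm{arg\,max}}\limits_{1 \le k \le 34} \; \mathrm{AUC}(S_k) .
\]
Under this reading, only 34 marker sets need to be evaluated, rather than every possible subset of the unlinked SNPs.
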
The effectiveness of a given set of markers was evaluated using the \textsl{structure} program, followed by an AUC calculation for each set of markers based on the Q values reported by the program. The \textsl{structure} program outputs values that represent how genetically similar an individual is to a particular group (Q values), attempting to cluster pooled individuals into two ``populations''.\footnote{The \textsl{structure} program is designed for \emph{population} analysis, but is used here for \emph{group} analysis.} The Q values produced by \textsl{structure} are continuous, lie between 0 and 1 inclusive, and are treated as an estimate of the probability that an individual has a particular trait. Analysis of Q values was used to determine false positive and true positive rates for given Q-value cutoffs (see Figure~\ref{fig:t1d-validation-top5-ROC-analysis}). The true positive rate was calculated as the proportion of T1D cases with Q below the cutoff value, and the false positive rate was calculated in the same way for NBS controls. The area under the curve of this graph can be used as an indication of the effectiveness of a quantitative test. An AUC of 1 indicates a perfect test (no misclassification), while an AUC of 0.5 indicates a test that cannot distinguish between groups. The greatest difference between cases and controls was observed when the top 5 SNPs were selected, producing an AUC of 0.8449. This \emph{signature set} of 5 SNPs was considered to be the most appropriate T1D-informative set.

\subsection{Validation of Final 5 SNP Set}
\label{sec:meth-summ-validation}

The signature set of 5 SNPs (see Table~\ref{tab:top5-snps-t1d}) was then tested on the validation group (982 T1D cases, 729 NBS controls) using \textsl{structure}, followed by an AUC analysis of the Q values.
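
The AUC analysis can be written out explicitly. The notation below ($c$, $Q_i$, $\mathrm{TPR}$, $\mathrm{FPR}$) is introduced here for illustration only; it restates the true and false positive rates defined in the previous section, assuming that the curve is traced by varying the cutoff $c$ from 0 to 1:
\[
  \mathrm{TPR}(c) = \frac{\#\{ i \in \mbox{T1D cases} : Q_i < c \}}{\#\{\mbox{T1D cases}\}},
  \qquad
  \mathrm{FPR}(c) = \frac{\#\{ j \in \mbox{NBS controls} : Q_j < c \}}{\#\{\mbox{NBS controls}\}},
\]
\[
  \mathrm{AUC} = \int_0^1 \mathrm{TPR} \; d(\mathrm{FPR}) .
\]
Read this way, the AUC equals the probability that a randomly chosen T1D case has a lower Q value than a randomly chosen NBS control, with ties counted as one half.
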
There is a small overlap between some T1D cases and some NBS controls (Figure~\ref{fig:t1d-validation-structure-top5}), but most T1D cases cluster together, and are separate from the cluster of NBS controls.

\begin{table}
  \centering
  \begin{tabular}{ccccc}
    \textbf{Marker} & \textbf{Chromosome} & \textbf{Location (bp)} & \textbf{$\chi^2$} & \textbf{Mean Rank}\\\hline
    rs9273363 & 6 & 32734250 & 485 & 1\\
    rs3957146 & 6 & 32789508 & 317 & 2.2\\
    rs3135377 & 6 & 32493377 & 264 & 4.3\\
    rs7431934 & 3 & 40268801 & 199 & 13.7\\
    rs1046089 & 6 & 31710946 & 108 & 37.9\\
    \hline
  \end{tabular}
  \caption[T1D SNP Location table]{Location information for the top 5 SNPs discovered in a bootstrap sub-sampled GWAS for T1D associations, after removing linked SNPs, and choosing the set with the highest AUC value. Mean rank reported in this table is based on the marker rank for 100 bootstrap sub-samples. Out of the five markers, four are within a 2~Mb region of chromosome 6.}
  \label{tab:top5-snps-t1d}
\end{table}