Linkage Refinement

\label{sec:meth-summ-refinement}

Linked GSMs were removed in order to reduce the redundancy of associative signal produced by the generated GSM set. Markers were ordered based on mean rank order and any SNPs that were linked (\(r^{2}>0.1\)) with a higher-ranked GSM were removed from the set, leaving an unlinked set of 34 GSMs.
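The rank-ordered removal described above can be sketched as a greedy filter. This is a minimal illustration rather than the pipeline actually used: the function names `r_squared` and `prune_linked` are hypothetical, \(r^{2}\) is approximated by the squared correlation of genotype allele counts (0/1/2), and each marker is compared only against higher-ranked markers that have already been kept.

```python
from typing import Dict, List, Sequence


def r_squared(a: Sequence[int], b: Sequence[int]) -> float:
    """Squared Pearson correlation between two genotype vectors (allele counts).

    Assumption: genotype correlation stands in for haplotype-based r^2.
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # monomorphic marker: correlation undefined, treated as unlinked
    return cov * cov / (var_a * var_b)


def prune_linked(
    ranked: List[str],
    genotypes: Dict[str, Sequence[int]],
    threshold: float = 0.1,
) -> List[str]:
    """Keep markers in rank order, dropping any linked to a kept, higher-ranked one."""
    kept: List[str] = []
    for marker in ranked:  # best mean rank first
        linked = any(
            r_squared(genotypes[marker], genotypes[k]) > threshold for k in kept
        )
        if not linked:
            kept.append(marker)
    return kept
```

Because the list is traversed from best to worst mean rank, the marker that survives from any linked pair is always the higher-ranked one.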

Markers within a signature marker set should be unlinked, so it is a good idea to calculate a linkage-associated statistic such as \(D^{\prime}\) or \(r^{2}\) during the discovery phase of the analysis, and remove the least informative marker among linked high-association pairs. This step is carried out after the bootstrap sub-sampling process in order to reduce the number of pairwise calculations required for linkage analysis – pairwise calculations for 500 markers would require 124,750 linkage comparisons,\footnote{\(124,750=(500^{2}-500)/2\)} while pairwise calculations on 500,000 markers would require around \(1.25\times 10^{11}\) comparisons.
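The comparison counts quoted above follow from the number of unordered marker pairs, \((n^{2}-n)/2\). A short check, using a hypothetical helper name `pairwise_comparisons`:

```python
from math import comb


def pairwise_comparisons(n_markers: int) -> int:
    """Number of unordered marker pairs, equal to (n^2 - n) / 2."""
    return comb(n_markers, 2)


print(pairwise_comparisons(500))      # 124750
print(pairwise_comparisons(500_000))  # 124999750000, i.e. about 1.25e11
```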

Set Size Refinement

\label{sec:sig-thy-eval-effect-mark}

The optimal marker set size was identified using an Area Under the Curve (AUC) test on the Q-values generated by structure (10,000 bootstraps, and 100,000 total runs), finding marker sets with large differences in mean Q value between the two groups (see Figure \ref{fig:sig-thy-marker-refinement}). Increasing numbers of markers were selected from the unlinked GSM set based on mean rank order identified during the previous (bootstrap sub-sampling) stage. The effectiveness of a given set of markers was evaluated using the structure program, followed by an AUC calculation for each set of markers based on Q values reported by the program.

The structure program outputs values (Q values) that represent how genetically similar an individual is to a particular group, attempting to cluster pooled individuals into two “populations”.\footnote{The structure program is designed for population analysis, but is used here for group analysis.} The Q values produced by structure are continuous in the range between 0 and 1 inclusive, and are treated as an estimate of the probability that an individual has a particular trait.

Analysis of Q values was used to determine false positive and true positive rates for given Q-value cutoffs (see Figure \ref{fig:t1d-validation-top5-ROC-analysis}). The true positive rate was calculated as the proportion of T1D cases with Q below the cutoff value, and the false positive rate was calculated in the same way for NBS controls. Plotting the true positive rate against the false positive rate across all cutoffs gives a receiver operating characteristic (ROC) curve, and the area under this curve can be used as an indication of the effectiveness of a quantitative test. An AUC of 1 indicates a perfect test (no misclassification), while an AUC of 0.5 indicates a test that cannot distinguish between groups.
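The rate and area calculations described above can be sketched as follows. This is an illustrative sketch, not the code used in the analysis: the function names `roc_points` and `auc` are hypothetical, the cutoffs are taken to be the observed Q values plus one cutoff above the maximum, and the area is approximated with the trapezoid rule.

```python
from typing import List, Sequence, Tuple

RocPoint = Tuple[float, float, float]  # (false positive rate, true positive rate, cutoff)


def roc_points(case_q: Sequence[float], control_q: Sequence[float]) -> List[RocPoint]:
    """ROC points for the rule 'call positive when Q is below the cutoff'."""
    # Every observed Q value is a candidate cutoff; infinity classifies everyone positive.
    cutoffs = sorted(set(case_q) | set(control_q)) + [float("inf")]
    points = []
    for c in cutoffs:
        tpr = sum(q < c for q in case_q) / len(case_q)        # cases below the cutoff
        fpr = sum(q < c for q in control_q) / len(control_q)  # controls below the cutoff
        points.append((fpr, tpr, c))
    return points


def auc(points: Sequence[RocPoint]) -> float:
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0, _), (x1, y1, _) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

With toy inputs in which every case has a lower Q value than every control, `auc(roc_points(cases, controls))` evaluates to 1; when the two groups have identical Q distributions it evaluates to 0.5.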

The greatest difference between cases and controls was observed when the top 5 GSMs were selected, producing an AUC of 0.8449. This signature set of 5 GSMs was considered to be the most appropriate T1D-informative set.
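The set-size search can be outlined as a loop over the top \(k\) markers for increasing \(k\). In this sketch `evaluate_auc` is a hypothetical callable standing in for the expensive step (running structure on a marker subset and computing the AUC of the resulting Q values); how that step is invoked is not shown here and is an assumption of the example.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def best_set_size(
    ranked_markers: Sequence[str],
    evaluate_auc: Callable[[List[str]], float],
    max_size: Optional[int] = None,
) -> Tuple[int, Dict[int, float]]:
    """Evaluate the top-k markers for k = 1..max_size and return the best k.

    ranked_markers must already be unlinked and sorted by mean rank.
    Ties are resolved in favour of the smaller set.
    """
    limit = len(ranked_markers) if max_size is None else min(max_size, len(ranked_markers))
    results: Dict[int, float] = {}
    for k in range(1, limit + 1):
        results[k] = evaluate_auc(list(ranked_markers[:k]))
    best_k = max(results, key=results.get)  # first maximum, so the smallest tied k
    return best_k, results
```

Returning the full `results` mapping alongside the best size makes it possible to plot AUC against set size.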

Validation of Final 5 GSM Set

\label{sec:meth-summ-validation}

The signature set of 5 GSMs (see Table \ref{tab:top5-snps-t1d}) was finally tested on the validation group (982 T1D cases, 729 NBS controls) using structure, followed by an AUC analysis of the Q values. There is a small overlap between some T1D cases and some NBS controls (Figure \ref{fig:t1d-validation-structure-top5}), but most T1D cases cluster together, and are separate from the cluster of NBS controls.

\begin{table}
\caption{\label{tab:top5-snps-t1d}Location information for the top 5 GSMs discovered in a bootstrap sub-sampled GWAS for T1D associations, after removing linked GSMs, and choosing the set with the highest AUC value. Mean rank reported in this table is based on the marker rank for 100 bootstrap sub-samples. Out of the five markers, four are within a 2 Mb region of chromosome 6.}
\begin{tabular}{lrrrr}
\hline
Marker & Chromosome & Location (bp) & \(\chi^{2}\) & Mean Rank \\
\hline
rs9273363 & 6 & 32734250 & 485 & 1 \\
rs3957146 & 6 & 32789508 & 317 & 2.2 \\
rs3135377 & 6 & 32493377 & 264 & 4.3 \\
rs7431934 & 3 & 40268801 & 199 & 13.7 \\
rs1046089 & 6 & 31710946 & 108 & 37.9 \\
\hline
\end{tabular}
\end{table}

The AUC value associated with this test of the signature set of 5 GSMs in the validation group was 0.8395. Setting the false positive rate to 5% (cutoff Q value 0.129) produced a true positive rate of 43%, while setting the true positive rate to 85% (cutoff Q value 0.5583) produced a false positive rate of 38%. The position on the curve nearest to a true positive rate of 100% and a false positive rate of 0% was when the cutoff Q value was set at 0.506, with a true positive rate of 78%, and a false positive rate of 29%.
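The operating points reported above can be read off a list of ROC points. The sketch below assumes each point is a `(false positive rate, true positive rate, cutoff)` tuple and that "nearest" means the smallest Euclidean distance to the ideal corner (false positive rate 0, true positive rate 1); the source does not state the metric, so that choice and the function names are assumptions.

```python
from math import hypot
from typing import Optional, Sequence, Tuple

RocPoint = Tuple[float, float, float]  # (false positive rate, true positive rate, cutoff)


def closest_to_perfect(points: Sequence[RocPoint]) -> RocPoint:
    """ROC point with the smallest Euclidean distance to (FPR 0, TPR 1)."""
    return min(points, key=lambda p: hypot(p[0], 1.0 - p[1]))


def best_tpr_at_fpr(points: Sequence[RocPoint], max_fpr: float) -> Optional[RocPoint]:
    """Highest-TPR point whose FPR does not exceed max_fpr, or None if there is none."""
    eligible = [p for p in points if p[0] <= max_fpr]
    return max(eligible, key=lambda p: p[1]) if eligible else None


# Toy ROC points for illustration only; these are not the study's values.
toy_points = [
    (0.0, 0.0, 0.1),
    (0.1, 0.6, 0.3),
    (0.3, 0.8, 0.5),
    (1.0, 1.0, float("inf")),
]
print(closest_to_perfect(toy_points))     # (0.3, 0.8, 0.5)
print(best_tpr_at_fpr(toy_points, 0.10))  # (0.1, 0.6, 0.3)
```

Fixing the true positive rate instead works the same way with the roles of the two rates swapped.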