Authorea

David Andrew Eccles edited urlstylerm___hyperse.tex almost 9 years ago

Commit id: 0934b017e0ab6272d664850dfe699d7494e0df86

rate of 65\%, and a false positive rate of 35\%. These results indicate that the signature SNP set discovered in the present study is considerably more informative than a set of T1D-associated SNPs found in other genome-wide association studies.

\section{Discussion} \label{sec:sig-thy-disc}

This study has identified a group of 5 SNPs that classify individuals with T1D with good reliability (AUC = 0.84, see Figure~\ref{fig:t1d-validation-top5-ROC-analysis}). The heritability of Type 1 Diabetes is around 88\% \cite{hyttinen03}, so the maximum possible sensitivity (true positive rate) of a genetic test for T1D should be 88\%, with the remaining 12\% of variation being due to non-genetic factors.

One of the assumptions made in GWAS is that the individuals selected as candidates for the phenotypic groups (cases and controls) are ideal members of those groups -- affection status tends to be a binary or integer value that does not allow for intermediate values. Due to the difficulty in qualitatively describing traits, as well as mutation and admixture effects (particularly for population-derived groups), this assumption may not hold.

The marker construction method used a bootstrapping procedure as an internal validation to remove markers that had substantial variation in $\chi^2$ values within the tested groups. In an ideal case, a bootstrapping procedure would not be necessary, as the genetic makeup of any given subgroup would reflect the makeup of the total population. In such a case, the ranking after each bootstrap would be the same as the overall ranking. However, the comparison of minimum and maximum rankings for SNPs across all bootstrap sub-samples has demonstrated that this is clearly not the case (see Section~\ref{sec:sig-thy-bootstrapping}).
% banding -- probably more due to discrete genotypes, rather than
% actual variation. tests with more SNPs (not shown) display values
% with fewer gaps.
\subsection{Type 1 Diabetes Study Results} \label{sec:sig-thy-disc-results}

It is known that genetic variation within the HLA region on chromosome 6 plays an important role in T1D, accounting for about 50\% of the genetic susceptibility for T1D \cite[see][]{daneman06}. This role is supported by the preliminary results in the present study, which show consistently strong predictive power using genetic markers of which all but one lie within this region (see Table~\ref{tab:top5-snps-t1d}).

\subsubsection{Accuracy of the Signature SNP Set} \label{sec:t1d-disc-accur-sign-snp}

The interpretation of accuracy of a genetic test is difficult, particularly when considering what would be expected if the test were used in an untested population. A statistic that can be useful in this case is the positive predictive value (how likely it is that an individual has the trait, given a positive test result). In order to determine the positive predictive value of a test, it is necessary to establish the prevalence of the trait in the population of individuals who are to be tested.

Finland, a country considered to have a very high incidence of T1D, has an overall cumulative incidence of around 0.5--0.6\% by the age of 35 years \cite{hyttinen03}. Also, there has been a general trend of a 2--3\% increase in the incidence rate of childhood T1D in South West England over the past 20--30 years, with the incidence in 2003 at around 0.16\% per year \cite{zhao03}. Even at the higher incidence rate in Finland, fewer than 0.6\% of individuals in a typical non-enriched control population would be expected to have T1D.

The NBS controls for the WTCCC study had not been enriched to remove individuals that have T1D. Given a prevalence of T1D of 0.6\%, around 4 individuals from the validation NBS control group (or 9 from the discovery and validation groups combined) would be expected to have T1D. Setting the false positive error rate to this value (i.e.\ 0.6\%) is unrealistic for the current data set, as only a small fraction of T1D cases would be identified with that cutoff (just over 5\%, see Figure~\ref{fig:t1d-validation-top5-ROC-analysis}). However, if a more moderate 5\% false positive error rate is accepted (identifying 43\% of T1D cases, see Section~\ref{sec:meth-summ-validation}), then 36 NBS individuals would be identified by this test as at risk for T1D. This is roughly ten times the number expected from cumulative incidence rates for T1D, indicating a positive predictive value of around 10\% for the discovered signature set of 5 SNPs. Given that the population prevalence of T1D is so low, the NBS control group should not differ substantially from an enriched control group, and the positive predictive value of this genetic test will remain around 10\%.

\subsubsection{Accuracy in Other Populations} \label{sec:t1d-disc-accur-other-pops}

The low positive predictive value of the marker set, together with heritability values of less than 100\%, means that it is unlikely that a genetic test using these T1D markers would be useful as a \emph{diagnostic} test for a general population. However, if used in conjunction with other clinical indicators, it may be appropriate to use these genetic markers for a \emph{screening} test, identifying individuals who should be more closely monitored for T1D symptoms. Such a test would still exclude a large proportion of the normal population, while also identifying a high proportion of at-risk individuals. However, the signature SNP set has not been validated in groups of individuals outside the WTCCC study, and caution should be taken in attempting to extrapolate results to non-validated populations.

Taken in the context of disease, it can be very difficult to accurately determine the phenotype of an individual -- this is a particular problem when the disease is a continuous (rather than discrete) trait, as often happens with common complex diseases.
Phenotype identification is further complicated by non-Mendelian patterns of inheritance. There may be numerous paths to the same apparent end disease, and numerous gene-gene interactions that contribute to the same disease. Furthermore, trait variation is often a mixture of genetic and environmental factors (i.e.\ heritability is less than 100\%), so potential gene-environment interactions also need to be taken into account when describing phenotype.

The measured effectiveness of any given set of markers will be reduced by false positive results that are themselves erroneous (i.e.\ some of the individuals counted as false positives will later turn out to have T1D). If the marker set is constructed to remove as many false positive results as possible, the result may be a refined test that is over-fitted to the initial discovery group of case and control individuals, and is not reliably generalisable to other populations. It is possible that such situations would become apparent when follow-up studies on independent case/control groups for the same trait are carried out, and it is recommended that such validations are carried out before using this signature SNP set.

\subsection[Overfitting]{Overfitting Generates Spurious Associations} \label{sec:overfitting}

For a genetic association study to be successful, individuals must be separable into distinct groups based on a particular phenotype, and some differences between the groups must be attributable to genetic factors. Methods for identifying associated markers in a GWAS rely on a clear distinction between trait and non-trait individuals. In situations where the trait of interest is not easy to classify, an associated marker may not reflect the true distinction between those groups.
In addition, a low genetic influence for the expression of a particular trait can mean that even when a trait can be classified completely, the genetic component of that trait (the only component able to be identified by any DNA marker-based method) will not always determine the observed phenotype completely.

Overfitting\index{overfitting} is the generation of a set of distinctive parameters that relies on irrelevant attributes for the model being observed. The problem exists when vital information about the model is missing, and the discovery algorithm ends up being required to derive a model based on other spurious distinctions between discovery groups \cite[see][Chapter 14, pp.~661--663]{russell2003}.

Overfitting is applicable to the case of generating minimal marker sets because any such method assumes that a minimal set can be found for the data. Unless cases and controls are genetically distinct, and distinct \emph{only} because of the trait under test, any resultant marker set will be invalid. In such a situation, the set of markers generated is informative only for the specific group of individuals that were used for discovery of that set of markers, and will not be applicable for individuals outside the discovery group. Internal validation within groups, and external validation of results in similar populations, are essential to ensure that overfitting has not occurred.

Bootstrap sub-sampling uses variance among group sub-samples to remove markers that are associated because of \emph{genetic chance} effects rather than the particular phenotype under test. However, it cannot distinguish between genetic differences due to the tested phenotype and genetic differences due to sampling bias. The problem of overfitting is especially relevant for genetic data, where one pattern of genotypes due to a group-associated factor with high heritability may outweigh the disease-causing factor under test.
This is similar to the population stratification problem that has been discussed by \citet{pritchard1999} and \citet{pritchard01}, who note that, due to the influence of \emph{genetic chance} (e.g.\ genetic drift, founder effects, non-random mating), alleles can appear with high frequency differences between groups within a given population sample even though the differences are not directly associated with the trait of interest. This is particularly important when a population group has a high incidence of a given disease, and the genetic history of the case and/or control subgroups is not known. \citet{pritchard01} recommend testing for structured association in case and control groups before carrying out further association tests in order to remove confounding genetic factors that may be present in a case/control study.

\subsubsection{Genome-wide Trait Contributions} \label{sec:sig-thy-disc-genome}

While there may be many gene-gene interactions throughout the genome that all contribute to a particular disease, it is unlikely that \emph{all} genetic variants in the subgroup will influence the trait. In addition, some variants may influence the trait more than others and in some cases may even negate the effects of another variant. Both of these factors increase the potential for spurious associations and false positive results when carrying out a whole genome scan.

Genotyping carried out in an association study is restricted to a subset of the total genome, because full-genome sequencing is still prohibitively expensive. Also, only a subset of interactions between multiple genetic factors can be studied (if any), because multi-factorial analysis is computationally expensive.\footnote{It has an exponential complexity with respect to the number of factors studied in tandem.}

It is expected that any reduction of SNP set size will result in decreased reliability, as there is an information loss when fewer markers are typed.
For a reduction method to be useful, the information lost due to typing fewer markers must be compensated by cost reduction. However, in this investigation, the opposite appears to be true -- a small number of markers are useful for distinguishing the case and control groups, and appear to provide more information than a full genome set.

\subsubsection{Interactions from Multiple Genetic Variants} \label{sec:sig-thy-disc-mult}

In some cases, a first-pass single association analysis of markers will not be useful for the classification of a trait. This will be the case for traits that have complex interactions that result in non-linear association patterns between marker frequency and trait prevalence. As an example of a complex interaction, two causative variants may interact in a neutralising fashion (i.e.\ the effects of one variant are cancelled out by another variant). In this sort of case, a simple one-way association test would not work as expected, showing no association even when a strong underlying signal exists \cite{pickrell07}. Other non-linear interactions between different markers would also reduce the effectiveness of an association test to determine informative markers.

The ideal situation for investigating complex traits at a genetic level is an analysis of the effectiveness of \emph{every possible} set of marker interactions. Once such an analysis is carried out, the best set of markers will be identified as being the set that is most informative for classifying individuals into groups. However, the computational requirements for such testing, combined with the increased danger of overfitting due to small cell sizes, make such an analysis effectively useless when carried out on the total marker set \cite[see][]{province08}.

The bootstrapping approach as outlined here does not consider combinations of genetic markers. However, it provides an efficient way to reduce a large set of markers down to a much smaller set.
This smaller set can then be used by programs that determine multi-way interactions, which are typically computationally expensive procedures.
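
To give a rough sense of the scale of this reduction, the following count is an illustrative sketch only. The symbols $n$ (number of typed markers), $k$ (number of markers considered jointly) and $C(n, k)$ (number of distinct $k$-marker subsets) are introduced here for illustration, and the value $n = 500\,000$ is an assumed round figure for a genome-wide SNP panel rather than a number taken from this study.
\[
  C(n, k) = \frac{n!}{k!\,(n-k)!},
  \qquad
  \sum_{k=1}^{n} C(n, k) = 2^{n} - 1 .
\]
Under that assumption, testing pairs of markers alone ($k = 2$) would require $C(500\,000, 2) = 500\,000 \times 499\,999 / 2 \approx 1.25 \times 10^{11}$ tests, whereas a reduced set of $n = 5$ markers has only $2^{5} - 1 = 31$ non-empty subsets in total, of which $C(5, 2) = 10$ are pairs. This counts marker subsets only; it does not count genotype combinations within each subset, which would increase both figures.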