David Andrew Eccles edited urlstylerm___hyperse.tex  almost 9 years ago

Commit id: 0934b017e0ab6272d664850dfe699d7494e0df86


rate of 65\%, and a false positive rate of 35\%. These results
indicate that the signature SNP set discovered in the present study is
considerably more informative than a set of T1D-associated SNPs found
in other genome-wide association studies.

\section{Discussion}
\label{sec:sig-thy-disc}

This study has identified a group of 5 SNPs that classify individuals
with T1D with good reliability (AUC = 0.84, see
Figure~\ref{fig:t1d-validation-top5-ROC-analysis}). The heritability
of Type 1 Diabetes is around 88\% \cite{hyttinen03}, so the maximum
possible sensitivity (true positive rate) of a genetic test for T1D
should be 88\%, with the remaining 12\% of variation being due to
non-genetic factors.

One of the assumptions made in GWAS is that the individuals selected
as candidates for the phenotypic groups (cases and controls) are ideal
members of those groups -- affection status tends to be a binary or
integer value that does not allow for intermediate values. Due to the
difficulty in qualitatively describing traits, as well as mutation and
admixture effects (particularly for population-derived groups), this
assumption may be invalidated.

The marker construction method used a bootstrapping procedure as an
internal validation to remove markers that had substantial variation
in $\chi^2$ values within the tested groups. In an ideal case, a
bootstrapping procedure would not be necessary, as the genetic makeup
of the total population would reflect the makeup of any given subgroup
of that population. In such a case, the ranking after each bootstrap
would be the same as the overall ranking. However, the comparison of
minimum and maximum rankings for SNPs across all bootstrap sub-samples
has demonstrated that this is clearly not the case (see
Section~\ref{sec:sig-thy-bootstrapping}).

% banding -- probably more due to discrete genotypes, rather than
% actual variation. tests with more SNPs (not shown) display values
% with fewer gaps.
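
The rank comparison described above can be written compactly. This is
only an illustrative sketch: the symbols $r_{i,b}$, $B$ and $\Delta_i$
are notation assumed here and are not taken from the method
description. Let $r_{i,b}$ be the rank of SNP $i$ by $\chi^2$ value in
bootstrap sub-sample $b$, for $b = 1, \ldots, B$. The spread of ranks
for that SNP is then
\[
  \Delta_i = \max_{b} r_{i,b} - \min_{b} r_{i,b} .
\]
In the ideal case every sub-sample reproduces the overall ranking, so
$\Delta_i = 0$ for every SNP. A large $\Delta_i$ marks a SNP whose
apparent association depends on which individuals happen to be
sampled.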
\subsection{Type 1 Diabetes Study Results}
\label{sec:sig-thy-disc-results}

It is known that genetic variation within the HLA region on chromosome
6 plays an important role in T1D, accounting for about 50\% of the
genetic susceptibility for T1D \cite[see][]{daneman06}. This role is
supported by the preliminary results in the present study, which show
consistently strong predictive power using genetic markers, all but
one of which come from this region (see
Table~\ref{tab:top5-snps-t1d}).

\subsubsection{Accuracy of the Signature SNP Set}
\label{sec:t1d-disc-accur-sign-snp}

The interpretation of accuracy of a genetic test is difficult,
particularly when considering what would be expected if the test were
used in an untested population. A statistic that can be useful in this
case is the positive predictive value (how likely an individual is to
have the trait, given a positive test result).

In order to determine the positive predictive value of a test, it is
necessary to establish the prevalence of the trait in the population
of individuals who are to be tested. A country which is considered to
have a very high incidence of T1D, Finland, has an overall cumulative
incidence of around 0.5--0.6\% at the age of 35 years
\cite{hyttinen03}. Also, there has been a general trend of a 2--3\%
increase in the incidence rate of childhood T1D in South West England
over the past 20--30 years, with the incidence in 2003 at around
0.16\% per year \cite{zhao03}. Even at the higher incidence rate in
Finland, fewer than 0.6\% of individuals in a typical non-enriched
control population would be expected to have T1D.

The NBS controls for the WTCCC study had not been enriched to remove
individuals who have T1D. Given an expected prevalence of T1D of
0.6\%, it would be expected that around 4 individuals from the
validation NBS control group (or 9 from the discovery and validation
groups combined) have T1D. Setting the false positive error rate to
this value (i.e.\ 0.6\%) is unrealistic for the current data set, as
only a small fraction of T1D cases would be identified with that
cutoff (just over 5\%, see
Figure~\ref{fig:t1d-validation-top5-ROC-analysis}). However, if a more
moderate 5\% false positive error rate is accepted (identifying 43\%
of T1D cases, see Section~\ref{sec:meth-summ-validation}), then 36 NBS
individuals would be identified by this test as at risk for T1D. This
is about ten times the number expected from cumulative incidence rates
for T1D (36 identified against around 4 expected), indicating a
positive predictive value of around 10\% with the discovered signature
set of 5 SNPs. This is an upper estimate, since it assumes that every
expected T1D case in the group is among those identified. Given that
the population prevalence of T1D is so low, the NBS control group
should not differ substantially from an enriched control group, and
the positive predictive value of this genetic test will remain around
10\%.

\subsubsection{Accuracy in Other Populations}
\label{sec:t1d-disc-accur-other-pops}

The low positive predictive value of the marker set, together with
heritability values of less than 100\%, means that it is unlikely that
a genetic test using these T1D markers would be useful as a
\emph{diagnostic} test for a general population. However, if used in
conjunction with other clinical indicators, it may be appropriate to
use these genetic markers for a \emph{screening} test, identifying
individuals who should be more closely monitored for T1D symptoms.
This is because it will still exclude a large proportion of the normal
population, while also identifying a high proportion of at-risk
individuals. However, the signature SNP set has not been validated in
groups of individuals outside the WTCCC study, and caution should be
taken in attempting to extrapolate results to non-validated
populations.

Taken in the context of disease, it can be very difficult to
accurately determine the phenotype of an individual -- this is a
particular problem when the disease is a continuous (rather than
discrete) trait, as often happens with common complex diseases.
Phenotype identification is further complicated by non-Mendelian
patterns of inheritance. It is possible for there to be numerous paths
to the same apparent end disease, and numerous gene-gene interactions
that contribute to the same disease. Furthermore, trait variation is
often a mixture of genetic and environmental factors (i.e.\
heritability is less than 100\%), so potential gene-environment
interactions also need to be taken into account when describing
phenotype.

The effectiveness of any given set of markers will be reduced due to
the presence of erroneous false positive results (i.e.\ some of the
false positives will later turn out to have T1D). In a situation where
the marker set is constructed to remove as many false positive results
as possible, this may result in a refined test that is over-fitted to
the initial discovery group of case and control individuals, and is
not reliably generalisable to other populations. It is possible that
such situations would be apparent when follow-up studies on
independent case/control groups for the same trait are carried out,
and it is recommended that such validations are carried out before
using this signature SNP set.

\subsection[Overfitting]{Overfitting Generates Spurious Associations}
\label{sec:overfitting}

For a genetic association study to be successful, individuals must be
separable into distinct groups based on a particular phenotype, and
some differences between the groups must be attributable to genetic
factors. Methods for identifying associated markers in a GWAS rely
on a clear distinction between trait and non-trait individuals. In
situations where the trait of interest is not easy to classify, an
associated marker may not reflect the true distinction between those
groups.
In addition, a low genetic influence for the expression of a
particular trait can mean that even when a trait can be classified
completely, the genetic component of that trait (the only component
able to be identified by any DNA marker-based method) will not always
determine the observed phenotype completely.

Overfitting\index{overfitting} is the generation of a set of
distinctive parameters that relies on irrelevant attributes for the
model being observed. The problem exists when vital information about
the model is missing, and the discovery algorithm ends up being
required to derive a model based on other spurious distinctions
between discovery groups \cite[see][Chapter 14,
pp.~661--663]{russell2003}. Overfitting is applicable to the case of
generating minimal marker sets because any such method assumes that a
minimal set can be found for the data. Unless cases and controls are
genetically distinct, and distinct \emph{only} due to the trait under
test, any resultant marker set will be invalid. In such a situation,
the set of markers generated is informative only for the specific
group of individuals that were used for discovery of that set of
markers, and will not be applicable for individuals outside the
discovery group. Internal validation within groups, and external
validation of results in similar populations, are essential to ensure
that overfitting has not occurred.

Bootstrap sub-sampling uses variance among group sub-samples to remove
markers that are associated because of \emph{genetic chance} effects
rather than the particular phenotype under test. However, it cannot
distinguish between genetic differences due to the tested phenotype
and genetic differences due to sampling bias. The problem of
overfitting is especially relevant for genetic data, where one pattern
of genotypes due to a group-associated factor with high heritability
may outweigh the disease-causing factor under test.
This is similar to
the population stratification problem that has been discussed by
\citet{pritchard1999} and \citet{pritchard01}, who say that due to the
influence of \emph{genetic chance} (e.g.\ genetic drift, founder
effects, non-random mating), alleles can appear with high frequency
differences between groups within a given population sample even
though the differences are not directly associated with the trait of
interest. This is particularly important when a population group has a
high incidence of a given disease, and the genetic history of the case
and/or control subgroups is not known. \citet{pritchard01} recommend
testing for structured association in case and control groups before
carrying out further association tests in order to remove confounding
genetic factors that may be present in a case/control study.

\subsubsection{Genome-wide Trait Contributions}
\label{sec:sig-thy-disc-genome}

While there may be many gene-gene interactions throughout the genome
that all contribute to a particular disease, it is unlikely that
\emph{all} genetic variants in the subgroup will influence the trait.
In addition, some variants may influence the trait more than others
and in some cases may even negate the effects of another variant. Both
of these factors increase the potential for spurious associations and
false positive results when carrying out a whole genome scan.

Genotyping carried out in an association study is restricted to a
subset of the total genome, because full-genome sequencing is still
prohibitively expensive. Also, only a subset of interactions between
multiple genetic factors can be studied (if any), because
multi-factorial analysis is computationally expensive.\footnote{It has
an exponential complexity with respect to the number of factors
studied in tandem.}

It is expected that any reduction of SNP set size will result in
decreased reliability, as there is an information loss when fewer
markers are typed.
For a reduction method to be useful, the
information lost due to typing fewer markers must be compensated by
cost reduction. However, in this investigation, the opposite appears
to be true -- a small number of markers are useful to distinguish the
case and control groups, and appear to provide more information than a
full genome set.

\subsubsection{Interactions from Multiple Genetic Variants}
\label{sec:sig-thy-disc-mult}

In some cases, a first-pass single association analysis of markers
will not be useful for the classification of a trait. This will be the
case for traits that have complex interactions that result in
non-linear association patterns between marker frequency and trait
prevalence. As an example of a complex interaction, two causative
variants may interact in a neutralising fashion (i.e.\ the effects of
one variant are cancelled out by another variant). In this sort of
case, a simple one-way association test would not work as expected,
showing no observed association even when there is a strong
signal \cite{pickrell07}. Other non-linear interactions between
different markers would also reduce the effectiveness of an
association test to determine informative markers.

The ideal situation for investigating complex traits at a genetic
level is an analysis of the effectiveness of \emph{every possible} set
of marker interactions. Once such an analysis is carried out, the best
set of markers will be identified as being the set that is most
informative for classifying individuals into groups. However, the
computational requirements for such testing, combined with the
increased danger of overfitting due to small cell sizes, make such an
analysis effectively useless when carried out on the total marker set
\cite[see][]{province08}.

The bootstrapping approach as outlined here does not consider
combinations of genetic markers. However, it provides an efficient way
to reduce a large set of markers down to a much smaller set.
This  smaller set can then be used by programs that determine multi-way  interactions, which are typically computationally expensive  procedures.
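
As a rough illustration of the scale involved, the number of distinct
$k$-way marker combinations among $m$ markers is
\[
  N(m, k) = \frac{m!}{k!\,(m-k)!} .
\]
The panel sizes used below are assumed round figures chosen for
illustration, and are not values reported in this study. For pairwise
interactions ($k = 2$), a genome-wide panel of $500\,000$ markers gives
$N(500\,000, 2) = 124\,999\,750\,000 \approx 1.25 \times 10^{11}$
combinations, whereas a reduced set of 100 markers gives
$N(100, 2) = 4950$ and a set of 5 markers gives $N(5, 2) = 10$.
Reducing the marker set before a multi-way analysis therefore shrinks
the search space by many orders of magnitude.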