David Andrew Eccles edited urlstylerm___hyperse.tex
almost 9 years ago
Commit id: 0934b017e0ab6272d664850dfe699d7494e0df86
diff --git a/urlstylerm___hyperse.tex b/urlstylerm___hyperse.tex
index 395a6d2..6cce4d4 100644
--- a/urlstylerm___hyperse.tex
+++ b/urlstylerm___hyperse.tex
...
rate of 65\%, and a false positive rate of 35\%. These results
indicate that the signature SNP set discovered in the present study is
considerably more informative than a set of T1D-associated SNPs found
in other genome-wide association studies.
\section{Discussion}
\label{sec:sig-thy-disc}
This study has identified a group of 5 SNPs that classify individuals
with T1D with good reliability (AUC = 0.84, see
Figure~\ref{fig:t1d-validation-top5-ROC-analysis}). The heritability
of Type 1 Diabetes is around 88\% \cite{hyttinen03}, so the maximum
possible sensitivity (true positive rate) of a genetic test for T1D
should be 88\%, with the remaining 12\% of variation being due to
non-genetic factors.

One of the assumptions made in GWAS is that the individuals selected
as candidates for the phenotypic groups (cases and controls) are ideal
members of those groups -- affection status tends to be a binary or
integer value that does not allow for intermediate values. Due to the
difficulty in qualitatively describing traits, as well as mutation and
admixture effects (particularly for population-derived groups), this
assumption may be invalidated.

The marker construction method used a bootstrapping procedure as an
internal validation to remove markers that had substantial variation
in $\chi^2$ values within the tested groups. In an ideal case, a
bootstrapping procedure would not be necessary as the genetic makeup
of the total population will reflect the makeup of any given subgroup
of that population. In such a case, the ranking after each bootstrap
will be the same as the overall ranking. However, the comparison of
minimum and maximum rankings for SNPs across all bootstrap sub-samples
has demonstrated that this is clearly not the case (see
Section~\ref{sec:sig-thy-bootstrapping}).
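
As a sketch of this comparison (the notation is introduced here for
illustration and is not part of the method description), let
$r_i^{(b)}$ be the rank of SNP $i$ by $\chi^2$ value in bootstrap
sub-sample $b$, for $b = 1, \ldots, B$. The spread of rankings for that
SNP can then be written as
\[
  \Delta_i = \max_{b} r_i^{(b)} - \min_{b} r_i^{(b)} .
\]
In the ideal case, $r_i^{(b)}$ is the same for every $b$, so
$\Delta_i = 0$ for all SNPs. A marker with a large $\Delta_i$ is one
whose apparent association depends strongly on which individuals happen
to be sampled.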
% banding -- probably more due to discrete genotypes, rather than
% actual variation. tests with more SNPs (not shown) display values
% with fewer gaps.
\subsection{Type 1 Diabetes Study Results}
\label{sec:sig-thy-disc-results}
It is known that genetic variation within the HLA region on chromosome
6 plays an important role in T1D, accounting for about 50\% of the
genetic susceptibility for T1D \cite[see][]{daneman06}. This role is
supported by the preliminary results in the present study, which show
consistently strong predictive power using a set of genetic markers in
which all but one lie within this region (see
Table~\ref{tab:top5-snps-t1d}).
\subsubsection{Accuracy of the Signature SNP Set}
\label{sec:t1d-disc-accur-sign-snp}
The interpretation of accuracy of a genetic test is difficult,
particularly when considering what would be expected if the test were
used in an untested population. A statistic that can be useful in this
case is the positive predictive value (how likely it is that an
individual truly has the trait, given a positive test result).

In order to determine the positive predictive value of a test, it is
necessary to establish the prevalence of the trait in the population
of individuals who are to be tested. Finland, which is considered to
have a very high incidence of T1D, has an overall cumulative
incidence of around 0.5--0.6\% at the age of 35 years
\cite{hyttinen03}. Also, there has been a general trend of a 2--3\%
increase in the incidence rate of childhood T1D in South West England
over the past 20--30 years, with the incidence in 2003 at around 0.16\%
per year \cite{zhao03}. Even at the higher incidence rate in Finland,
no more than about 0.6\% of individuals in a typical non-enriched
control population would be expected to have T1D.

The NBS controls for the WTCCC study had not been enriched to remove
individuals who have T1D. Given an expected prevalence of T1D of
0.6\%, it would be expected that around 4 individuals from the
validation NBS control group (or 9 from the discovery and validation
groups combined) have T1D. Setting the false positive error rate to
this value (i.e.\ 0.6\%) is unrealistic for the current data set, as
only a small fraction of T1D cases would be identified with that
cutoff (just over 5\%, see
Figure~\ref{fig:t1d-validation-top5-ROC-analysis}). However, if a more
moderate 5\% false positive error rate is accepted (identifying 43\%
of T1D cases, see Section~\ref{sec:meth-summ-validation}), then 36 NBS
individuals would be identified by this test as at risk for T1D. This
is about ten times the number expected from cumulative incidence rates
for T1D, indicating a positive predictive value of around 10\% with the
discovered signature set of 5 SNPs. Given that the population
prevalence of T1D is so low, the NBS control group should not differ
substantially from an enriched control group, and the positive
predictive value of this genetic test will remain around 10\%.
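
As a rough cross-check, the positive predictive value can also be
sketched with Bayes' rule. The symbols are introduced here for
illustration only: $s$ is the sensitivity, $f$ the false positive rate,
and $p$ the prevalence, taking the values quoted above ($s = 0.43$,
$f = 0.05$, $p = 0.006$) and assuming that the sensitivity estimated in
cases also applies to individuals with T1D in the control group:
\[
  \mathrm{PPV} = \frac{s\,p}{s\,p + f\,(1 - p)}
  = \frac{0.43 \times 0.006}{0.43 \times 0.006 + 0.05 \times 0.994}
  \approx 0.049 .
\]
Under these assumptions the value is closer to 5\%, which suggests that
the 10\% figure is best read as an upper bound: it corresponds to the
case where every individual with T1D in the control group is among
those identified by the test.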
\subsubsection{Accuracy in Other Populations}
\label{sec:t1d-disc-accur-other-pops}
The low positive predictive value of the marker set, together with
heritability values of less than 100\%, means that it is unlikely that
a genetic test using these T1D markers would be useful as a
\emph{diagnostic} test for a general population. However, if used in
conjunction with other clinical indicators, it may be appropriate to
use these genetic markers for a \emph{screening} test, identifying
individuals that should be more closely monitored for T1D symptoms.
This is because it will still exclude a large proportion of the normal
population, while also identifying a high proportion of at-risk
individuals. However, the signature SNP set has not been validated in
groups of individuals outside the WTCCC study, and caution should be
taken in attempting to extrapolate results to non-validated
populations.

Taken in the context of disease, it can be very difficult to
accurately determine the phenotype of an individual -- this is a
particular problem when the disease is a continuous (rather than
discrete) trait, as often happens with common complex diseases.
Phenotype identification is further complicated by non-Mendelian
patterns of inheritance. It is possible for there to be numerous paths
to the same apparent end disease, and numerous gene-gene interactions
that contribute to the same disease. Furthermore, trait variation is
often a mixture of genetic and environmental factors
(i.e.\ heritability is less than 100\%), so potential gene-environment
interactions also need to be taken into account when describing
phenotype.

The effectiveness of any given set of markers will be reduced due to
the presence of erroneous false positive results (i.e.\ some of the
false positives will later turn out to have T1D). In a situation where
the marker set is constructed to remove as many false positive results
as possible, this may result in a refined test that is over-fitted to
the initial discovery group of case and control individuals, and is
not reliably generalisable to other populations. It is possible that
such situations would be apparent when follow-up studies on
independent case/control groups for the same trait are carried out,
and it is recommended that such validations are carried out before
using this signature SNP set.
\subsection[Overfitting]{Overfitting Generates Spurious Associations}
\label{sec:overfitting}
For a genetic association study to be successful, individuals must be
separable into distinct groups based on a particular phenotype, and
some differences between the groups must be attributable to genetic
factors. Methods for identifying associated markers in a GWAS rely
on a clear distinction between trait and non-trait individuals. In
situations where the trait of interest is not easy to classify, an
associated marker may not reflect the true distinction between those
groups. In addition, a low genetic influence for the expression of a
particular trait can mean that even when a trait can be classified
completely, the genetic component of that trait (the only component
able to be identified by any DNA marker-based method) will not always
determine the observed phenotype completely.

Overfitting\index{overfitting} is the generation of a set of
distinctive parameters that relies on irrelevant attributes for the
model being observed. The problem exists when vital information about
the model is missing, and the discovery algorithm ends up being
required to derive a model based on other spurious distinctions
between discovery groups
\cite[see][Chapter 14, pp.~661--663]{russell2003}.
Overfitting is applicable to the case of
generating minimal marker sets because any such method assumes that a
minimal set can be found for the data. Unless cases and controls are
genetically distinct, and distinct \emph{only} because of the trait
under test, any resultant marker set will be invalid. In such a situation,
the set of markers generated is informative only for the specific
group of individuals that were used for discovery of that set of
markers, and will not be applicable for individuals outside the
discovery group. Internal validation within groups, and external
validation of results in similar populations, are both essential to
ensure that overfitting has not occurred.

Bootstrap sub-sampling uses variance among group sub-samples to remove
markers that are associated because of \emph{genetic chance} effects
rather than the particular phenotype under test. However, it cannot
distinguish between genetic differences due to the tested phenotype
and genetic differences due to sampling bias. The problem of
overfitting is especially relevant for genetic data, where one pattern
of genotypes due to a group-associated factor with high heritability
may outweigh the disease-causing factor under test. This is similar to
the population stratification problem discussed by
\citet{pritchard1999} and \citet{pritchard01}, who note that
\emph{genetic chance} (e.g.\ genetic drift, founder effects,
non-random mating) can produce large allele frequency
differences between groups within a given population sample even
though the differences are not directly associated with the trait of
interest. This is particularly important when a population group has a
high incidence of a given disease, and the genetic history of the case
and/or control subgroups is not known. \citet{pritchard01} recommend
testing for structured association in case and control groups before
carrying out further association tests in order to remove confounding
genetic factors that may be present in a case/control study.
\subsubsection{Genome-wide Trait Contributions}
\label{sec:sig-thy-disc-genome}
While there may be many gene-gene interactions throughout the genome
that all contribute to a particular disease, it is unlikely that
\emph{all} genetic variants in the subgroup will influence the trait.
In addition, some variants may influence the trait more than others
and in some cases may even negate the effects of another variant. Both
of these factors increase the potential for spurious associations and
false positive results when carrying out a whole genome scan.

Genotyping carried out in an association study is restricted to a
subset of the total genome, because full-genome sequencing is still
prohibitively expensive. Also, only a subset of interactions between
multiple genetic factors can be studied (if any), because
multi-factorial analysis is computationally expensive.\footnote{It has
an exponential complexity with respect to the number of factors
studied in tandem.}
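
To give a sense of scale (a sketch only; $n$ and $k$ are symbols
introduced here, and the marker count is an illustrative assumption
rather than a figure taken from this study), the number of distinct
$k$-way marker combinations that can be formed from $n$ typed markers
is
\[
  \frac{n!}{k!\,(n-k)!} \approx \frac{n^k}{k!}
  \quad \mbox{for } k \ll n .
\]
With $n = 5 \times 10^5$ markers this gives roughly
$1.25 \times 10^{11}$ pairs ($k = 2$) and $2.1 \times 10^{16}$ triples
($k = 3$), so moving from $k$ to $k + 1$ factors multiplies the number
of tests by roughly $n / (k + 1)$.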

It is expected that any reduction of SNP set size will result in
decreased reliability, as there is an information loss when fewer
markers are typed. For a reduction method to be useful, the
information lost due to typing fewer markers must be compensated by
cost reduction. However, in this investigation, the opposite appears
to be true -- a small number of markers are useful to distinguish the
case and control groups, and appear to provide more information than a
full genome set.
\subsubsection{Interactions from Multiple Genetic Variants}
\label{sec:sig-thy-disc-mult}
In some cases, a first-pass single association analysis of markers
will not be useful for the classification of a trait. This will be the
case for traits that have complex interactions that result in
non-linear association patterns between marker frequency and trait
prevalence. As an example of a complex interaction, two causative
variants may interact in a neutralising fashion (i.e.\ the effects of
one variant are cancelled out by another variant). In this sort of
case, a simple one-way association test would not work as expected:
it could show no association even when there is a strong underlying
signal \cite{pickrell07}. Other non-linear interactions between
different markers would also reduce the effectiveness of an
association test to determine informative markers.
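
A minimal worked example of such a neutralising interaction, under
assumptions made here purely for illustration: let $A$ and $B$ indicate
whether an individual carries each of two variants, let the two be
independent with carrier frequencies $p_A = p_B = 0.5$, and let the
trait $T$ be present only when exactly one of the two variants is
carried. Then
\[
  \Pr(T \mid A = 1) = \Pr(B = 0) = 0.5 , \qquad
  \Pr(T \mid A = 0) = \Pr(B = 1) = 0.5 ,
\]
and the same holds for $B$ by symmetry. Neither variant shows any
single-marker association with the trait, even though the pair
$(A, B)$ determines it completely.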

The ideal situation for investigating complex traits at a genetic
level is an analysis of the effectiveness of \emph{every possible} set
of marker interactions. Once such an analysis is carried out, the best
set of markers will be identified as being the set that is most
informative for classifying individuals into groups. However, the
computational requirements for such testing, combined with the
increased danger of overfitting due to small cell sizes, make such an
analysis effectively useless when carried out on the total marker set
\cite[see][]{province08}.

The bootstrapping approach as outlined here does not consider
combinations of genetic markers. However, it provides an efficient way
to reduce a large set of markers down to a much smaller set. This
smaller set can then be used by programs that determine multi-way
interactions, which are typically computationally expensive
procedures.