Bootstrap Distillation: Non-parametric Internal Validation of GWAS Results by Subgroup Resampling

David A. Eccles, Rodney A. Lea and Geoffrey K. Chambers


Genome-wide Association Studies are carried out on a large number of genetic variants in a large number of people, allowing the detection of small genetic effects that are associated with a trait. Natural variation of genotypes within populations means that any particular sample from the population may not represent the true genotype frequencies within that population. This may lead to the observation of marker-disease associations when no such association exists.

A bootstrap population sub-sampling technique can reduce the influence of allele frequency variation in producing false-positive results for particular samplings of the population. To evaluate its effectiveness on a serious disease, this sub-sampling method has been applied to the Type 1 Diabetes dataset from the Wellcome Trust Case Control Consortium.

While previous literature on Type 1 Diabetes has identified some DNA variants that are associated with the disease, these variants are not informative for distinguishing between disease cases and controls using genetic information alone (AUC=0.7284). Population sub-sampling filtered out noise from genome-wide association data, and increased the chance of finding useful associative signals. Subsequent filtering based on marker linkage and testing of marker sets of different sizes produced a 5-SNP signature set of markers for Type 1 Diabetes. The group-specific markers used in this set, primarily from the HLA region on chromosome 6, are considerably more informative than previously known associated variants for predicting T1D phenotype from genetic data (AUC=0.8395). Given this predictive quality, the signature set may be useful alone as a screening test, and would be particularly useful in combination with other clinical cofactors for Type 1 Diabetes risk.



Personalised medical treatment based on genome profiles is a major goal of genetic research in the \(21^{st}\) century (see Avery et al., 2009; Province et al., 2008). However, complex genotype-environment interactions for common diseases make it difficult to determine which specific genetic features should be used to construct such profiles. Hence the prediction of genetic risk is a major challenge of the \(21^{st}\) century.

The introduction of large-scale Single Nucleotide Polymorphism (SNP) genotyping systems has enabled genetic variants to be typed en-masse, shifting the main effort required in a genetic risk study from genotyping to data analysis (or bioinformatics). Here we investigate genetic markers for Type 1 Diabetes (T1D), demonstrating how a population sub-sampling method may assist in the identification of risk markers for a complex disease.
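As a rough illustration of the general idea behind population sub-sampling, the sketch below repeatedly resamples cases and controls and records, for each marker, the smallest case/control allele frequency difference seen in any sub-sample. A marker that only looks associated in one particular sampling should score near zero, while a consistently different marker should retain a large value. This is a minimal sketch on simulated data, not the procedure used in this study: the function names, the genotype coding (0, 1 or 2 copies of an allele), the minimum-difference statistic, the sub-sample fraction and the simulated allele frequencies are all assumptions made for illustration.

```python
import random


def allele_freq(genotypes):
    """Frequency of the counted allele, for genotypes coded 0, 1 or 2."""
    return sum(genotypes) / (2 * len(genotypes))


def bootstrap_min_delta(cases, controls, n_boot=100, frac=0.5, seed=1):
    """Smallest |case - control| allele frequency difference per marker
    across bootstrap sub-samples (hypothetical statistic, for illustration).

    cases, controls: lists of individuals; each individual is a list of
    genotypes (0/1/2), one entry per marker.
    """
    rng = random.Random(seed)
    n_markers = len(cases[0])
    k_case = max(1, int(frac * len(cases)))
    k_ctrl = max(1, int(frac * len(controls)))
    min_delta = [float("inf")] * n_markers
    for _ in range(n_boot):
        # Draw individuals with replacement from each group.
        sub_cases = rng.choices(cases, k=k_case)
        sub_ctrls = rng.choices(controls, k=k_ctrl)
        for m in range(n_markers):
            f_case = allele_freq([ind[m] for ind in sub_cases])
            f_ctrl = allele_freq([ind[m] for ind in sub_ctrls])
            min_delta[m] = min(min_delta[m], abs(f_case - f_ctrl))
    return min_delta


def simulate_group(rng, n, freqs):
    """Simulate n individuals; each genotype is the sum of two allele draws."""
    return [[(rng.random() < p) + (rng.random() < p) for p in freqs]
            for _ in range(n)]


# Simulated data (assumed values): marker 0 differs between groups,
# markers 1-4 have the same allele frequency in cases and controls.
sim_rng = random.Random(42)
cases = simulate_group(sim_rng, 200, [0.8, 0.5, 0.5, 0.5, 0.5])
controls = simulate_group(sim_rng, 200, [0.2, 0.5, 0.5, 0.5, 0.5])

min_delta = bootstrap_min_delta(cases, controls)
for marker, delta in enumerate(min_delta):
    print(f"marker {marker}: minimum frequency difference = {delta:.3f}")
```

Taking the minimum across sub-samples is a deliberately conservative choice: a marker is only retained if it separates cases from controls in every resampling, which penalises signals that depend on one particular draw from the population.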

Type 1 Diabetes


Type 1 Diabetes mellitus (T1D) is a disorder typically characterised by an absence of insulin-producing beta cells in the pancreas, either through loss of the cells themselves, or through the reduction in capacity of the cells to produce insulin (see Atkinson et al., 2014). This disorder shares with the more common Type 2 Diabetes mellitus (T2D) a characteristic symptom of high blood glucose levels. In some cases, this glucose also passes through to the urine, creating a sticky/sweet substance that attracts ants (see Ekoé et al., 2002, pp. 7,11). In T2D, this high blood glucose is caused by cells not responding to insulin (insulin resistance), while in T1D the excess is caused by a reduction in insulin production (insulin dependence).

The incidence of T1D varies throughout the world, with rates of incidence as low as 0.0006% per year in China, 0.02% in the UK, up to nearly 0.05% per year in Finland. About 50-60% of cases of T1D manifest in childhood (younger than 18 years), and the disease is believed to be caused by an abnormal immune response after exposure to environmental triggers such as viruses, toxins or food (see Daneman, 2006). While a spring birth is correlated with T1D risk, the diagnosis of Type 1 Diabetes is more common in autumn and winter (see Atkinson et al., 2014).

Symptoms, Diagnosis and Management of T1D


Typical symptoms of T1D include excess urine output (polyuria), thirst and increased fluid intake (polydipsia), blurred vision, and weight loss. When left untreated, this form of diabetes can lead to a build-up of ketone bodies and a reduction of blood pH (ketoacidosis), reducing mental faculties and causing a loss of consciousness (see Ekoé et al., 2002, p. 7).

Diabetes can be diagnosed by a single random blood glucose test (i.e. taken at any time of the day, as opposed to a fasting glucose test taken at least 8 hours after the last meal), as long as symptoms are present and blood glucose levels are found to be in excess (typically \(>11.1~\mathrm{mmol\,l^{-1}}\)) of those normally observed. In situations where symptoms are less obvious and/or glucose levels are at the high end of the normal range, a glucose tolerance test (GTT) is used for diagnosis. In this test, fasting patients have their blood glucose level tested, patients then consume a measured dose of oral glucose, and blood glucose levels are measured 2 hours later. A fasting glucose level in excess of \(6.1~\mathrm{mmol\,l^{-1}}\), or post-load level in excess of