Bootstrap Distillation: Non-parametric Internal Validation of GWAS Results by Subgroup Resampling

David A. Eccles, Rodney A. Lea and Geoffrey K. Chambers


Genome-wide Association Studies are carried out on a large number of genetic variants in a large number of people, allowing the detection of small genetic effects that are associated with a trait. Natural variation of genotypes within populations means that any particular sample from the population may not represent the true genotype frequencies within that population. This may lead to the observation of marker-disease associations when no such association exists.

A bootstrap population sub-sampling technique can reduce the influence of allele frequency variation in producing false-positive results for particular samplings of the population. To evaluate its effectiveness on a serious disease, this sub-sampling method has been applied to the Type 1 Diabetes dataset from the Wellcome Trust Case Control Consortium.

While previous literature on Type 1 Diabetes has identified some DNA variants that are associated with the disease, these variants are not informative for distinguishing between disease cases and controls using genetic information alone (AUC=0.7284). Population sub-sampling filtered out noise from genome-wide association data, and increased the chance of finding useful associative signals. Subsequent filtering based on marker linkage and testing of marker sets of different sizes produced a 5-SNP signature set of markers for Type 1 Diabetes. The group-specific markers used in this set, primarily from the HLA region on chromosome 6, are considerably more informative than previously known associated variants for predicting T1D phenotype from genetic data (AUC=0.8395). Given this predictive quality, the signature set may be useful alone as a screening test, and would be particularly useful in combination with other clinical cofactors for Type 1 Diabetes risk.



Personalised medical treatment based on genome profiles is a major goal of genetic research in the \(21^{st}\) century (see Avery et al., 2009; Province et al., 2008). However, complex genotype-environment interactions for common diseases make it difficult to determine which specific genetic features should be used to construct such profiles. Predicting genetic risk therefore remains a major challenge.

The introduction of large-scale Single Nucleotide Polymorphism (SNP) genotyping systems has enabled genetic variants to be typed en-masse, shifting the main effort required in a genetic risk study from genotyping to data analysis (or bioinformatics). Here we investigate genetic markers for Type 1 Diabetes (T1D), demonstrating how a population sub-sampling method may assist in the identification of risk markers for a complex disease.

Type 1 Diabetes


Type 1 Diabetes mellitus (T1D) is a disorder typically characterised by an absence of insulin-producing beta cells in the pancreas, either through loss of the cells themselves, or through the reduction in capacity of the cells to produce insulin (see Atkinson et al., 2014). This disorder shares with the more common Type 2 Diabetes mellitus (T2D) a characteristic symptom of high blood glucose levels. In some cases, this glucose also passes through to the urine, creating a sticky/sweet substance that attracts ants (see Ekoé et al., 2002, pp. 7,11). In T2D, this high blood glucose is caused by cells not responding to insulin (insulin resistance), while in T1D the excess is caused by a reduction in insulin production (insulin dependence).

The incidence of T1D varies throughout the world, with rates of incidence as low as 0.0006% per year in China, 0.02% in the UK, up to nearly 0.05% per year in Finland. About 50-60% of cases of T1D manifest in childhood (younger than 18 years), and the disease is believed to be caused by an abnormal immune response after exposure to environmental triggers such as viruses, toxins or food (see Daneman, 2006). While a spring birth is correlated with T1D risk, the diagnosis of Type 1 Diabetes is more common in autumn and winter (see Atkinson et al., 2014).
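Incidence is often reported per 100,000 people per year rather than as a percentage. The conversion for the figures quoted above can be sketched as follows; the function name is hypothetical and used only for illustration.

```python
def percent_to_per_100k(percent_per_year):
    """Convert an incidence in percent per year to cases per 100,000 per year."""
    # 1% of 100,000 people is 1,000 people, so multiply the percentage by 1,000.
    return percent_per_year * 1000


# Incidence figures quoted in the text (percent per year).
quoted = {"China": 0.0006, "UK": 0.02, "Finland": 0.05}
for country, percent in quoted.items():
    print(f"{country}: {percent_to_per_100k(percent):g} per 100,000 per year")
```

On this scale the quoted rates correspond to roughly 0.6 (China), 20 (UK) and 50 (Finland) new cases per 100,000 people per year.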

Symptoms, Diagnosis and Management of T1D


Typical symptoms of T1D include excess urine output (polyuria), thirst and increased fluid intake (polydipsia), blurred vision, and weight loss. When left untreated, this form of diabetes can lead to a build-up of ketone bodies and a reduction of blood pH (ketoacidosis), reducing mental faculties and causing a loss of consciousness (see Ekoé et al., 2002, p. 7).

Diabetes can be diagnosed by a single random blood glucose test (i.e. one taken at any time of the day, as opposed to a fasting glucose test taken at least 8 hours after the last meal), as long as symptoms are present and blood glucose levels are found to be in excess (typically \(>11.1~\mathrm{mmol~l^{-1}}\)) of those normally observed. In situations where symptoms are less obvious and/or glucose levels are at the high end of the normal range, a glucose tolerance test (GTT) is used for diagnosis. In this test, fasting patients have their blood glucose level tested, patients then consume a measured dose of oral glucose, and blood glucose levels are measured 2 hours later. A fasting glucose level in excess of \(6.1~\mathrm{mmol~l^{-1}}\), or a post-load level in excess of \(11.1~\mathrm{mmol~l^{-1}}\), is considered diagnostic for both forms of Diabetes Mellitus. Type 1 Diabetes (as distinct from T2D) encompasses a range of diseases that involve autoimmunity. It can be diagnosed by the presence of antibodies to glutamic acid decarboxylase, islet cells, insulin, or ICA512 (see Ekoé et al., 2002, p. 19).
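The glucose criteria described above can be written as a simple decision rule. The following is a minimal sketch for illustration only, not a clinical tool: the thresholds are those quoted in the text, while the function and argument names are hypothetical.

```python
RANDOM_THRESHOLD = 11.1     # mmol/l, random test (requires symptoms)
FASTING_THRESHOLD = 6.1     # mmol/l, fasting level in a GTT
POST_LOAD_THRESHOLD = 11.1  # mmol/l, 2 hours after the oral glucose dose


def meets_glucose_criteria(random_mmol=None, fasting_mmol=None,
                           post_load_mmol=None, symptoms_present=False):
    """Return True if any glucose criterion quoted in the text is met.

    Each measurement is optional; None means the test was not carried out.
    """
    # A single random test is only sufficient when symptoms are present.
    if random_mmol is not None and symptoms_present:
        if random_mmol > RANDOM_THRESHOLD:
            return True
    # Glucose tolerance test: either the fasting or the post-load level suffices.
    if fasting_mmol is not None and fasting_mmol > FASTING_THRESHOLD:
        return True
    if post_load_mmol is not None and post_load_mmol > POST_LOAD_THRESHOLD:
        return True
    return False


# Hypothetical readings: symptomatic patient with a random level of 12.0 mmol/l.
print(meets_glucose_criteria(random_mmol=12.0, symptoms_present=True))
```

The rule covers only the glucose thresholds. Distinguishing T1D from T2D relies on the antibody tests mentioned above, which are not modelled here.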

As the symptoms of T1D are caused by high blood glucose levels (hyperglycaemia) due to a lack of insulin, these symptoms can be relieved by the introduction of insulin into the blood. This is typically carried out by supplying measured doses of insulin via subcutaneous injections or by the use of insulin pumps (see Daneman, 2006). Individuals with T1D need a constant supply of insulin for survival, together with occasional insulin bursts to control variable blood glucose levels throughout the day (e.g. after meals). In contrast, individuals with T2D only require insulin for survival in rare cases (see Ekoé et al., 2002, p. 16). Slow-release insulin and consumption of foods with a low glycaemic index can help to reduce the extremes of T1D symptoms.

Improperly managed treatment can cause further medical complications in a diabetic patient. Too much insulin, excessive physical activity, or not enough dietary sugar can result in low blood glucose levels (hypoglycaemia), which produce short-term autonomic and neurological problems such as trembling, dizziness, blurred vision, and difficulty concentrating. Hypoglycaemia is treated either by ingestion of sugar, or by intravenous glucose in severe cases (see Daneman, 2006).

Complications of T1D


The initial symptoms of T1D are not usually severe, and the disease may progress for a few years before a diagnosis is made and treatment is given. However, long-term complications can appear when the disease is not managed appropriately (see Ekoé et al., 2002, p. 8). Retinal damage progresses in about 20-25% of individuals with T1D, with later stages causing retinal detachment and consequent loss of sight. Renal failure, indicated by high urinary protein levels, is also a problem in diabetic individuals. When individuals have these high levels, progression to end-stage renal disease occurs in about 50% of cases. Neural defects are also a potential complication of T1D, most commonly damage to peripheral nerves, leading to ulceration, poor healing and gangrene unless good care is taken of the body extremities (see Daneman, 2006).

Genetic Contribution to T1D Risk


Type 1 Diabetes has a heritability of around 88% (Hyttinen 2003), indicating that a substantial proportion of variance in disease susceptibility can be attributed to genetic factors. About 50% of the genetic contribution to T1D can be accounted for by variation in the HLA region on chromosome 6, and 15% is accounted for by variation in two other genes, IDDM2 and IDDM12 (see Daneman, 2006). Incidence rates in migrant populations quickly converge to those of the background population, suggesting that although the genetic contribution to the disease is high, environmental factors probably play a significant role in triggering the onset of disease (see Daneman, 2006).

Wellcome Trust Case Control Consortium Study


The Wellcome Trust Case Control Consortium (WTCCC) was established in 2005 to identify novel genetic variants associated with seven common diseases, including Type 1 Diabetes (Wellcome Trust Case Control Consortium 2007). For the WTCCC, 2000 individuals with T1D and 1500 individuals from the National Blood Service (NBS) were genotyped using an Affymetrix GeneChip 500k Mapping Array Set. The study also typed 2000 individuals for each of the six other diseases: a total of 14,000 cases genotyped for seven diseases.

The Wellcome Trust Case Control Consortium (2007) reported associations near five gene regions that had been previously associated with T1D: the major histocompatibility complex (MHC) on chromosome 6, CTLA4 and IFIH1 on chromosome 2, PTPN22 on chromosome 1, and IL2RA on chromosome 10. The insulin gene (INS) on chromosome 11 was also associated with T1D; the only SNP tagging INS failed quality control filters, but indicated strong association with T1D when examined. A number of other regions showed evidence of association with T1D in the Wellcome Trust Case Control Consortium (2007) study: 4q27 (chromosome 4); 10p15 (chromosome 10); 12p13, 12q13 and 12q24 (chromosome 12); 16p13 (chromosome 16); and 18p11 (chromosome 18). Most of these regions include genes involved in the immune system. However, only two genes are in 16p13, and both have unknown functions (KIAA0350 and dexamethasone-induced transcript). The strongest association signal for T1D was detected within the HLA region of chromosome 6. Multiple SNPs in this region had strong associations with T1D, but only one of them (rs9272346) was reported in the results table of the strongest associations (see Wellcome Trust Case Control Consortium, 2007, table 3).

Replication Issues in GWAS


The Genome-wide Association Study (GWAS) is a common method for discovering genetic contributions to complex human diseases. The aim of these studies is to determine the degree of association between single genetic markers and a heritable trait. Commonly, an analysis is carried out on a large number of genetic variants in a large number of people, allowing the detection of small genetic effects that are associated with a trait. In recent years, the initial search for variants has often been carried out by whole-genome sequencing in a small sub-population, to identify variants that are common in the population of interest.
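As an illustration of what a single-marker association test involves, the sketch below applies a Pearson chi-squared test (1 degree of freedom) to a 2x2 table of allele counts in cases and controls. This is one common choice of test, assumed here for illustration; it is not necessarily the statistic used in the WTCCC analysis, and the allele counts in the usage example are invented.

```python
import math


def allelic_chi_square(case_a, case_b, control_a, control_b):
    """Pearson chi-squared test on a 2x2 table of allele counts.

    Rows are cases/controls, columns are alleles A/B.
    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = case_a + case_b + control_a + control_b
    row_cases = case_a + case_b
    row_controls = control_a + control_b
    col_a = case_a + control_a
    col_b = case_b + control_b
    if 0 in (row_cases, row_controls, col_a, col_b):
        raise ValueError("every row and column total must be non-zero")
    # Shortcut formula for a 2x2 table: n(ad - bc)^2 / (product of margins).
    stat = n * (case_a * control_b - case_b * control_a) ** 2 / (
        row_cases * row_controls * col_a * col_b)
    # Upper tail of chi-squared with 1 d.f.: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Invented counts: 2000 cases (4000 alleles) and 1500 controls (3000 alleles).
stat, p = allelic_chi_square(1400, 2600, 900, 2100)
print(f"chi-squared = {stat:.2f}, p = {p:.2e}")
```

In a genome-wide setting this test is repeated for every marker, so the threshold applied to each p-value has to account for the very large number of tests.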

A study style that is built around correlation and association rather than a hunt for causal variants requires extreme care to ensure that observed associations are valid and causal. Studies need to have good within-study validation to reduce the likelihood of false-positive results being obtained and treated as true associations, and need to be supported by good independent validation. The distinction between association and causation is important – GWAS are used as hypothesis-generating tools to narrow down, through association, the search for potential causative loci. After the associations have been validated, it is expected that they will be followed up with studies attempting to determine the true causative status of that association. Such causative studies are difficult, and progress towards understanding the aetiology of common disease has been slow (see Dermitzakis et al., 2009).

Sampling Errors in GWAS


Natural variation of genotypes within populations means that any particular sample from the population may not represent the true genotype frequencies within that population. This may lead to the observation of marker-disease associations when no such association exists. This is particularly important when considering populations with mixed ancestry, where markers that are informative for distinguishing population ancestry may become accidentally associated with a particular disease (see Pritchard et al., 2001).
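The effect of sampling variation can be seen in a small simulation. In the sketch below, cases and controls are drawn from the same population with the same true allele frequency, so any observed frequency difference arises by chance alone. The sample sizes, allele frequency, number of markers and function names are all assumptions chosen for illustration, and alleles are drawn independently (i.e. Hardy-Weinberg proportions are assumed).

```python
import random


def sample_allele_freq(rng, true_freq, n_individuals):
    """Allele frequency observed in one random sample of diploid individuals."""
    n_alleles = 2 * n_individuals
    # Each allele is an independent draw with probability true_freq.
    count = sum(1 for _ in range(n_alleles) if rng.random() < true_freq)
    return count / n_alleles


def null_frequency_differences(true_freq, n_cases, n_controls, n_markers, seed=1):
    """Absolute case-control frequency differences when no association exists."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_markers):
        case_freq = sample_allele_freq(rng, true_freq, n_cases)
        control_freq = sample_allele_freq(rng, true_freq, n_controls)
        diffs.append(abs(case_freq - control_freq))
    return diffs


diffs = null_frequency_differences(0.3, n_cases=200, n_controls=150, n_markers=500)
print(f"largest chance difference across {len(diffs)} markers: {max(diffs):.3f}")
```

Because the largest difference is taken over many markers, it grows with the number of markers tested even though none of them is truly associated with the trait.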

Bootstrapping by repeated re-sampling of a representative draw made from a group can estimate population variation in genotype frequencies by observing variation within the sub-samples. A re-sampling technique, as presented here, can reduce the influence of allele frequency variation by excluding false-positive results that are specific for particular samplings of the population.
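The general idea can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure used in this study: individuals are re-sampled with replacement within each group, the case-control allele frequency difference is recomputed for each replicate, and a marker is kept only if a 95% percentile interval of those differences excludes zero. The function names, the number of replicates and the interval rule are all hypothetical choices.

```python
import random


def allele_freq(genotypes):
    """Frequency of the counted allele; each genotype is 0, 1 or 2 copies."""
    return sum(genotypes) / (2 * len(genotypes))


def bootstrap_freq_difference(cases, controls, n_replicates=1000, seed=0):
    """95% percentile interval of the case-control allele frequency difference."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_replicates):
        # Re-sample individuals with replacement, separately within each group.
        case_sample = rng.choices(cases, k=len(cases))
        control_sample = rng.choices(controls, k=len(controls))
        diffs.append(allele_freq(case_sample) - allele_freq(control_sample))
    diffs.sort()
    lower = diffs[int(0.025 * (n_replicates - 1))]
    upper = diffs[int(0.975 * (n_replicates - 1))]
    return lower, upper


def is_robust(cases, controls, **kwargs):
    """True if the bootstrap interval for the difference excludes zero."""
    lower, upper = bootstrap_freq_difference(cases, controls, **kwargs)
    return lower > 0 or upper < 0


# Invented genotypes for one marker with a large frequency difference.
cases = [2] * 90 + [0] * 10
controls = [0] * 90 + [2] * 10
print(is_robust(cases, controls))
```

A marker whose apparent association depends on a few individuals in one particular sampling will tend to have an interval that spans zero, and is filtered out under this rule.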