Julia Olivieri

and 2 more

Background/Introduction
HLA disease association studies - what is known?
HLA*IMP:02
Challenges with HLA

Results
Single variant association analysis
Causal allele identification (BMA)
Non-additive effects
Interaction effects
HLA-HLA interactions
HLA-non-HLA interactions

Methods

Phenotyping
Chris to do

Allelotyping
To establish causality between common diseases and HLA allelotypes, we first need data on the HLA allelotypes present in a group of individuals. The data provided by the UK Biobank include imputed HLA allelotypes for all individuals. The HLA region has traditionally resisted imputation because of its extensive genetic variation; for example, HLA-B has dozens of common alleles and more than 2000 rare alleles. The HLA region also shows unusually high linkage disequilibrium, which means that common models used to impute other regions are not well suited to HLA. In the UK Biobank study, alleles at the HLA genes HLA-A, -B, -C, -DRB5, -DRB4, -DRB3, -DRB1, -DQB1 and -DQA1 were imputed using HLA*IMP:02, which is based on a graphical model of the haplotype structure of the MHC region. HLA*IMP:02 uses a wider variety of reference panels to allow more accurate imputation, and takes haplotype uncertainty into account. To verify that the imputed HLA alleles were accurate, the UK Biobank study included a standard logistic-regression association analysis, on a white British ancestry subset, of 11 immune-mediated diseases known to have connections to the HLA region. For this analysis, the allele pair for each individual was inferred to be the pair with the highest posterior probability under HLA*IMP:02. For each of these diseases, the HLA allele with the strongest signal was consistent with the literature.
Julia to do

Processing and filtering of HLA genotypes
After allelotyping, the UK Biobank reported dosage values for 362 HLA allelotypes across 11 HLA genes, collected from 488,378 individuals.
After subsetting to the white British cohort, we were left with [num] individuals. We excluded all allelotypes that were imputed to be nonzero fewer than six times in our analysis, which left us with 312 allelotypes (50 were excluded). We then rounded dosages within 0.1 of an integer to that integer and marked the remaining nonzero entries as missing data (so, for example, a dosage of 0.93 would become 1, while a dosage of 1.32 would be marked missing). If an individual had no missing data at a gene and yet their dosages across that gene's allelotypes did not sum to 2, we recorded all nonzero values for that individual/gene combination as missing data. Because the true value for each individual is integral (a person carries a whole number of copies of a given allelotype), this processing makes our data more representative of the true genotypes, lets us discard calls that we are not confident in, and yields a format that can be fed directly into popular regression software such as PLINK.
Julia to do

Association analysis
To find nominal p-values for each phenotype/allele combination, we ran PLINK analyses on [phenotype number] phenotypes across all 362 HLA allelotypes. We also ran three types of logistic regression in R: one using the unrounded dosage file, one using the rounded dosage file, and one in which the dosages are treated as factors, allowing for non-additive effects.
Julia to do

Causal allele identification
To find allelotypes that are causally related to the phenotypes we are considering, we took a Bayesian model averaging (BMA) approach using the BMA package in R. In this method a model is fitted to each possible subset of the given allelotypes, and the results of each sufficiently well-supported model are reported, along with the posterior probability that the given model is correct.
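To make the subset-based idea concrete, the following is a minimal sketch in Python (not the R code used in this analysis). It enumerates every subset of a small set of allelotypes and converts a per-model score into approximate posterior model probabilities. It assumes equal prior probability for every model and the common approximation that a model's posterior probability is proportional to exp(-BIC/2). The function `toy_bic`, the labels `A1` to `A3`, and all numbers are hypothetical placeholders, not results from our data; in a real analysis the BIC would come from fitting a logistic regression on each subset.

```python
from itertools import combinations
from math import exp


def all_subsets(items):
    """Yield every subset of `items` as a frozenset (2**n subsets in total)."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def posterior_model_probs(bic_by_model):
    """Convert per-model BIC values into approximate posterior probabilities.

    Assumption: uniform prior over models and P(model | data) proportional
    to exp(-BIC / 2). The best (lowest) BIC is subtracted first so that the
    exponentials stay numerically stable.
    """
    best = min(bic_by_model.values())
    weights = {m: exp(-(bic - best) / 2.0) for m, bic in bic_by_model.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


def inclusion_probs(model_probs, items):
    """Posterior probability that each item appears in the true model."""
    return {
        item: sum(p for model, p in model_probs.items() if item in model)
        for item in items
    }


def toy_bic(model):
    """Hypothetical stand-in for a fitted model's BIC (illustration only).

    Every extra allelotype costs a penalty of 2, and including the made-up
    allelotype "A1" improves the fit by 12.
    """
    return 100.0 + 2.0 * len(model) - (12.0 if "A1" in model else 0.0)


alleles = ["A1", "A2", "A3"]  # hypothetical labels
bics = {model: toy_bic(model) for model in all_subsets(alleles)}
probs = posterior_model_probs(bics)
for allele, p in inclusion_probs(probs, alleles).items():
    print(allele, round(p, 3))
```

The number of models doubles with each additional allelotype, so the same enumeration over 10 allelotypes visits 1,024 subsets.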
Because this approach is computationally intractable for too large a set of allelotypes (a set of n allelotypes requires analyzing 2^n models), we decided that for each phenotype we would perform the BMA analysis on the 10 allelotypes with the lowest p-values from the PLINK analysis. We chose the PLINK p-values rather than the R p-values because the R p-values are capped at 2 × 10^-16, and for some phenotypes we observed more than 10 allelotypes with p-values less than or equal to this value.
Bayesian model average etc.
Julia to do

Additivity analysis
Chris to do

Interaction effects