Introduction
Huntsman Cancer Institute (HCI) of the University of Utah is the official Cancer Center of Utah. HCI is the only, and thus the nearest, National Cancer Institute-designated Cancer Center for much of the vast Mountain West region (Utah, Idaho, Montana, Nevada, and Wyoming), which encompasses 17% of the landmass of the continental United States. HCI provides patient and prevention education from three community clinics in the surrounding area and six affiliate hospitals in neighboring states 1. With a long history of germline genetics research, HCI maintains a large collection of variant data.
While storing variant data can be a simple matter, keeping that stored collection up to date is not. Changes in mutation nomenclature, revisions to reference sequences, and new scientific discoveries all contribute to the ever-changing nature of variant data. A fairly representative example is the evolving description of ARSA variants. A number of these variants are disease-causing and can lead to Metachromatic Leukodystrophy (MLD). A 2015 study by Cesani et al. focused on an updated recommendation in the mutation nomenclature guidelines to designate the A of the first ATG translational initiation codon as nucleotide +1. They found that most ARSA variants had been reported before this recommendation was made and were described based on the processed mature protein, which differs from the translated protein by six nucleotides at the 5′-terminus of the cDNA sequence and two amino acids at the N-terminus. For example, they noted that the common autosomal dominant-causing variant historically referred to as 1277C>T (Pro426Leu) should be named c.1283C>T (p.Pro428Leu) 2. This is one of thousands of instances that highlight the arduous burden of keeping variant data current.
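The renumbering in the ARSA example can be illustrated with a short sketch. It assumes only what is stated above: legacy positions lag the ATG-based numbering by six nucleotides and two amino acids. The function names are hypothetical and the code handles only simple substitutions. A constant offset of this kind applies to this specific ARSA case and is not a general conversion method.

```python
import re

# Offsets stated in the text for legacy ARSA numbering (mature protein)
# versus numbering from the A of the initiating ATG codon.
NUCLEOTIDE_OFFSET = 6
AMINO_ACID_OFFSET = 2


def legacy_to_hgvs_cdna(legacy: str, offset: int = NUCLEOTIDE_OFFSET) -> str:
    """Shift a legacy substitution such as '1277C>T' to 'c.1283C>T'."""
    match = re.fullmatch(r"(\d+)([ACGT])>([ACGT])", legacy)
    if match is None:
        raise ValueError(f"unsupported legacy cDNA description: {legacy!r}")
    position, ref, alt = match.groups()
    return f"c.{int(position) + offset}{ref}>{alt}"


def legacy_to_hgvs_protein(legacy: str, offset: int = AMINO_ACID_OFFSET) -> str:
    """Shift a legacy substitution such as 'Pro426Leu' to 'p.Pro428Leu'."""
    match = re.fullmatch(r"([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})", legacy)
    if match is None:
        raise ValueError(f"unsupported legacy protein description: {legacy!r}")
    ref, position, alt = match.groups()
    return f"p.{ref}{int(position) + offset}{alt}"


if __name__ == "__main__":
    print(legacy_to_hgvs_cdna("1277C>T"))       # c.1283C>T
    print(legacy_to_hgvs_protein("Pro426Leu"))  # p.Pro428Leu
```

Even in this simple case, the conversion depends on knowing which numbering convention the original report used, which is the kind of context that is easily lost in a stored variant collection.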
One popular software tool to support research on variant data is the Leiden Open Variation Database (LOVD) 3. LOVD was initially created in 2004, followed by LOVD 2.0 in 2007 and the current version, 3.0, in 2012. This service provides local access to a large body of gene/disease annotations, with the data all linked to a centralized online LOVD database. While local records can be added using submission templates, there is limited flexibility for custom columns, no support for merging records, no detection of synonymous HGVS records, and limited history tracking of edits made to the data.
There are also a growing number of public genetic variant databases, which include insightful annotations. ClinVar 4, ClinGen 5, dbSNP 6, dbVar 7, HGMD 8, gnomAD 9, CIViC 10, OMIM 11, and COSMIC 12 all fill particular niches in the field of medical genetics and have independent funding and partnerships. There have also been REST-based tools created to provide mapping services across the different identifiers used by these databases. The ClinGen Allele Registry 13 and MyVariant.info 14 are two predominant tools that offer this service. These tools allow interested parties to query the “current” knowledge about a particular variant and benefit from synonym detection and a rich result drawn from across the several databases. These public services are widely used by the research community but are single-query based and not designed for the longitudinal maintenance of local variant collections.
The Variation Representation Specification (VRS) is being developed by the Global Alliance for Genomics and Health (GA4GH) 15. Currently at version 1.1, its second release, VRS makes several contributions: a terminology and information model that provides precise computational definitions of biological concepts through fields, semantics, objects, and object relationships; a machine-readable schema that enables language-agnostic tests of compliance with the information model; conventions that promote reliable data sharing, such as fully justified allele normalization; globally unique computed identifiers that allow data providers and consumers to generate consistent identifiers for variation without a central authority; and a Python implementation that demonstrates the proper use of the specification and facilitates the translation of existing variant representations into VRS. The addition of VRS identifiers to any variant collection will prove critical as the variant community moves toward a more computationally stringent system of linking variant knowledge and exchanging variant data across institutions.
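The idea behind computed identifiers can be sketched with the truncated digest that GA4GH uses (SHA-512, truncated to 24 bytes, base64url-encoded). This is a minimal illustration using only the Python standard library, not the reference implementation. The helper `make_identifier` and the placeholder payload are hypothetical: a real VRS identifier is computed over the canonical serialization defined by the specification, which is not reproduced here.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """Return the base64url encoding of the first 24 bytes of SHA-512(blob)."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def make_identifier(type_prefix: str, serialized: bytes) -> str:
    """Hypothetical helper: build an identifier of the form ga4gh:<prefix>.<digest>.

    `serialized` must already be the canonical serialization of the object;
    producing that serialization is outside the scope of this sketch.
    """
    return f"ga4gh:{type_prefix}.{sha512t24u(serialized)}"


if __name__ == "__main__":
    # Placeholder payload for illustration only, not a valid VRS serialization.
    payload = b'{"placeholder":"not a real VRS object"}'
    print(make_identifier("VA", payload))
```

Because the identifier is derived solely from the content of the object, two institutions that serialize the same variant in the same way obtain the same identifier without consulting a central registry.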
This study had two main objectives. The first was to analyze the HCI variant dictionary and discover trends that might indicate needs not currently filled. The second was to create an open-source and institution-agnostic tool to address these needs and otherwise facilitate the management of variant collections.