Introduction
Huntsman Cancer Institute (HCI) of the University of Utah is the
official Cancer Center of Utah. HCI serves as the sole
National Cancer Institute-designated Cancer Center for much of the vast
Mountain West region (Utah, Idaho, Montana, Nevada, and Wyoming), which
encompasses 17% of the landmass of the continental United States. HCI
provides patient and prevention education through three community clinics
in the surrounding area and six affiliate hospitals in neighboring
states 1. With a long history of germline genetics
research, HCI maintains a large collection of variant data.
While storing variant data can be a simple matter, keeping that stored
collection up-to-date is not. Mutation nomenclature versioning,
reference sequences, and scientific discovery all contribute to the
ever-changing nature of variant data. One fairly representative example
is the evolving nomenclature of ARSA variants. A number
of these variants are disease-causing and can lead to Metachromatic
Leukodystrophy (MLD). A study by Cesani et al. in 2015 focused on an
updated recommendation in mutation nomenclature guidelines to ascribe
the A of the first ATG translational initiation codon as nucleotide +1.
They found that most ARSA variants had been reported before this
recommendation was made and were described based on the processed mature
protein, which differs from the translated protein by six nucleotides at
the 5′-terminus of the cDNA sequence and two amino acids at the
N-terminus. One example they cited was the common disease-causing
variant historically referred to as 1277C>T (Pro426Leu), which under
the updated guidelines should be named c.1283C>T
(p.Pro428Leu) 2. This is one of thousands of
instances that highlight the burden of keeping variant data
current.
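The renumbering in the ARSA example can be sketched in a few lines of Python. This is a minimal illustration, not part of the study: the function name and regular expressions are hypothetical, the offsets (six nucleotides, two amino acids) are the ones stated above for ARSA, and only simple substitutions written in the legacy style are handled.

```python
import re

# Offsets between the legacy (mature-protein) numbering and the ATG-based
# numbering, as described above for ARSA. Other genes would need other values.
ARSA_CDNA_OFFSET = 6      # nucleotides
ARSA_PROTEIN_OFFSET = 2   # amino acids

# Hypothetical patterns for legacy-style substitutions such as
# "1277C>T" and "Pro426Leu"; anything else is rejected.
_CDNA_SUB = re.compile(r"^(\d+)([ACGT])>([ACGT])$")
_PROTEIN_SUB = re.compile(r"^([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def renumber_substitution(legacy_cdna, legacy_protein,
                          cdna_offset=ARSA_CDNA_OFFSET,
                          protein_offset=ARSA_PROTEIN_OFFSET):
    """Shift a legacy substitution to ATG-based numbering (illustrative helper)."""
    c = _CDNA_SUB.match(legacy_cdna)
    p = _PROTEIN_SUB.match(legacy_protein)
    if c is None or p is None:
        raise ValueError("only simple substitutions are handled in this sketch")
    cdna = f"c.{int(c.group(1)) + cdna_offset}{c.group(2)}>{c.group(3)}"
    protein = f"p.{p.group(1)}{int(p.group(2)) + protein_offset}{p.group(3)}"
    return cdna, protein


print(renumber_substitution("1277C>T", "Pro426Leu"))
# → ('c.1283C>T', 'p.Pro428Leu')
```

A fixed offset of this kind applies only where the legacy and current numbering differ by a constant; real collections also contain insertions, deletions, and intronic variants, which is part of why manual upkeep becomes burdensome.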
One popular software tool to support research on variant data is the
Leiden Open Variation Database (LOVD) 3. The database
was initially created in 2004; LOVD 2.0 was released in 2007 and
the current version, 3.0, in 2012. This service provides
local access to an immense amount of gene/disease annotations with the
data all linked to a centralized online LOVD database. While local
records can be added using submission templates, there is limited
flexibility for custom columns, no support for merging records, no
synonymous HGVS record detection, and limited history tracking of edits
made to the data.
A growing number of public genetic variant databases also
provide insightful annotations. ClinVar 4,
ClinGen 5, dbSNP 6,
dbVar 7, HGMD 8,
gnomAD 9, CIViC 10, OMIM 11, and COSMIC 12 all fill particular niches in the field of medical genetics and have
independent funding and partnerships. REST-based tools have also been
created to provide mapping services across the different
identifiers used by these databases. The ClinGen Allele
Registry 13 and MyVariant.info 14 are two prominent tools that
offer this service. These tools allow interested parties to query the
“current” knowledge about a particular variant and benefit from
synonym detection and rich results drawn from across several
databases. These public services are widely used by the research
community but are built around single queries and are not designed for the
longitudinal maintenance of local variant collections.
The Variation Representation Specification (VRS) is being developed by
the Global Alliance for Genomics and Health (GA4GH) 15.
Currently at version 1.1, VRS makes several
contributions: a terminology and information model that ensures
precise computational definitions for biological concepts in fields,
semantics, objects, and object relationships; a machine-readable schema
that enables language-agnostic tests of compliance with the
information model; conventions that promote reliable data
sharing, such as fully justified allele normalization; computed
identifiers that allow data providers and consumers to
generate consistent, globally unique identifiers for
variation without a central authority; and a Python implementation that
demonstrates the proper implementation of the specification and
facilitates the translation of existing variant representations into
VRS. The addition of VRS identifiers to any variant collection will
prove to be critical as the variant community moves toward a more
computationally stringent system of linking variant knowledge and
exchanging variant data across institutions.
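The idea behind computed identifiers can be sketched briefly. VRS describes a truncated digest, sha512t24u (SHA-512, truncated to 24 bytes, base64url-encoded), applied to a canonical serialization of the variation object. The sketch below implements that digest with the Python standard library only. The serialization step is a deliberate simplification (sorted-key compact JSON), and the `example:VA.` prefix, the `illustrative_identifier` function, and the variant fields are all made up for illustration, so the resulting strings are not valid VRS identifiers.

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, then base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def illustrative_identifier(variant: dict, prefix: str = "example:VA.") -> str:
    """Derive a deterministic identifier from a variant description.

    Assumption: sorted-key compact JSON stands in for the canonical
    serialization defined by the specification, which has additional rules.
    """
    blob = json.dumps(variant, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return prefix + sha512t24u(blob)


# Two parties describing the same (made-up) variant with different key order
# arrive at the same identifier without consulting a central registry.
site_a = {"sequence": "example-transcript", "start": 1282, "end": 1283, "alt": "T"}
site_b = {"alt": "T", "end": 1283, "start": 1282, "sequence": "example-transcript"}
print(illustrative_identifier(site_a) == illustrative_identifier(site_b))
# → True
```

Because the identifier depends only on the content of the record, any institution that normalizes and serializes a variant the same way can regenerate it independently, which is what makes such identifiers useful for linking collections.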
This study had two main objectives. The first was to analyze the HCI
variant dictionary and identify trends that might indicate unmet
needs. The second was to create an open-source and
institution-agnostic tool to address these needs and otherwise
facilitate the management of variant collections.