Matching algorithm
When the server receives a matching request from an external node
(external matching) or from a user wishing to match a specific patient
against all other patients on the server (internal matching), the query
triggers a matching algorithm, which computes the similarity between the
query patient and all patients stored in the database. As in other MME
implementations, and in accordance with the MME API specification,
patient similarity is measured by a similarity score between 0 (no
matching features) and 1 (exact match of all of the patient's features).
The maximum number of patients returned by the server can be customized
by editing the “MAX_RESULTS” field in the app settings; the default
value is 5. Patient matches are returned in order of
descending similarity with the query patient (i.e., high similarity
matches are presented first in the list of results). The similarity
score computed by PatientMatcher takes into account both genotype and
phenotype similarity between patients. These two factors are quantified
as a GTScore and a PhenoScore, which are added together to give the
total similarity score (result score) between the query and the matched
patient. The relative importance of
GTScore and PhenoScore in the computation can be customized by the
server administrator by modifying the values of the parameters named
“MAX_GT_SCORE” and “MAX_PHENO_SCORE” in the app configuration
settings. The default value for both parameters is 0.5, meaning that
phenotype and genotype similarity have an equal impact on the result
score. This design choice addresses the diverse requirements of
different data contributors. For example, a clinical laboratory might be
storing patient genetic information with little availability of
diagnoses or phenotype terms. In that case it makes sense to set the
weight of the phenotype matching to zero and rely on genotype matching
only. On the other hand, national regulations might prohibit sharing
detailed genetic information, such as variant details, and allow only
gene symbols to be shared. If detailed patient diagnoses are also
available for these
patients, using both GTScore and PhenoScore when running the similarity
algorithm will increase the chances of producing meaningful matches.
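The way the two scores combine into a ranked result list can be illustrated with a minimal sketch. It assumes that the result score is the plain sum of GTScore and PhenoScore, each already capped at its configured maximum, and that patients with a total score of zero are not returned. The function name rank_matches and the data layout are hypothetical and are not taken from the PatientMatcher code base.

```python
# Settings named in the text, with their default values.
MAX_RESULTS = 5
MAX_GT_SCORE = 0.5
MAX_PHENO_SCORE = 0.5


def rank_matches(scored_patients, max_results=MAX_RESULTS):
    """Return the best matches sorted by descending result score.

    scored_patients maps a patient ID to a (gt_score, pheno_score) pair.
    Each component is assumed to be already capped at its maximum.
    """
    results = []
    for patient_id, (gt_score, pheno_score) in scored_patients.items():
        total = gt_score + pheno_score
        if total > 0:  # assumption: patients with no similarity are dropped
            results.append({"patient": patient_id, "score": total})
    # Highest similarity first, truncated to the configured maximum
    results.sort(key=lambda r: r["score"], reverse=True)
    return results[:max_results]


# Hypothetical scores for four database patients
scores = {
    "P1": (0.5, 0.25),
    "P2": (0.125, 0.0),
    "P3": (0.0, 0.0),
    "P4": (0.5, 0.5),
}
ranked = rank_matches(scores, max_results=2)
```

With these inputs only the two highest-scoring patients are kept. Setting MAX_PHENO_SCORE to zero, as in the clinical laboratory example, would amount to passing 0.0 as the second element of every pair.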
2.1.1 Genotype Matching Algorithm
When the parameter MAX_GT_SCORE is set to a value higher than zero
and the query data contains genotype features (gene or variant
information), a genotype similarity score is evaluated between the
query patient and every patient (matched patient) in the database. All
patients matching at least one of the candidate genes present in the
query are initially selected as matches. As specified in the MME API,
candidate genes should preferably be described by an Ensembl ID (e.g.,
“ENSG00000101680”), but it is also possible to query the database with
genes represented by HGNC symbols (e.g., “LAMA1”) or Entrez IDs (e.g.,
“6481”). The algorithm is
designed to assign higher matching scores to patients described by fewer
genotype features. For instance, a query patient described by a single
gene (A) that matches a database patient described by the same gene (A)
will produce a higher genotype score than a query patient described by
two genes (A and B). The genotype score (GT_SCORE) is quantified by the
formula:
GT_SCORE = MAX_GT_SCORE / ∑fs
This number is calculated by dividing the MAX_GT_SCORE by the sum of
the feature scores (fs) measured from the matching of each genotype
feature of the query patient against a matching patient. For example,
according to this definition, assuming a MAX_GT_SCORE of 0.5, each
gene from a patient described by 3 genes can reach a maximum fs of a
third of 0.5 (0.1666). If a gene from the query patient does not match
any gene of the matched patient, the fs for that feature is 0. In the
case of an exact match of both gene and variant, the fs is assigned the
highest possible value for the feature (0.1666). Incomplete gene matches
(the gene matches but the variant does not, or no variant metadata is
available for the provided genes) are assigned an arbitrary value of a
quarter of the maximum fs for the feature (0.1666/4). By calculating the
GT_SCORE in this manner, the algorithm
produces an accurate numerical estimate of the similarity between all
genotype features of matching patients. This, in turn, allows the server
to return patient hits sorted by descending genetic similarity with the
query patient and not simply all patients that match any of its genes.
PatientMatcher can also score matching variants that are located outside
genes. Feature scores for such variants are assigned the same fs as
exact (gene + variant) matches. The genotype matching algorithm also
includes a liftover functionality that makes it possible to quantify the
similarity between patients whose genomic features are described in
different genome builds.
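The scoring rules described in this section can be sketched in a few lines of Python. The sketch follows the worked example above and assumes that each query feature can reach at most MAX_GT_SCORE divided by the number of query features, and that the per-feature scores are added up to give the genotype score. The function name, the dictionary layout and the variant tuples are hypothetical; liftover between genome builds is not covered.

```python
def genotype_score(query_features, candidate_features, max_gt_score=0.5):
    """Score a candidate patient against the genotype features of a query.

    Each feature is a dict with an optional "gene" ID and an optional
    "variant" (any hashable description, for example a tuple).
    """
    if not query_features or max_gt_score <= 0:
        return 0.0
    # Highest feature score (fs) a single query feature can reach
    max_fs = max_gt_score / len(query_features)
    cand_genes = {f["gene"] for f in candidate_features if f.get("gene")}
    cand_variants = {f["variant"] for f in candidate_features if f.get("variant")}

    total = 0.0
    for feature in query_features:
        gene = feature.get("gene")
        variant = feature.get("variant")
        if variant is not None and variant in cand_variants:
            # Exact match; assumed to cover variants outside genes as well
            total += max_fs
        elif gene is not None and gene in cand_genes:
            # Gene-only match: a quarter of the maximum fs
            total += max_fs / 4
        # A feature without any match contributes 0
    return total


# Placeholder variant descriptions (chromosome, position, ref, alt)
query = [
    {"gene": "ENSG00000101680", "variant": ("18", 1000, "A", "G")},
    {"gene": "GENE_B"},
    {"gene": "GENE_C"},
]
candidate = [
    {"gene": "ENSG00000101680", "variant": ("18", 1000, "A", "G")},
    {"gene": "GENE_B", "variant": ("1", 2000, "C", "T")},
]
score = genotype_score(query, candidate)  # one exact, one gene-only, one miss
```

Under these assumptions a single-gene query matched on the gene alone scores 0.5/4, whereas a two-gene query with the same single match scores 0.25/4, which reproduces the preference for patients described by fewer features.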
2.1.2 Phenotype Matching Algorithm
PatientMatcher calculates phenotype matching scores based on both
patient features and disorders. Patient features are described by HPO
terms (Köhler et al., 2014) provided for query and matched patients,
while disorders are represented by Decipher (Firth et al., 2009),
OMIM (Hamosh et al., 2000) or Orphanet (Pavan et al., 2017) entries. If
both patients being compared contain features and disorders, both
descriptors are taken into account and each contributes half of the
resulting phenotype score (PHENO_SCORE). When disorders are not provided
for one or both patients, only the similarity between HPO features is
considered in the computation. While the comparison of disease terms in
the algorithm is still relatively simple (it consists of a pairwise
comparison of diagnoses between the patients), semantic similarity
between HPO terms and their ancestor terms is calculated as a simGIC
measure (Pesquita, 2007; Pesquita et al.,
2008). The original algorithm used for creating the phenotype ontology
and comparing the patients in PatientMatcher is available in the
Patient-Similarity package
(https://github.com/buske/patient-similarity). Since the HPO project
curates resources linking disease terms to their associated HPO
entries, we envision that in future software releases disease similarity
will also be calculated from semantic relationships between terms.
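As an illustration of the phenotype score, the sketch below computes a simGIC-style similarity on a toy ontology and combines it with a disorder similarity. simGIC is taken here as the summed information content (IC) of the terms shared by both patients, ancestors included, divided by the summed IC of all terms of either patient. The term IDs, parent links and IC values are invented for the example and are not real HPO data. The disorder similarity is passed in as a number, since the exact pairwise comparison is not specified here, and the sketch assumes that HPO features receive the full weight when disorders are missing.

```python
# Toy ontology: each term maps to its parents (hypothetical IDs).
PARENTS = {
    "T:root": [],
    "T:a": ["T:root"],
    "T:b": ["T:root"],
    "T:a1": ["T:a"],
    "T:a2": ["T:a"],
}
# Hypothetical information content per term
IC = {"T:root": 0.0, "T:a": 1.0, "T:b": 1.0, "T:a1": 2.0, "T:a2": 2.0}


def with_ancestors(terms):
    """Return the given terms together with all of their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS[term])
    return seen


def simgic(terms_a, terms_b):
    """IC of shared terms divided by IC of all terms, ancestors included."""
    a, b = with_ancestors(terms_a), with_ancestors(terms_b)
    union_ic = sum(IC[t] for t in a | b)
    if union_ic == 0:
        return 0.0
    return sum(IC[t] for t in a & b) / union_ic


def phenotype_score(feature_sim, disorder_sim=None, max_pheno_score=0.5):
    """Combine similarities in [0, 1] into a phenotype score.

    disorder_sim is None when disorders are missing for one or both
    patients; features are then assumed to carry the full weight.
    """
    if disorder_sim is None:
        return max_pheno_score * feature_sim
    return max_pheno_score * (feature_sim / 2 + disorder_sim / 2)


feature_sim = simgic({"T:a1"}, {"T:a2"})  # the patients share only T:a and T:root
features_only = phenotype_score(feature_sim)
features_and_disorders = phenotype_score(feature_sim, disorder_sim=1.0)
```

In this toy case the two patients share the ancestor T:a but differ in their specific terms, so the shared IC is 1.0 out of a total of 5.0.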
2.2 Email notifications
Email notifications can be enabled by administrators via dedicated
parameters in the software configuration file. To control the amount of
information included in the email notification body, and thereby limit
the extent of potentially sensitive information distributed via email,
two notification options are available: 1) complete notifications
containing the entire description of matching patients (including gene
names, variants and phenotypes), and 2) partial notifications containing
only patient IDs and the contact information of the patients'
clinicians. Email notifications are sent to the patient contact only in
case of positive matches, whether the request was triggered by the same
user, by another user within PatientMatcher (internal matches) or by an
external user from an MME-connected node (external matches).
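The difference between the two notification options can be sketched as follows. The function name, the structure of the match dictionary and the wording of the message are hypothetical and serve only to show how a partial notification withholds the genotype and phenotype details that a complete notification includes.

```python
def notification_body(match, complete=False):
    """Build the text of a match notification.

    match is a hypothetical dict with "patient_id" and "contact" keys and
    optional "genes", "variants" and "phenotypes" lists of strings.
    """
    # Both notification types carry the patient ID and clinician contact
    lines = [
        f"Patient ID: {match['patient_id']}",
        f"Contact: {match['contact']}",
    ]
    if complete:
        # Only complete notifications expose the patient description
        for key in ("genes", "variants", "phenotypes"):
            values = match.get(key, [])
            text = ", ".join(values) if values else "n/a"
            lines.append(f"{key.capitalize()}: {text}")
    return "\n".join(lines)


# Placeholder match used for illustration only
example_match = {
    "patient_id": "P1",
    "contact": "clinician@example.org",
    "genes": ["LAMA1"],
    "phenotypes": ["HP:0000001"],
}
partial_text = notification_body(example_match)
complete_text = notification_body(example_match, complete=True)
```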