Matching algorithm

When the server receives a matching request, either from an external node (external matching) or from a user wishing to match a specific patient against all other patients on the server (internal matching), the query triggers a matching algorithm that computes the similarity between the query patient and all patients stored in the database. As in other MME implementations, and in accordance with the MME API specification, patient similarity is expressed as a score between 0 (no matching features) and 1 (exact match of all the patient’s features). The maximum number of patients returned by the server can be customized by editing the “MAX_RESULTS” field in the app settings; its default value is 5. Patient matches are returned in order of descending similarity with the query patient (i.e., the most similar matches are presented first in the list of results).
Similarity score computation in PatientMatcher takes into account both genomic and phenotypic similarity between patients. These two factors are quantified as a GTScore and a PhenoScore, whose sum is the total similarity score (result score) between the query and the matched patient. The server administrator can customize the relative importance of GTScore and PhenoScore by modifying the parameters “MAX_GT_SCORE” and “MAX_PHENO_SCORE” in the app configuration settings. The default value of both parameters is 0.5, which gives phenotype and genotype similarity equal impact on the result score. This design addresses the diverse requirements of different data contributors. For example, a clinical laboratory might store patient genetic information with few diagnoses or phenotype terms available. In that case it makes sense to set the weight of phenotype matching to zero and rely on genotype matching only.
On the other hand, national regulations might not allow sharing of detailed genetic information, such as variant details, but only gene symbols. If detailed diagnoses are also available for these patients, using both GTScore and PhenoScore when running the similarity algorithm will increase the chances of producing meaningful matches.
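The way the two partial scores combine into a ranked result list can be illustrated with a minimal sketch. The constants carry the default values quoted in the text; the `Match` class, its field names and the `rank_matches` helper are hypothetical and are not taken from the PatientMatcher code base.

```python
from dataclasses import dataclass

# Default values quoted in the text; names mirror the settings fields.
MAX_RESULTS = 5
MAX_GT_SCORE = 0.5
MAX_PHENO_SCORE = 0.5


@dataclass
class Match:
    """Hypothetical container for one candidate match."""
    patient_id: str      # identifier of the matched patient
    gt_score: float      # genotype similarity, between 0 and MAX_GT_SCORE
    pheno_score: float   # phenotype similarity, between 0 and MAX_PHENO_SCORE

    @property
    def result_score(self) -> float:
        # The result score is the sum of the two partial scores.
        return self.gt_score + self.pheno_score


def rank_matches(matches, max_results=MAX_RESULTS):
    """Sort candidates by descending result score and keep the top ones."""
    ranked = sorted(matches, key=lambda m: m.result_score, reverse=True)
    return ranked[:max_results]


candidates = [
    Match("P1", gt_score=0.1, pheno_score=0.2),
    Match("P2", gt_score=0.5, pheno_score=0.25),
    Match("P3", gt_score=0.0, pheno_score=0.05),
]
for match in rank_matches(candidates):
    print(match.patient_id, round(match.result_score, 3))
```

With MAX_GT_SCORE and MAX_PHENO_SCORE both at 0.5 the result score cannot exceed 1; setting MAX_PHENO_SCORE to zero, as in the clinical laboratory scenario, leaves the ranking driven by the genotype score alone.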

2.1.1 Genotype Matching Algorithm

When the parameter MAX_GT_SCORE is set to a value higher than zero and the query data contains genotype features (gene or variant information), a genotype similarity score is computed between the query patient and every patient in the database (matched patient). All patients matching at least one of the candidate genes present in the query are initially selected as matches. As specified in the MME API, candidate genes should preferably be described by an Ensembl ID (e.g., “ENSG00000101680”), but it is also possible to search the database with genes represented by HGNC symbols (e.g., “LAMA1”) or Entrez IDs (e.g., “6481”). The algorithm is designed to assign higher matching scores to patients described by fewer genotype features. For instance, a query patient described by a single gene (A) that matches a database patient described by the same gene (A) will produce a higher genotype score than a query patient described by two genes (A and B) compared with that same database patient. Genotype score (GT_SCORE) is quantified by the formula:
GT_SCORE = MAX_GT_SCORE / ∑fs
This number is calculated by dividing the MAX_GT_SCORE by the sum of the feature scores (fs) measured from the matching of each genotype feature of the query patient against a matching patient. For example, according to this definition and assuming a MAX_GT_SCORE of 0.5, each gene of a patient described by 3 genes has a maximum fs of one third of 0.5 (approximately 0.1667). If a gene from the query patient does not match any gene of the matched patient, the fs for that feature is 0. In the event of an exact match of both gene and gene variant, the fs is assigned the highest possible value for the feature (0.1667). Incomplete gene matches (the gene matches but the variant does not, or no variant metadata is available for the provided genes) are assigned an arbitrary value of a quarter of the maximum fs for the feature (0.1667/4, approximately 0.0417). By calculating the GT_SCORE in this manner, the algorithm produces a graded numerical estimate of the similarity across all genotype features of the matching patients. This, in turn, allows the server to return patient hits sorted by descending genetic similarity with the query patient, rather than simply all patients that match any of its genes.
PatientMatcher can also evaluate and score matching variants that lie outside genes. Variant matches outside genes are assigned the same fs as exact (gene + variant) matches. The genotype matching algorithm also includes a liftover functionality, which makes it possible to quantify the similarity between patients whose genomic features are described in different genome builds.
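The worked example for three genes can be reproduced with a short sketch. It follows one reading of that example, which is an assumption: each query gene can contribute at most MAX_GT_SCORE divided by the number of query genes, an exact gene and variant match earns the full share, a gene-only match earns a quarter of it, and the contributions are summed. The function name and the dictionary layout are hypothetical, and variants outside genes and liftover are not covered.

```python
def gt_score(query_genes, matched_genes, max_gt_score=0.5):
    """Sum per-gene feature scores (fs) for one query/matched patient pair.

    Both arguments map a gene identifier to a set of variant strings
    (hypothetical layout chosen for this illustration).
    """
    if not query_genes or max_gt_score <= 0:
        return 0.0
    # Maximum fs available to each query gene.
    max_fs = max_gt_score / len(query_genes)
    total = 0.0
    for gene, variants in query_genes.items():
        if gene not in matched_genes:
            continue  # no gene match: fs = 0
        if variants & matched_genes[gene]:
            total += max_fs        # exact match: gene and variant
        else:
            total += max_fs / 4    # incomplete match: gene only
    return total


query = {"A": {"v1"}, "B": set(), "C": {"v9"}}
matched = {"A": {"v1"}, "B": {"v2"}}
# A matches exactly, B matches by gene only, C does not match.
print(round(gt_score(query, matched), 4))
```

Under the same assumptions, a query described only by gene A with variant v1 would receive the full 0.5 against this matched patient, which illustrates why patients described by fewer genotype features obtain higher scores.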

2.1.2 Phenotype Matching Algorithm

PatientMatcher calculates phenotype matching scores based on both patient features and disorders. Patient features are described by HPO terms (Köhler et al., 2014) provided for query and matched patients, while disorders are represented by Decipher (Firth et al., 2009), OMIM (Hamosh et al., 2000) or Orphanet (Pavan et al., 2017) entries. If both patients being compared contain features and disorders, both descriptors are taken into account and each contributes half of the resulting phenotype score (PHENO_SCORE). When disorders are not provided for one or both patients, only the similarity between HPO features is considered in the computation. While the comparison of disease terms in the algorithm is still relatively simple (it consists of a pairwise comparison of diagnoses between the patients), semantic similarity between HPO terms and their ancestor terms is calculated as a simGIC measure (Pesquita, 2007; Pesquita et al., 2008). The original algorithm used for creating the phenotype ontology and comparing the patients in PatientMatcher is available in the Patient-Similarity package (https://github.com/buske/patient-similarity). Since the HPO curates resources linking disease terms with their associated HPO entries, we envision that in future software releases disease similarity will also be calculated from semantic relationships between terms.
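As an illustration of the simGIC measure and of the half-and-half weighting described above, the following sketch computes the ratio between the summed information content (IC) of the terms shared by two patients and the summed IC of all their terms, ancestors included. The toy ontology, the IC values and the function names are assumptions made for this example; real IC values are derived from annotation frequencies, and the disorder similarity is simply passed in as a number because its exact computation is not detailed here.

```python
# Toy ontology: term -> direct parents (hypothetical terms, not real HPO IDs).
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},
    "A2": {"A"},
    "B1": {"B"},
}

# Assumed information content per term.
IC = {"root": 0.0, "A": 1.0, "B": 1.0, "A1": 2.0, "A2": 2.0, "B1": 2.0}


def with_ancestors(terms):
    """Return the given terms together with all of their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS[term])
    return seen


def sim_gic(terms_a, terms_b):
    """IC-weighted Jaccard index over the ancestor-expanded term sets."""
    a, b = with_ancestors(terms_a), with_ancestors(terms_b)
    union_ic = sum(IC[t] for t in a | b)
    if union_ic == 0:
        return 0.0
    return sum(IC[t] for t in a & b) / union_ic


def pheno_score(max_pheno_score, hpo_sim, disorder_sim=None):
    """Combine HPO and disorder similarity (both expected in [0, 1])."""
    if disorder_sim is None:
        # Disorders missing for one or both patients: HPO terms only.
        return max_pheno_score * hpo_sim
    # Features and disorders each contribute half of the score.
    return max_pheno_score * (hpo_sim + disorder_sim) / 2


hpo_sim = sim_gic({"A1"}, {"A2"})
print(round(hpo_sim, 3), round(pheno_score(0.5, hpo_sim), 3))
```

In this toy case the two patients share only the ancestor A and the root, so the shared IC is 1.0 out of a total of 5.0.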

2.2 Email notifications

Email notifications can be enabled by administrators via specific parameters in the software configuration file. To control the amount of information included in the notification body, and thereby limit the extent of potentially sensitive information distributed via email, there are two notification options: 1) complete notifications, containing the entire description of the matching patients (including gene names, variants and phenotypes), and 2) partial notifications, reporting only patient IDs and the contact information of each patient’s clinician. Email notifications are sent to the patient contact only in case of positive matches, whether the request was triggered by the same user, by another user within PatientMatcher (internal matches) or by an external user from a connected MME node (external matches).
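The difference between the two notification options can be sketched as follows. This is an illustration only: the `notification_body` function and its `complete` flag are hypothetical, and the patient dictionaries are assumed to loosely follow the layout of MME patient objects (`id`, `contact`, `features`, `genomicFeatures`); the actual email templates used by the software may differ.

```python
def notification_body(matches, complete=False):
    """Build a plain-text email body for a list of matching patients."""
    lines = []
    for patient in matches:
        contact = patient.get("contact", {})
        # Both notification types report the patient ID and the contact.
        lines.append(f"Patient ID: {patient['id']}")
        lines.append(
            f"Contact: {contact.get('name', 'n/a')} <{contact.get('href', 'n/a')}>"
        )
        if complete:
            # Only complete notifications expose genes, variants, phenotypes.
            genomic = patient.get("genomicFeatures", [])
            genes = [gf["gene"]["id"] for gf in genomic if "gene" in gf]
            variants = [
                f"{gf['variant']['referenceName']}:{gf['variant']['start']}"
                for gf in genomic
                if "variant" in gf
            ]
            phenotypes = [f["id"] for f in patient.get("features", [])]
            lines.append("Genes: " + (", ".join(genes) or "n/a"))
            lines.append("Variants: " + (", ".join(variants) or "n/a"))
            lines.append("Phenotypes: " + (", ".join(phenotypes) or "n/a"))
    return "\n".join(lines)


patient = {
    "id": "P1",
    "contact": {"name": "Dr. A", "href": "mailto:a@example.org"},
    "features": [{"id": "HP:0001250"}],
    "genomicFeatures": [
        {"gene": {"id": "LAMA1"}, "variant": {"referenceName": "18", "start": 7000000}}
    ],
}
print(notification_body([patient]))
print(notification_body([patient], complete=True))
```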