2. COMPUTATIONAL METHODS
HCM patients with at least one episode of AF (n=191), either prior to their first clinic visit (n=139) or during follow up (n=52), were considered AF cases, and the remaining patients who were in sinus rhythm (n=640) were labeled as No-AF (Supplementary Figure 1).
Supplementary Figure 2 summarizes the computational framework (HCM-AF-Risk Model) that we introduce for identifying HCM patients with AF. It comprises five steps: 1) preprocessing to remove variables directly correlated with AF and to address missing data; 2) feature selection, in which informative, predictive clinical variables that distinguish AF cases from No-AF are identified; 3) association analysis to quantify the degree of association between each predictor variable and the AF class; 4) supervised machine learning for building and training classifiers and performing classification; and 5) thorough quantitative and qualitative evaluation to assess the classifier’s performance.
2.1 Preprocessing: We first removed variables that had no relevance to risk of AF (e.g., visit date, patient ID), as well as variables directly indicative of adverse outcomes (e.g., ventricular arrhythmia, heart failure, AF). The feature set remaining at the end of this step consisted of 93 clinical variables (Supplementary Table 1). As some of the records did not include values for all variables, data imputation was performed using a nearest neighbor approach (see Supplementary file, Section B.1.1 for details).
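As an illustration of nearest-neighbor imputation, the following minimal sketch uses scikit-learn's `KNNImputer`. The toy values and the choice `n_neighbors=2` are assumptions made for the example; the settings actually used in the study are those described in the Supplementary file, not the ones shown here.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy records (made-up values): rows are patients, columns are clinical variables.
X = np.array([
    [55.0, 1.8, np.nan],
    [60.0, np.nan, 22.0],
    [47.0, 1.5, 18.0],
    [52.0, 1.7, 20.0],
])

# n_neighbors=2 is an illustrative choice, not the study's setting.
imputer = KNNImputer(n_neighbors=2)

# Each missing entry is replaced by the mean of that variable over the
# nearest rows (by NaN-aware Euclidean distance) that have it observed.
X_imputed = imputer.fit_transform(X)
```

Observed entries are left unchanged; only the missing entries are filled in.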
2.2 Feature selection: When high-dimensional data are used for classifier training, the classifiers often perform poorly on the test set because the model overfits the specific training set. That is, certain features may show discriminating power within a limited dataset but not generalize beyond that small training set. Moreover, many of the features are not informative for distinguishing among the different classes (in this case AF vs No-AF records). We thus applied feature selection to identify the attributes most informative for AF, while reducing data dimensionality to avoid overfitting. We note that our dataset comprises both nominal and continuous attributes, also referred to as features. While the classification method presented here is multivariate, the feature selection is performed by assessing individual variables one at a time. Selection of highly predictive nominal attributes relies on the well-known Information Gain criterion[24], which measures the information gained about the AF class given the value assumed by the attribute. For continuous features, we used the two-sample t-test under unequal variance[25, 26], testing whether the distribution of attribute values associated with AF cases is significantly different from that associated with No-AF cases. The resulting reduced feature set contains only those continuous attributes for which the t-test indicated a highly statistically significant distributional difference (p ≤ 0.01), and those nominal attributes for which the information gain value was greater than 0.002. The threshold value was determined through an iterative process in which the least informative feature is removed and left out of the classification procedure at each iteration. Feature removal was repeated in this way until the classification performance deteriorated, at which point all the remaining features were retained.
This feature selection process resulted in 18 clinical variables deemed to be informative and predictive of AF in HCM patients (Table 1).
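The two univariate criteria described above can be sketched as follows. This is a minimal illustration, not the study's code: the function names are hypothetical, information gain is computed here in bits (log base 2), which is an assumption since the text does not state the logarithm base, and the default thresholds are the ones quoted in the text.

```python
import numpy as np
from scipy.stats import ttest_ind

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels):
    """H(labels) - H(labels | feature) for a nominal feature."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    conditional = 0.0
    for value in np.unique(feature):
        mask = feature == value
        # Weight each subgroup's entropy by the fraction of records in it.
        conditional += mask.mean() * entropy(labels[mask])
    return entropy(labels) - conditional

def keep_nominal(feature, labels, threshold=0.002):
    """Retain a nominal attribute if its information gain exceeds the threshold."""
    return information_gain(feature, labels) > threshold

def keep_continuous(values, labels, alpha=0.01):
    """Retain a continuous attribute if Welch's t-test (unequal variance) gives p <= alpha."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    _, p_value = ttest_ind(values[labels == 1], values[labels == 0], equal_var=False)
    return bool(p_value <= alpha)
```

Here labels are assumed to be coded 1 for AF and 0 for No-AF. Each attribute is scored independently of the others, which matches the one-variable-at-a-time selection described in the text.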
2.3 Association Analysis: Many of the attributes gathered per patient are nominal, as opposed to continuous-numerical (Supplementary Table 1). Nominal features include variables such as HCM type or history of syncope. Association among nominal variables cannot be calculated using the standard Pearson correlation. Thus, to express the degree and direction of association between the predictor variables and the outcome variable, we employ the polychoric correlation[27, 28], which takes on values in the range [-1, 1], where a negative value indicates negative association and a positive value corresponds to positive association.
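For intuition, the sketch below estimates the tetrachoric correlation, which is the special case of the polychoric correlation for two binary variables (for example, a binary predictor against AF/No-AF). It is a simple two-step estimate under stated assumptions, not the procedure used in the study: thresholds are taken from the marginal proportions, and the correlation of the assumed underlying bivariate normal is then solved so that the model reproduces the observed proportion in the (0, 0) cell. The function names are hypothetical, and the table is assumed to have no empty cells.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm

def bvn_cdf(a, b, rho):
    """P(Z1 <= a, Z2 <= b) for a standard bivariate normal with correlation rho.

    Uses the identity that the derivative of this probability with respect to
    rho equals the bivariate normal density at (a, b), integrated from 0 to rho.
    """
    def density(r):
        q = (a * a - 2.0 * r * a * b + b * b) / (2.0 * (1.0 - r * r))
        return np.exp(-q) / (2.0 * np.pi * np.sqrt(1.0 - r * r))
    integral, _ = quad(density, 0.0, rho)
    return norm.cdf(a) * norm.cdf(b) + integral

def tetrachoric(table):
    """Tetrachoric correlation for a 2x2 table of counts.

    Rows index the first variable (0, 1); columns index the second (0, 1).
    """
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    a = norm.ppf(p[0, :].sum())  # threshold matching P(first variable = 0)
    b = norm.ppf(p[:, 0].sum())  # threshold matching P(second variable = 0)
    # Find rho such that the model probability of the (0, 0) cell matches the data.
    return brentq(lambda r: bvn_cdf(a, b, r) - p[0, 0], -0.999, 0.999, xtol=1e-8)
```

A table in which the two variables mostly agree yields a positive value, and one in which they mostly disagree yields a negative value. Variables with more than two ordered categories require the general polychoric estimator, which this sketch does not cover.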
2.4 Classification: Our classifier takes as input a vector of values representing a patient’s record, and assigns a probability that indicates the patient’s likelihood of belonging to the AF vs the No-AF class. Each patient in our dataset is represented using the 18 distinguishing features identified in the feature-selection step. Specifically, each of the 831 patients, denoted \(p^{i}\) \((1 \le i \le 831)\), is mapped to an 18-dimensional vector, \(V^{i} = \langle p_{1}^{i},\ldots,p_{18}^{i} \rangle\), where each entry in the vector corresponds to the clinical value recorded for the respective variable. For each 18-dimensional vector \(V^{i}\) representing the \(i\)th patient, the classifier calculates its probability of being an AF case, \(\Pr(\text{AF} \mid V^{i})\), vs its probability of being No-AF, \(\Pr(\text{No-AF} \mid V^{i}) = 1 - \Pr(\text{AF} \mid V^{i})\). The higher the value of \(\Pr(\text{AF} \mid V^{i})\), the more likely the patient is to have AF. As an illustration, a probability of 0.9 of being an AF case assigned to patient \(p^{i}\) indicates a high risk for AF, while a probability of 0.3 suggests that the risk for AF is much lower. We note that both the representation of the patients based on readily interpretable clinical values and the classification decision that corresponds to assigning a severity-probability are unique to this work. This stands in contrast to most recently published work in machine learning within the clinical domain[29-31], where a complex model architecture based on artificial neural networks is used, typically acting as a ‘black box’ that categorizes the patient without providing a way to trace the justification or explanation.
To compare candidate machine learning classifiers, we tested four standard (yet probabilistic and explainable) classification methods, namely Logistic Regression, Naïve Bayes, Decision Tree and Random Forest, assessing how well they separate AF cases from No-AF (Supplementary Table 2). We used the Python scikit-learn package for training the baseline classifiers[32]. All four classifiers performed poorly when trained on our highly imbalanced dataset, failing to detect almost any AF records (Table 2). We therefore employed a method we have devised for addressing imbalance[14], which combines over- and under-sampling with an ensemble classifier that integrates the most effective classifiers to separate AF records from No-AF records. The over- and under-sampling strategy is based on partitioning the training data, as shown in Supplementary Figure 3. For a more detailed description of the classification model and its testing, see Supplementary Data Section B1.2.
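The general idea of rebalancing followed by a probabilistic ensemble can be sketched as below. This is a simplified stand-in under explicit assumptions, not the partition-based procedure of Supplementary Figure 3: the hypothetical `rebalance` helper under-samples the majority class at random and over-samples the minority class by duplication, the target of 400 records per class is arbitrary, the data are synthetic (only the class sizes 640 and 191 and the 18 features come from the text), and the two classifiers are combined with scikit-learn's soft-voting `VotingClassifier`.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

def rebalance(X, y, n_per_class, rng):
    """Return a training set with exactly n_per_class rows of each class."""
    chosen = []
    for label in np.unique(y):
        members = np.flatnonzero(y == label)
        # Large classes are under-sampled (no replacement); small classes are
        # over-sampled by drawing with replacement (plain duplication).
        replace = len(members) < n_per_class
        chosen.append(rng.choice(members, size=n_per_class, replace=replace))
    idx = np.concatenate(chosen)
    return X[idx], y[idx]

rng = np.random.default_rng(0)
# Synthetic stand-in data: 640 No-AF (label 0) and 191 AF (label 1) records.
X = np.vstack([rng.normal(0.0, 1.0, (640, 18)), rng.normal(0.7, 1.0, (191, 18))])
y = np.array([0] * 640 + [1] * 191)

X_bal, y_bal = rebalance(X, y, n_per_class=400, rng=rng)

# Soft voting averages the class probabilities predicted by the two models.
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)), ("nb", GaussianNB())],
    voting="soft",
)
ensemble.fit(X_bal, y_bal)

proba = ensemble.predict_proba(X)  # columns follow ensemble.classes_: [0, 1]
pr_af = proba[:, 1]                # estimated Pr(AF | V) for each record
```

In practice the rebalancing would be applied to training records only, so that held-out records keep the original class proportions.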
2.5 Model Evaluation: We employed several standard measures[24] to assess the performance of our HCM-AF-Risk Model, namely specificity, sensitivity (recall) and the area under the receiver operating characteristic (ROC) curve. The first two are defined as: Specificity = \(\frac{\text{TN}}{\text{TN}+\text{FP}}\), Sensitivity = \(\frac{\text{TP}}{\text{TP}+\text{FN}}\), where TP (true positives) denotes AF records that are correctly labeled as AF by the classifier; TN (true negatives) denotes records that are not associated with AF and are not assigned to this class by the classifier; FP (false positives) denotes records not associated with AF that are misclassified by the classifier as AF; and FN (false negatives) denotes AF records that were incorrectly labeled by the classifier as No-AF. The ROC curve plots the true positive rate (TPR), calculated as \(\frac{\text{TP}}{\text{TP}+\text{FN}}\), as a function of the false positive rate (FPR), calculated as \(\frac{\text{FP}}{\text{FP}+\text{TN}}\). The classifier performance is then reported based on the area under the ROC curve (C-index).
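A small worked example of these measures, using scikit-learn, is given below. The labels and scores are made up, and the 0.5 cutoff used to turn probabilities into class labels is an assumption for the illustration; the text does not state a decision threshold.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Made-up example: 1 = AF, 0 = No-AF; scores play the role of Pr(AF | V).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.6, 0.3, 0.4, 0.2, 0.1, 0.7, 0.05])

# Assumed cutoff of 0.5 for assigning the AF label.
y_pred = (scores >= 0.5).astype(int)

# With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)   # TP / (TP + FN) = 2 / 3
specificity = tn / (tn + fp)   # TN / (TN + FP) = 4 / 5

# Area under the ROC curve is computed from the scores, not the thresholded labels.
auc = roc_auc_score(y_true, scores)  # 12 of 15 AF/No-AF pairs ranked correctly = 0.8
```

Sensitivity and specificity depend on the chosen cutoff, whereas the area under the ROC curve summarizes performance across all cutoffs.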
2.6 Experiments: We first employed univariate feature selection using information gain and the two-sample t-test for unequal variance, to identify salient features that are associated with AF. We also employed polychoric correlation to determine the association between the pertinent features identified by the feature selection method and the outcome variable (AF/No-AF). We represented patients using the reduced set of features and trained simple classifiers as baseline, as described earlier. To address the data imbalance challenge, we applied the combination of under- and over-sampling to obtain a balanced training set, which was then used to train the four simple classifiers and the ensemble classifier comprising logistic regression and naïve Bayes. Each of the classifiers was trained and tested using five-fold cross-validation, in which the data are partitioned into five subsets of approximately equal size and five iterations of learning are performed, each of which uses 80% of the data (four of the subsets) for training while the fifth (left-out) subset is used for testing. After each training iteration, we evaluated the performance of the resulting classifier on the imbalanced test set in terms of specificity, sensitivity and the area under the ROC curve. We ran 10 complete sets of five-fold cross-validation experiments, for a total of 50 runs; the performance reported is averaged over these 50 runs.
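The evaluation protocol of 10 repetitions of five-fold cross-validation (50 runs in total) can be sketched as follows. Several details are assumptions made for the illustration: the data are synthetic, a single logistic regression stands in for the classifiers under study, and the folds are stratified by class through scikit-learn's `RepeatedStratifiedKFold`, whereas the text does not say whether stratification was used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(0)
# Synthetic stand-in data: 640 No-AF (label 0) and 191 AF (label 1) records.
X = np.vstack([rng.normal(0.0, 1.0, (640, 18)), rng.normal(0.7, 1.0, (191, 18))])
y = np.array([0] * 640 + [1] * 191)

# 5 folds x 10 repetitions = 50 train/test runs.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

aucs = []
for train_idx, test_idx in cv.split(X, y):
    # Any rebalancing step belongs here and should touch the training rows only,
    # so that each test fold keeps the original, imbalanced class proportions.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))

# The reported figure is the average over all 50 runs.
mean_auc = float(np.mean(aucs))
```

Specificity and sensitivity would be accumulated and averaged over the 50 runs in the same way.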
To address data imbalance, we first applied methods that were previously reported in the literature, such as simple over-sampling, simple under-sampling, the Adaptive Synthetic Sampling Approach (ADASYN)[33] and meta-classification[34], and found that the performance of our combined under- and over-sampling using SMOTE was superior. Hence, we report only the results obtained using our method (HCM-AF-Risk Model), which combines under- and over-sampling, and compare this method against the baseline classifiers.
2.7 Comparisons: We also evaluated the performance attained by our model when the patients in the dataset were represented using three alternative feature sets (Table 3, top three rows). Those feature sets correspond to sets of attributes that were reported as predictive in three seminal AF risk identification studies, namely the Framingham Heart Study[8], ARIC[10] and the CHARGE-AF Consortium[9].