2. COMPUTATIONAL METHODS
HCM patients with at least one episode of AF (n=191), either prior to
their first clinic visit (n=139) or during follow up (n=52), were
considered AF cases, and the remaining patients who were in sinus rhythm
(n=640) were labeled as No-AF (Supplementary Figure 1).
Supplementary Figure 2 summarizes the computational framework
(HCM-AF-Risk Model) that we introduce for identifying HCM
patients with AF. It comprises 5 steps: 1) preprocessing to remove
variables directly correlated with AF, and to address missing data; 2)
feature selection, in which informative, predictive clinical variables
that distinguish AF cases from No-AF are identified; 3) association
analysis to quantify the degree of association between each predictor
variable and the AF class; 4) supervised machine learning for building
and training classifiers and performing classification; and 5) thorough
quantitative and qualitative evaluation to assess the classifier’s
performance.
2.1 Preprocessing: We first removed variables that had no
relevance to risk of AF (e.g. visit date, patient ID), as well as
variables directly indicative of adverse outcomes (e.g. ventricular
arrhythmia, heart failure, AF). The feature set remaining at the end of
this step consisted of 93 clinical variables (Supplementary Table
1). As some of the records did not include values for all variables,
data imputation was performed using a nearest neighbor approach (see
Supplementary file, Section B.1.1 for details).
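
As an illustration of nearest-neighbor imputation, the following minimal sketch fills a missing value from the most similar complete record. It uses scikit-learn's `KNNImputer`; the toy records and the choice of `n_neighbors=1` are assumptions made for this example only, and the procedure actually used is the one described in the Supplementary file.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy records (rows = patients, columns = clinical variables);
# np.nan marks a missing value. These numbers are illustrative only.
X = np.array([
    [1.0, 2.0, np.nan],
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],
    [1.1, 2.1, 3.2],
])

# n_neighbors=1 is an assumption for this sketch: each missing entry is
# copied from the closest record that has the variable observed. Distances
# are computed over the variables both records have in common.
imputer = KNNImputer(n_neighbors=1)
X_filled = imputer.fit_transform(X)
```

Observed values are left untouched; only the missing entries are replaced.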
2.2 Feature selection: When high-dimensional data is used for
classifier training, the classifiers often exhibit low performance on
the test set due to overfitting of the model to the specific training
set. That is, certain features may show discriminating power within a
limited dataset but not generalize beyond that small training set.
Moreover, many of the features are not informative for distinguishing
among the different classes (in this case AF vs No-AF records).
We thus apply feature selection aiming to identify attributes most
informative for AF, while reducing data dimensionality to avoid
overfitting. We note that our dataset comprises both nominal and
continuous attributes, also referred to as features. While the
classification method presented here is multivariate, the feature
selection is performed by assessing individual variables one at a time.
Selection of highly predictive nominal attributes relies on the
well-known Information Gain criterion [24] which measures the
information gained about the AF-class given the value assumed by the
attribute. For continuous features, we used the two-sample t-test for
unequal variance[25, 26], testing whether the distribution of
attribute values associated with AF cases is significantly different
from that associated with No-AF cases. The resulting reduced feature set
employed here contains only those continuous attributes for which the
t-test indicated a highly statistically significant distributional
difference (p ≤ 0.01), and those nominal attributes for which the
information gain value was greater than 0.002. The threshold
value was determined through an iterative process in which the least
informative feature is removed and left out of the classification
procedure at each iteration. Feature-removal was repeated in this way
until the classification performance indicated deterioration, at which
point all the remaining features were retained. This feature selection
process resulted in 18 clinical variables deemed to be informative and
predictive of AF in HCM patients (Table 1).
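
The two univariate criteria described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in this study; the function names are hypothetical, and the use of base-2 logarithms for the entropy is an assumption. The t-test part computes only the Welch statistic and its degrees of freedom; in practice a p-value can be obtained with `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
from collections import Counter
from statistics import mean, variance


def entropy(labels):
    """Shannon entropy (base 2, an assumption) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the class minus its expected entropy given the attribute."""
    n = len(labels)
    conditional = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def welch_t(a, b):
    """Two-sample t statistic and degrees of freedom for unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Toy example: the first attribute separates the classes, the second does not.
labels = [1, 1, 0, 0]
ig_informative = information_gain(["a", "a", "b", "b"], labels)
ig_uninformative = information_gain(["a", "b", "a", "b"], labels)
t_stat, dof = welch_t([1, 2, 3, 4], [2, 4, 6, 8])
```

A nominal attribute would be retained when its information gain exceeds the chosen threshold, and a continuous attribute when the p-value associated with the Welch statistic falls at or below the significance level.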
2.3 Association Analysis: Many of the attributes gathered per
patient are nominal, as opposed to continuous-numerical (Supplementary
Table 1). Nominal features include variables such as HCM type or
history of syncope. Association among nominal variables cannot be
calculated using the standard Pearson correlation. Thus, to express the
degree and direction of association between the predictor variables and
the outcome variable, we employ the polychoric correlation[27, 28],
which takes on values in the range [-1, 1], where a
negative value indicates negative association and a positive value
corresponds to positive association.
2.4 Classification: Our classifier operates by taking as input
a vector of values representing a patient’s record, and assigning a
probability that indicates the patient’s likelihood to belong to the AF
vs No-AF class. Each patient in our dataset is represented using the 18
distinguishing features identified in the feature-selection step.
Specifically, each of the 831 patients, denoted \(p^{i}\) (\(1 \leq i \leq 831\)), is mapped to an 18-dimensional vector, \(V^{i}=\langle p_{1}^{i},\ldots,p_{18}^{i}\rangle\), where each entry in the
vector corresponds to the clinical value recorded for the respective
variable. For each 18-dimensional vector \(V^{i}\) representing the ith patient, the classifier calculates the
probability that the patient is an AF case, \(\Pr(\text{AF} \mid V^{i})\), versus the
probability that the patient is No-AF, \(\Pr(\text{No-AF} \mid V^{i}) = 1 - \Pr(\text{AF} \mid V^{i})\).
The higher the value of \(\Pr(\text{AF} \mid V^{i})\), the more likely the patient is to have AF. As an
illustration, a probability of 0.9 to be an AF case assigned to
patient p indicates a high risk for AF, while a probability of 0.3 suggests that the risk for AF is much lower. We note that
both the representation of the patients based on readily interpretable
clinical values and the classification decision that corresponds to
assigning a severity-probability are unique to this work. This stands in
contrast to most recently published work in machine learning within the
clinical domain[29-31] where a complex model architecture based on
artificial neural networks is used, typically acting as a ‘black-box’
that provides the categorization of the patient without the ability to
track down the justification or explanation.
To compare the performance of candidate
machine learning classifiers, we tested four standard (yet probabilistic
and explainable) classification methods, namely,
Logistic Regression, Naïve Bayes, Decision Tree and Random Forest,
assessing how well they separate AF cases from No-AF (Supplementary
Table 2). We used the Python scikit-learn package for training the
baseline classifiers[32]. All four classifiers performed poorly when
trained on our highly imbalanced dataset, failing to detect almost any
AF records (Table 2). We therefore employed a method that we devised for
addressing imbalance[14], which combines over- and under-sampling
with an ensemble classifier that integrates the most effective
classifiers to separate AF records from No-AF records. The over- and
under-sampling strategy is based on partitioning the training data, as
shown in Supplementary Figure 3. For a more detailed description of the
classification model and its testing, see Supplementary Data
Section B1.2.
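
To make the general idea concrete, the sketch below rebalances a training set and fits an ensemble of logistic regression and naïve Bayes with scikit-learn. It rests on several assumptions that are not taken from this study: the data are synthetic, the hypothetical `rebalance` helper uses plain random under- and over-sampling as a stand-in for the partition-based strategy of Supplementary Figure 3, the per-class target of 400 records is arbitrary, and soft voting (averaging predicted probabilities) is only one possible way to combine the two classifiers.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): 640 No-AF records (label 0) and
# 191 AF records (label 1), each described by 18 numeric features.
X = np.vstack([rng.normal(0.0, 1.0, (640, 18)), rng.normal(0.7, 1.0, (191, 18))])
y = np.array([0] * 640 + [1] * 191)


def rebalance(X, y, n_per_class, rng):
    """Randomly under-sample larger classes and over-sample smaller ones."""
    parts = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        # Draw with replacement only when the class is smaller than the target.
        parts.append(rng.choice(idx, size=n_per_class, replace=len(idx) < n_per_class))
    keep = np.concatenate(parts)
    return X[keep], y[keep]


X_bal, y_bal = rebalance(X, y, n_per_class=400, rng=rng)

# Soft voting averages the class probabilities of the two base classifiers.
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)), ("nb", GaussianNB())],
    voting="soft",
)
ensemble.fit(X_bal, y_bal)

# Column 1 holds Pr(AF | V) because the class labels are ordered [0, 1].
proba_af = ensemble.predict_proba(X)[:, 1]
```

In a real evaluation, only the training portion of each split would be rebalanced, so that the test records keep their original class proportions.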
2.5 Model Evaluation: We employed several standard
measures[24] to assess the performance of our HCM-AF-Risk
Model, namely, specificity, sensitivity (recall) and the area under the
receiver operating characteristic (ROC) curve. The first two are
defined as: Specificity = \(\frac{\text{TN}}{\text{TN}+\text{FP}}\), Sensitivity = \(\frac{\text{TP}}{\text{TP}+\text{FN}}\), where TP (true
positives) denotes AF records that are correctly labeled as AF by the
classifier; TN (true negatives) denotes records that are not associated
with AF and are not assigned to this class by the classifier; FP (false
positives) denotes records not associated with AF that are misclassified
by the classifier as AF; FN (false negatives) denotes AF records that
were incorrectly labeled by the classifier as No-AF. The ROC curve plots
the true positive rate (TPR), calculated as \(\frac{\text{TP}}{\text{TP}+\text{FN}}\), as a function of the false
positive rate (FPR), calculated as \(\frac{\text{FP}}{\text{FP}+\text{TN}}\). The
classifier performance is then reported based on the area under the ROC
curve (C-index).
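
The measures defined above can be computed as in the following minimal sketch. The function names, the toy labels and scores, and the 0.5 decision threshold are assumptions made for this illustration. The area under the ROC curve is computed through its equivalent rank formulation: the fraction of (AF, No-AF) pairs in which the AF record receives the higher score, with ties counted as one half.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN), with 1 = AF and 0 = No-AF."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(y_true, y_pred):
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def specificity(y_true, y_pred):
    _, tn, fp, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: predicted probabilities of AF for six records.
y_true = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.6, 0.4, 0.7, 0.2, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

sens = sensitivity(y_true, y_pred)
spec = specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Unlike sensitivity and specificity, the area under the ROC curve does not depend on the decision threshold, since it is computed from the scores directly.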
2.6 Experiments: We first employed univariate feature selection
using information gain and the
two-sample t-test for unequal variance, to identify salient features
that are associated with AF. We also employed polychoric correlation to
determine the association between the pertinent features identified by
the feature selection method and the outcome variable (AF/No-AF). We
represented patients using the reduced set of features and trained
simple classifiers as baseline, as described earlier. To address the
data imbalance challenge, we applied the combination of under- and
over-sampling to obtain a balanced training set, which was then used to
train the four simple classifiers, and the ensemble classifier
comprising logistic regression and naïve Bayes. Each of the classifiers
was trained and tested using five-fold cross-validation, in which the
data is partitioned into five approximately equal subsets and five iterations of
learning are performed, each of which uses 80% (4 of the subsets) for
training while the fifth (left-out) subset is used for testing. After
each training iteration, we evaluated the performance of the resulting
classifier on the imbalanced test set in terms of specificity,
sensitivity and the area under the ROC curve. We ran 10 complete sets of
five-fold cross-validation experiments, for a total of 50 runs; the
performance reported is averaged over these 50 runs.
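
The repeated cross-validation protocol can be sketched as below. This is an illustrative index generator written with the standard library only; the function name and the random seed are hypothetical, and the sketch shuffles records without stratifying by class, which is an assumption because the text above does not specify the partitioning details.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) for n_repeats reshuffled rounds of n_splits folds."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        # Slice the shuffled indices into n_splits folds of nearly equal size.
        folds = [idx[k::n_splits] for k in range(n_splits)]
        for k in range(n_splits):
            test = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test


# 10 repetitions of 5 folds over 831 records give 50 train/test runs.
runs = list(repeated_kfold(831))
fold_sizes = sorted(len(test) for _, test in runs[:5])
```

Each run would train a classifier on the training indices, score it on the held-out indices, and the reported value for a measure would be the mean of the 50 resulting scores.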
To address data imbalance, we first applied methods that were previously
reported in the literature, such as simple over-sampling, simple
under-sampling, the Adaptive Synthetic Sampling Approach (ADASYN)[33]
and meta-classification[34], and found that the performance of our
combined under- and over-sampling using SMOTE was superior. Hence, we
report only the results obtained using our method (HCM-AF-Risk
Model), which combines under- and over-sampling, and compare this method
against the baseline classifiers.
2.7 Comparisons: We also evaluated the performance attained by
our model when trained on the dataset represented using each of
three additional feature sets (Table 3, top three rows). Those feature
sets correspond to sets of attributes that were reported as predictive
in three seminal AF risk identification studies, namely the Framingham
Heart Study[8], ARIC[10] and the CHARGE-AF Consortium[9].