Supplementary Figure 1. Flow chart illustrating the identification
of AF and No-AF cases in the HCM cohort.
Supplementary Figure 2. HCM-AF-Risk Model Schematic:
The overall framework for identification of atrial fibrillation cases
using clinical attributes within electronic health records of HCM
patients (HCM-AF-Risk Model). In the data preprocessing step,
variables known to be non-informative with respect to AF, and variables
associated with adverse outcomes (e.g. heart failure, ventricular
arrhythmia, stroke) are removed. The feature selection step identifies
the most informative clinical variables for separating AF cases from
No-AF cases. Next, the degree of association between each predictor
variable and the AF class is identified via association analysis.
Supervised machine learning is then used to build classifiers and
perform classification. Last, a thorough evaluation, both qualitative
and quantitative, is performed to assess the classifier’s performance.
Supplementary Figure 3. Methods for addressing data imbalance:
An illustration of our classification scheme for combining over- and
under-sampling. The topmost layer represents the entire training set,
which comprises a majority of No-AF records (shown on the left) and the
minority of AF records (shown on the right). The majority class in the
training set (No-AF) is randomly under-sampled such that the No-AF to AF
record ratio is 2:1. The minority class (AF) is over-sampled using
SMOTE to generate new synthetic AF-like records, doubling the
number of AF records. The resulting set forms a balanced training set,
containing the same number of AF and No-AF records.
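
The arithmetic of the scheme in Supplementary Figure 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`undersample`, `smote_like`, `balance`), the neighbour count `k=5`, the toy two-feature records, and the simplified SMOTE-style interpolation are all hypothetical choices made here for clarity. A real analysis would more likely rely on an established SMOTE implementation.

```python
import random


def undersample(majority, n_keep, rng):
    """Randomly keep n_keep majority-class records (without replacement)."""
    return rng.sample(majority, n_keep)


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic records, SMOTE-style: pick a minority record,
    pick one of its k nearest minority neighbours, and interpolate."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours by squared Euclidean distance
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


def balance(no_af, af, k=5, seed=0):
    """Under-sample No-AF to a 2:1 ratio, then double AF with synthetic records."""
    rng = random.Random(seed)
    no_af_kept = undersample(no_af, 2 * len(af), rng)  # No-AF : AF = 2 : 1
    af_all = af + smote_like(af, len(af), k, rng)      # AF count doubles
    return no_af_kept, af_all


# Toy data (assumed): 100 No-AF and 10 AF records with two numeric features.
data_rng = random.Random(42)
no_af = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(100)]
af = [(data_rng.gauss(2, 1), data_rng.gauss(2, 1)) for _ in range(10)]

no_af_bal, af_bal = balance(no_af, af)
print(len(no_af_bal), len(af_bal))  # 20 20
```

With 10 AF records, under-sampling keeps 20 No-AF records (2:1), and doubling the AF class yields 20 AF records, giving the 1:1 balanced training set described in the caption. The sketch assumes at least two AF records and at least twice as many No-AF as AF records.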