Supplementary Figure 1. Flow chart illustrating the identification of AF and No-AF cases in the HCM cohort.
Supplementary Figure 2. HCM-AF-Risk Model Schematic: The overall framework for identifying atrial fibrillation cases using clinical attributes within the electronic health records of HCM patients (HCM-AF-Risk Model). In the data preprocessing step, variables known to be non-informative with respect to AF, and variables associated with adverse outcomes (e.g. heart failure, ventricular arrhythmia, stroke), are removed. The feature selection step identifies the most informative clinical variables for separating AF cases from No-AF cases. Next, the degree of association between each predictor variable and the AF class is quantified via association analysis. Supervised machine learning is then used to build classifiers and perform classification. Finally, the classifier’s performance is evaluated both qualitatively and quantitatively.
Supplementary Figure 3. Methods for addressing data imbalance: An illustration of our scheme for combining over- and under-sampling. The topmost layer represents the entire training set, which comprises a majority of No-AF records (shown on the left) and a minority of AF records (shown on the right). The majority class in the training set (No-AF) is randomly under-sampled such that the No-AF to AF record ratio is 2:1. The minority class (AF) is then over-sampled using SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic AF-like records, doubling the number of AF records. The resulting set is a balanced training set containing the same number of AF and No-AF records.
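
The resampling scheme in Supplementary Figure 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: the function names (`undersample`, `smote_like`, `balance`), the neighbour count `k=5`, the random seed, and the two-dimensional toy data are all hypothetical, and `smote_like` is a simplified stand-in for SMOTE that interpolates between a minority record and one of its nearest minority-class neighbours. It assumes numeric features, at least two AF records, and at least twice as many No-AF as AF records.

```python
import math
import random


def undersample(majority, n_keep, rng):
    """Randomly keep n_keep majority-class records (without replacement)."""
    return rng.sample(majority, n_keep)


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic records, each interpolated between a minority
    record and one of its k nearest minority-class neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` within the minority class, excluding itself.
        others = [r for r in minority if r is not base]
        neighbours = sorted(others, key=lambda r: math.dist(base, r))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


def balance(no_af, af, k=5, seed=0):
    """Under-sample No-AF to a 2:1 ratio, then double AF with synthetic records."""
    rng = random.Random(seed)
    no_af_kept = undersample(no_af, 2 * len(af), rng)  # No-AF : AF = 2 : 1
    af_all = af + smote_like(af, len(af), k, rng)      # doubles the AF count
    return no_af_kept, af_all


# Hypothetical toy data: 100 No-AF and 10 AF records with two numeric features.
toy_rng = random.Random(1)
no_af = [(toy_rng.gauss(0, 1), toy_rng.gauss(0, 1)) for _ in range(100)]
af = [(toy_rng.gauss(2, 1), toy_rng.gauss(2, 1)) for _ in range(10)]

no_af_bal, af_bal = balance(no_af, af)
print(len(no_af_bal), len(af_bal))  # 20 20
```

With 10 AF records, under-sampling keeps 20 No-AF records (2:1), and over-sampling adds 10 synthetic AF records, giving 20 records per class. As in the figure, resampling is applied to the training set only.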