UMAP label prediction performance
Model evaluation scores were above 0.7 for all of the WMD trials (Table 1), although results varied with the specific label. The best classification results were obtained for Balaenopteridae species (F1 = 0.998; balanced accuracy = 0.987), while the classifier built for Delphinidae species had the lowest performance (F1 = 0.829; balanced accuracy = 0.703). Classification accuracy varied across trials and across labels within each trial. For example, in the first trial, most Mysticete and Odontocete samples were correctly labelled, while 59% of the Pinniped samples were mislabelled. In the second trial, 99%, 74%, and 71% of the Balaenopteridae, Eschrichtiidae, and Balaenidae samples, respectively, were correctly classified. Of the four Odontocete families, 99%, 90%, and 78% of the Physeteridae, Delphinidae, and Phocoenidae samples, respectively, were correctly classified, whereas only 56% of the testing samples for the family Monodontidae were classified correctly.
Table 1. k-fold nested cross-validation input and results. The table reports model features (X), labels (Y), and evaluation metrics (F1 score, Balanced Accuracy score). Best models, model hyperparameters, and scores per run can be found in appendix S1.
All three Balaenoptera species considered in the study were correctly classified in the vast majority of cases, each with at least 98% correct predictions. Eight of the 14 Delphinidae species had 80% or more correct label predictions. Of the four labels tested for orcas, three had between 87% (WN Atlantic) and 92% (EN Atlantic) correct predictions, while only 33% of the EN Pacific labels were predicted correctly. Both model performance metrics reflected such imbalances among classes, with lower scores for models containing a mix of labels with low and high prediction accuracy. Balanced-accuracy scores provided a more conservative metric and were more sensitive to class imbalance than the F1 scores.
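The gap between the two metrics can be illustrated with a minimal sketch. It assumes a support-weighted F1 average, which the text does not specify, and uses hypothetical labels "A" and "B" with made-up counts rather than data from this study. Balanced accuracy is the unweighted mean of per-class recall, so a poorly predicted minority class lowers it more than it lowers a support-weighted F1.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of samples of each true class that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support.

    Assumption: the averaging scheme is illustrative; the source does not
    state which F1 average was used.
    """
    totals = Counter(y_true)
    predicted = Counter(y_pred)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    score = 0.0
    for c, n in totals.items():
        # F1 = 2TP / (2TP + FP + FN) = 2TP / (support + predicted count)
        f1 = 2 * hits[c] / (n + predicted[c])
        score += f1 * n / len(y_true)
    return score

# Hypothetical toy data: 90 samples of class "A", all correct;
# 10 samples of class "B", of which only 3 are predicted correctly.
y_true = ["A"] * 90 + ["B"] * 10
y_pred = ["A"] * 90 + ["B"] * 3 + ["A"] * 7

print("balanced accuracy:", round(balanced_accuracy(y_true, y_pred), 3))
print("weighted F1:", round(weighted_f1(y_true, y_pred), 3))
```

In this toy case the minority class has a recall of 0.3, which pulls balanced accuracy well below the weighted F1, the same direction of difference reported for the trials with mixed per-label accuracy.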

Placentia Bay Dataset