3.2 Algorithm comparison
The variation of the performance metrics across the ten simulation runs is illustrated by the box plots in Figure 4. The small range of each box plot indicates little sensitivity to pseudo-absence generation across the ten runs for both seasons. The mean values of the six performance metrics (accuracy, kappa, sensitivity, specificity, F1 score and TSS) calculated from the 10-fold cross-validation were high (mean range: 0.81-0.99) for the 14 algorithms in both seasons (Fig. 4), indicating good predictive performance. Based on the six performance metrics, the best model for both seasons was the Random Forest (RF), with values ranging from 0.93 to 0.99.
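As an illustration only, the sketch below shows how these six metrics can be derived from the confusion matrix of a 10-fold cross-validated Random Forest classifier. The predictors `X` and the presence/pseudo-absence labels `y` are hypothetical placeholders, and the original analysis may have been run with a different toolchain; this is not the authors' code.

```python
# Minimal sketch (placeholder data, not the study dataset): six performance
# metrics from a 10-fold cross-validation of a Random Forest classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, f1_score)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))         # hypothetical environmental predictors
y = rng.integers(0, 2, size=200)      # hypothetical presence (1) / pseudo-absence (0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(RandomForestClassifier(random_state=0), X, y, cv=cv)

tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
sensitivity = tp / (tp + fn)          # true positive rate
specificity = tn / (tn + fp)          # true negative rate
metrics = {
    "accuracy": accuracy_score(y, y_pred),
    "kappa": cohen_kappa_score(y, y_pred),
    "sensitivity": sensitivity,
    "specificity": specificity,
    "F1": f1_score(y, y_pred),
    "TSS": sensitivity + specificity - 1,   # true skill statistic
}
print(metrics)
```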
When comparing the tuned RF and the stacking method for the dry season, the tuned RF approach yielded slightly but significantly better values than the stacking method for accuracy (mean: 0.978 vs. 0.970), specificity (mean: 0.980 vs. 0.970), F1 score (mean: 0.978 vs. 0.970) and TSS (mean: 0.957 vs. 0.941) (Kruskal-Wallis test, p < 0.05, Fig. 5a,d,e,f). However, no significant difference was observed between the two methods for the wet season (Kruskal-Wallis test, p > 0.05, Fig. 5). Given these small differences, both methods were used to generate predictions of the whales’ potential distribution for the wet and dry seasons separately.
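For completeness, a minimal sketch of such a Kruskal-Wallis comparison follows. The per-run values are illustrative numbers only, loosely centred on the reported dry-season accuracy means; they are not the study data.

```python
# Minimal sketch (illustrative values, not the study data): Kruskal-Wallis
# test of a performance metric between the tuned RF and the stacking method
# across the ten simulation runs.
from scipy.stats import kruskal

# Hypothetical per-run accuracy values for each method (ten runs each).
rf_accuracy = [0.979, 0.977, 0.980, 0.976, 0.978,
               0.979, 0.977, 0.978, 0.980, 0.976]
stack_accuracy = [0.971, 0.969, 0.972, 0.968, 0.970,
                  0.971, 0.969, 0.970, 0.972, 0.968]

stat, p = kruskal(rf_accuracy, stack_accuracy)
print(f"H = {stat:.3f}, p = {p:.4f}")   # p < 0.05 -> significant difference
```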