1. Model performance
The performance of all models is summarized in Figures 2-4 and Tables 5-6. At the macro-averaged level, the ensemble model performed better than either submodel individually within each classification pass (Table 5). The addition of random artificial noise in submodel 2 improved both precision and recall in pass 1, though only recall in pass 2 (Table 5). The ensemble model of pass 2 performed substantially better than the corresponding model in pass 1 (Figure 2), likely due both to the larger training dataset used for this pass and to the fact that the training and validation datasets for this pass both consisted of audio we collected in the field, and were therefore more similar to one another than they were in pass 1. This increased similarity between training and validation datasets in pass 2 may also explain why added artificial noise improved only recall, rather than both metrics, during this pass, though we did not perform further analysis of this specific result.

Per-class performance was generally good, with visible improvements from pass 1 to pass 2 for most classes, although species with subjectively more variable vocalizations (e.g., T. major) performed less well (Figure 3, Table 6). Intriguingly, the increase in classification accuracy we observed at the macro-averaged level did not hold uniformly at the class level, with submodel 1 or 2 often yielding better results than the ensemble (Table 6). An analysis of classifier score distributions for positive detections showed greater score separation between true positive and false positive detections in pass 2 than in pass 1 (Figure 4), indicating better overall predictive power for the pass 2 model (Knight et al. 2017). We also observed that our chosen score threshold yielded precision and recall values close to the inflection point of the precision-recall curve, indicating that this threshold was an appropriate choice for balancing the two metrics.
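As an illustration of how these quantities relate, the sketch below averages the scores of two hypothetical submodels into an ensemble score, evaluates precision and recall at a fixed operating threshold, and compares that threshold against the full precision-recall curve. The scikit-learn functions, the simulated scores and labels, the mean-score ensemble, and the 0.5 threshold are illustrative assumptions only, not the pipeline or values used in this study.

```python
# Illustrative sketch only (not the study's actual pipeline): combine two
# hypothetical submodel scores into an ensemble, then relate a fixed score
# threshold to the precision-recall curve.
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

rng = np.random.default_rng(0)

# Hypothetical per-clip labels (1 = target species present) and submodel scores.
y_true = rng.integers(0, 2, size=1000)
scores_sub1 = np.clip(0.55 * y_true + rng.normal(0.25, 0.20, size=1000), 0, 1)
scores_sub2 = np.clip(0.60 * y_true + rng.normal(0.20, 0.20, size=1000), 0, 1)

# Simple ensemble: mean of the two submodel scores.
scores_ens = (scores_sub1 + scores_sub2) / 2

# Precision and recall at a fixed threshold (0.5 used here as a stand-in for
# a chosen operating score threshold).
threshold = 0.5
y_pred = (scores_ens >= threshold).astype(int)
print("precision at threshold:", precision_score(y_true, y_pred))
print("recall at threshold:   ", recall_score(y_true, y_pred))

# Full precision-recall curve; the point nearest the fixed threshold shows
# whether it sits close to the bend where precision and recall balance.
precision, recall, thresholds = precision_recall_curve(y_true, scores_ens)
idx = np.argmin(np.abs(thresholds - threshold))
print("nearest curve point (precision, recall):", precision[idx], recall[idx])

# For multi-class, per-class evaluation, macro-averaged metrics weight every
# class equally, e.g. precision_score(y_true_mc, y_pred_mc, average="macro").
```

Averaging submodel scores is only one of several possible ensembling schemes; whichever combination rule is used, the same threshold-versus-curve comparison applies.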