MATERIALS ANd METHODS

Data acquisition and preparation

We collected all records available in the Watkins Marine Mammal Database website listed under the “all cuts” page. We limited the analysis to 37 marine mammal species by discarding data for species with a low number of audio samples; we processed 17.1 hours of audio. For each audio file in the WMD the associated metadata included a label for the sound sources present in the recording (biological, anthropogenic, and environmental), as well as information related to the location and date of recording. We selected audio clips that contained a marine mammal as the main and only sound source present in the recording and labelled the vocalizations according to taxonomic group (Odontocetae, Mysticetae, Otariidae, and Phocidae ), order, family, and species.
We created an additional label defining the population of origin for the orca (Orcinus orca ) samples, which split them into five groups. The first three, EN Atlantic, WN Atlantic and EN Pacific , are recordings of orcas in the wild. EN Atlantic samples include orcas recorded in the Norwegian Sea and in a Norwegian fjord. WN Atlanti c samples include orcas recorded outside St. John’s Harbour (Newfoundland, Canada) and orcas recorded approximately 130 km south of Martha’s Vineyard (Massachusetts, U.S.). The EN Atlantic andWN Atlantic samples most likely contain a mix of two orca ecotypes (T1 and T2). EN Pacific samples included whales recorded in Saanich Inlet (British Columbia, Canada) and Dabob Bay (Washington, U.S.). These recordings could belong to three orca ecotypes (i.e., resident, offshore, and transient). The last two labels, EN Atlantic - captive and EN Pacific - captive , indicate recordings of captive whales Moby Doll, a resident orca captured in British Columbia, and Keiko, captured in Iceland (either a T1 or a T2 ecotype).
The Placentia Bay Database includes recordings collected by Fisheries and Oceans Canada at multiple stations within Placentia Bay (Newfoundland, Canada), from 2017 to 2020. From the PBD, we selected three days of recordings from summer 2019. The first two days (2019/08/10 and 2019/10/10) were collected by an AMAR G4 hydrophone (sensitivity: -165.02 dB re 1V/µPa at 250 Hz) deployed at 65 m of depth, approximately 13 km south of the town of Burin. The third day of recordings was collected by an AMAR G4 hydrophone (sensitivity: -164.92 dB re 1V/µPa at 250 Hz) deployed at 100 m of depth, approximately 2 km south of Red Island. Both hydrophones were set to operate following 15 min cycles, with the first 60 s sampled at 512 kHz, and the remaining 14 min sampled at 64 kHz. For the purpose of this study, we limited the analysis to the 64 kHz recordings. From the Burin deployment, we selected the 10th of August as it contained seismic airgun noise from oil and gas exploration activity happening in the Grand Banks, approximately 170 km south of the hydrophone deployment location. From the Red Island deployment, we selected the 26th of July, which contained ship transits and humpback whale vocalizations. Before proceeding with the analysis, all recordings were labelled by time stamp and location. All days contained humpback whale vocalizations.

Acoustic feature extraction

The audio files from the WMD and PBD databases were used as input for VGGish (Abu-El-Haija et al., 2016; Simonyan & Zisserman, 2014), a CNN developed and trained to perform general acoustic classification. VGGish was trained on the Youtube8M dataset, containing more than two million user-labelled audio-video files. Rather than focusing on the final output of the model (i.e., the assigned labels), here the model was used as a feature extractor (Sethi et al., 2020). VGGish converts audio input into a semantically meaningful vector consisting of 128 features. The model returns features at multiple resolution: ~1 s (960 ms); ~1 min (59’520 ms); ~5 min (299’520 ms). All of the visualizations and results pertaining to the WMD were prepared using the finest feature resolution of ~1 s. The visualizations and results pertaining to the PBD were prepared using the ~1 min features, except for the humpback whale detection test, which was conducted on the ~1 s features.

UMAP ordination and visualization

To allow for data visualization and to reduce the 128 features to two dimensions for further analysis, we applied Uniform Manifold Approximation and Projection (UMAP) to both datasets in full, and inspected the resulting plots (Figs. 2 to 7). UMAP is a non-linear dimensionality reduction algorithm based on the concept of topological data analysis which, unlike other dimensionality reduction techniques (e.g., tSNE), preserves both the local and global structure of multivariate datasets (McInnes et al., 2018).
The UMAP algorithm generates a low-dimensional representation of a multivariate dataset while maintaining the relationships between points in the global dataset structure (i.e., the 128 features extracted from VGGish). Each point in a UMAP plot in this paper represents an audio sample with duration of either ~ 1 sec or ~ 1 min. Each point in the two-dimensional UMAP space also represents a vector of 128 VGGish features. The nearer two points are in the plot space, the nearer the two points are in the 128-dimensional space, and thus the distance between two points in UMAP reflects the degree of similarity between two audio samples in our datasets. Areas with a high density of samples in UMAP space should, therefore, contain sounds with similar characteristics, and such similarity should decrease with increasing point distance. The visualizations and classification trials presented here illustrate how the two techniques (VGGish and UMAP) can be used together for marine ecoacoustics analysis.

Labelling sound sources

Sample labels were obtained with a mix of techniques: labels for the WMD records were obtained from the database metadata; for the PBD recordings, the start and end of seismic exploration was identified through manual inspection, ship presence was inferred from sound pressure levels (SPL) in the ship noise band (40-315 Hz) (Baldwin et al., 2021), and humpback whale presence was inferred using an acoustic detection model (Allen et al., 2021).
To label anthropogenic noise sources in the PBD, we first used PAMGuide (Merchant et al., 2015) to process the acoustic recordings. We computed broadband SPL (dB re 1 µPa) between 50 and 4,000 Hz (1 min resolution) as a global measure of sound pressure level in the dataset. As an indicator of ship noise, we computed the SPL between 40 and 315 Hz (i.e., ship band hereafter) at 1 min resolution. The ship band encompasses the 63, 150, and 250 Hz 1/3 octave bands (Baldwin et al., 2021), which are indicators of low-frequency ship noise levels (Merchant, et al., 2014). Samples that satisfied the following two conditions were considered as ship presences: 1) the ship band SPL was within 12 dB of the broadband SPL; 2) the 5 min mean ship band SPL was 3 dB above the global median SPL (i.e., computed on the full dataset) (Appendix S2.2). PBD samples collected near Burin on 08/10/2019 were inspected to identify the start and end of seismic airgun activity. All 1-min samples with a time stamp falling within the period of seismic exploration were marked as airgun noise present and contained multiple blasts.
Biological noise sources in the PBD recordings were processed using the humpback whale acoustic detector created by NOAA and Google (Allen et al., 2021), providing a model score for every ~1 s sample. The model returns scores ranging from 0 to 1 indicating the confidence in the predicted humpback whale presence. We used the results of this detection model to label the PBD samples according to presence of humpback whale vocalizations. We selected 0.8 as the minimum model score needed to declare a humpback present, while every sample with a score lower than 0.8 was labelled as an absence.

Label prediction performance

To predict labels from the acoustic features for both the WMD and the PBD datasets, we applied nested k-fold cross validation to a random forest model, with ten-folds in the outer loop, and five-folds in the inner loop. We selected nested cross validation as it allows model optimization of hyperparameters and performance evaluation in a single step. Models were trained either on the two UMAP dimensions, or on the full set of 128 acoustic features, depending on model performance. Model performance was evaluated using two metrics: F1 and balanced-accuracy scores, both on a scale from 0 to 1. The F1 score combines model recall and precision, favouring models with a high score in both metrics (Chinchor, 1992). Balanced accuracy is suited for measuring model performance when samples are highly imbalanced, and represents the average recall obtained for each model class (Brodersen et al., 2010). When the F1 and balanced accuracy scores indicated poor performance of the classifier, we repeated the trial using the 128 acoustic features instead of the two UMAP dimensions.
In total, we conducted 13 trials on the two databases, six on the WMD, and seven on the PBD (Table 1). The first WMD trial included building a classifier for Mysticete , Odontocete , and Pinniped . For the remaining five trials, we created subsets of the WMD and ran classifiers for: three Mysticete (Balaenidae ,Balaenopteridae , and Eschrichtiidae ) and fourOdontocete families (Delphinidae , Monodontidae ,Phocoenidae , and Physteridae ); threeBalaenopteridae species (minke, fin, and humpback whales), 14Delphinidae species (see Appendix S1 for a complete list); and three distinct orca populations. Species with less than 100 samples were removed from the analysis.
Trials on the PBD labels proceeded as follows: i) classification of hydrophone locations (i.e., Burin and Red Island); classification of anthropogenic noise sources, including ii-iii) seismic airguns and iv-v) ships; and presence of humpback whales using vi) the two UMAP dimensions and vii) the 128 acoustic features, respectively. Presences represented a very small fraction of the PBD (<0.003 %), leading to high class imbalance. We used two strategies to reduce class imbalance: we selected a subset of the PBD containing only hours with at least ten presences (this reduced the PBD dataset to 19 hours of PAM recordings); and then implemented a balanced random forest classifier (Lemaître et al., 2017) in place of the model used for the previous trials.