MATERIALS ANd METHODS
Data acquisition and
preparation
We collected all records available in the Watkins Marine Mammal Database
website listed under the “all cuts” page. We limited the analysis to
37 marine mammal species by discarding data for species with a low
number of audio samples; we processed 17.1 hours of audio. For each
audio file in the WMD the associated metadata included a label for the
sound sources present in the recording (biological, anthropogenic, and
environmental), as well as information related to the location and date
of recording. We selected audio clips that contained a marine mammal as
the main and only sound source present in the recording and labelled the
vocalizations according to taxonomic group (Odontocetae,
Mysticetae, Otariidae, and Phocidae ), order, family, and
species.
We created an additional label defining the population of origin for the
orca (Orcinus orca ) samples, which split them into five groups.
The first three, EN Atlantic, WN Atlantic and EN Pacific ,
are recordings of orcas in the wild. EN Atlantic samples include
orcas recorded in the Norwegian Sea and in a Norwegian fjord. WN
Atlanti c samples include orcas recorded outside St. John’s Harbour
(Newfoundland, Canada) and orcas recorded approximately 130 km south of
Martha’s Vineyard (Massachusetts, U.S.). The EN Atlantic andWN Atlantic samples most likely contain a mix of two orca
ecotypes (T1 and T2). EN Pacific samples included whales recorded
in Saanich Inlet (British Columbia, Canada) and Dabob Bay (Washington,
U.S.). These recordings could belong to three orca ecotypes (i.e.,
resident, offshore, and transient). The last two labels, EN
Atlantic - captive and EN Pacific - captive , indicate recordings
of captive whales Moby Doll, a resident orca captured in British
Columbia, and Keiko, captured in Iceland (either a T1 or a T2 ecotype).
The Placentia Bay Database includes recordings collected by Fisheries
and Oceans Canada at multiple stations within Placentia Bay
(Newfoundland, Canada), from 2017 to 2020. From the PBD, we selected
three days of recordings from summer 2019.
The first two days (2019/08/10
and 2019/10/10) were collected by an AMAR G4 hydrophone (sensitivity:
-165.02 dB re 1V/µPa at 250 Hz) deployed at 65 m of depth, approximately
13 km south of the town of Burin. The third day of recordings was
collected by an AMAR G4 hydrophone (sensitivity: -164.92 dB re 1V/µPa at
250 Hz) deployed at 100 m of depth, approximately 2 km south of Red
Island. Both hydrophones were set to operate following 15 min cycles,
with the first 60 s sampled at 512 kHz, and the remaining 14 min sampled
at 64 kHz. For the purpose of this study, we limited the analysis to the
64 kHz recordings. From the Burin deployment, we selected the
10th of August as it contained seismic airgun noise
from oil and gas exploration activity happening in the Grand Banks,
approximately 170 km south of the hydrophone deployment location. From
the Red Island deployment, we selected the 26th of
July, which contained ship transits and humpback whale vocalizations.
Before proceeding with the analysis, all recordings were labelled by
time stamp and location. All days contained humpback whale
vocalizations.
Acoustic feature
extraction
The audio files from the WMD and PBD databases were used as input for
VGGish (Abu-El-Haija et al., 2016; Simonyan & Zisserman, 2014), a CNN
developed and trained to perform general acoustic classification. VGGish
was trained on the Youtube8M dataset, containing more than two million
user-labelled audio-video files. Rather than focusing on the final
output of the model (i.e., the assigned labels), here the model was used
as a feature extractor (Sethi et al., 2020). VGGish converts audio input
into a semantically meaningful vector consisting of 128 features. The
model returns features at multiple resolution: ~1 s (960
ms); ~1 min (59’520 ms); ~5 min (299’520
ms). All of the visualizations and results pertaining to the WMD were
prepared using the finest feature resolution of ~1 s.
The visualizations and results pertaining to the PBD were prepared using
the ~1 min features, except for the humpback whale
detection test, which was conducted on the ~1 s
features.
UMAP ordination and
visualization
To allow for data visualization and to reduce the 128 features to two
dimensions for further analysis, we applied Uniform Manifold
Approximation and Projection (UMAP) to both datasets in full, and
inspected the resulting plots (Figs. 2 to 7). UMAP is a non-linear
dimensionality reduction algorithm based on the concept of topological
data analysis which, unlike other dimensionality reduction techniques
(e.g., tSNE), preserves both the local and global structure of
multivariate datasets (McInnes et al., 2018).
The UMAP algorithm generates a low-dimensional representation of a
multivariate dataset while maintaining the relationships between points
in the global dataset structure (i.e., the 128 features extracted from
VGGish). Each point in a UMAP plot in this paper represents an audio
sample with duration of either ~ 1 sec or
~ 1 min. Each point in the two-dimensional UMAP space
also represents a vector of 128 VGGish features. The nearer two points
are in the plot space, the nearer the two points are in the
128-dimensional space, and thus the distance between two points in UMAP
reflects the degree of similarity between two audio samples in our
datasets. Areas with a high density of samples in UMAP space should,
therefore, contain sounds with similar characteristics, and such
similarity should decrease with increasing point distance. The
visualizations and classification trials presented here illustrate how
the two techniques (VGGish and UMAP) can be used together for marine
ecoacoustics analysis.
Labelling sound sources
Sample labels were obtained with a mix of techniques: labels for the WMD
records were obtained from the database metadata; for the PBD
recordings, the start and end of seismic exploration was identified
through manual inspection, ship presence was inferred from sound
pressure levels (SPL) in the ship noise band (40-315 Hz) (Baldwin et
al., 2021), and humpback whale presence was inferred using an acoustic
detection model (Allen et al., 2021).
To label anthropogenic noise sources in the PBD, we first used PAMGuide
(Merchant et al., 2015) to process the acoustic recordings. We computed
broadband SPL (dB re 1 µPa) between 50 and 4,000 Hz (1 min resolution)
as a global measure of sound pressure level in the dataset. As an
indicator of ship noise, we computed the SPL between 40 and 315 Hz
(i.e., ship band hereafter) at 1 min resolution. The ship band
encompasses the 63, 150, and 250 Hz 1/3 octave bands (Baldwin et al.,
2021), which are indicators of low-frequency ship noise levels
(Merchant, et al., 2014). Samples that satisfied the following two
conditions were considered as ship presences:
1) the ship band SPL was within
12 dB of the broadband SPL; 2) the 5 min mean ship band SPL was 3 dB
above the global median SPL (i.e., computed on the full dataset)
(Appendix S2.2). PBD samples collected near Burin on 08/10/2019 were
inspected to identify the start and end of seismic airgun activity. All
1-min samples with a time stamp falling within the period of seismic
exploration were marked as airgun noise present and contained multiple
blasts.
Biological noise sources in the PBD recordings were processed using the
humpback whale acoustic detector created by NOAA and Google (Allen et
al., 2021), providing a model score for every ~1 s
sample. The model returns scores ranging from 0 to 1 indicating the
confidence in the predicted humpback whale presence. We used the results
of this detection model to label the PBD samples according to presence
of humpback whale vocalizations. We selected 0.8 as the minimum model
score needed to declare a humpback present, while every sample with a
score lower than 0.8 was labelled as an absence.
Label prediction
performance
To predict labels from the acoustic features for both the WMD and the
PBD datasets, we applied nested k-fold cross validation to a random
forest model, with ten-folds in the outer loop, and five-folds in the
inner loop. We selected nested cross validation as it allows model
optimization of hyperparameters and performance evaluation in a single
step. Models were trained either on the two UMAP dimensions, or on the
full set of 128 acoustic features, depending on model performance. Model
performance was evaluated using two metrics: F1 and balanced-accuracy
scores, both on a scale from 0 to 1. The F1 score combines model recall
and precision, favouring models with a high score in both metrics
(Chinchor, 1992). Balanced accuracy is suited for measuring model
performance when samples are highly imbalanced, and represents the
average recall obtained for each model class (Brodersen et al., 2010).
When the F1 and balanced accuracy scores indicated poor performance of
the classifier, we repeated the trial using the 128 acoustic features
instead of the two UMAP dimensions.
In total, we conducted 13 trials on the two databases, six on the WMD,
and seven on the PBD (Table 1). The first WMD trial included building a
classifier for Mysticete , Odontocete , and Pinniped .
For the remaining five trials, we created subsets of the WMD and ran
classifiers for: three Mysticete (Balaenidae ,Balaenopteridae , and Eschrichtiidae ) and fourOdontocete families (Delphinidae , Monodontidae ,Phocoenidae , and Physteridae ); threeBalaenopteridae species (minke, fin, and humpback whales), 14Delphinidae species (see Appendix S1 for a complete list); and
three distinct orca populations. Species with less than 100 samples were
removed from the analysis.
Trials on the PBD labels proceeded as follows: i) classification of
hydrophone locations (i.e., Burin and Red Island); classification of
anthropogenic noise sources, including ii-iii) seismic airguns and iv-v)
ships; and presence of humpback whales using vi) the two UMAP dimensions
and vii) the 128 acoustic features, respectively. Presences represented
a very small fraction of the PBD (<0.003 %), leading to high
class imbalance. We used two strategies to reduce class imbalance: we
selected a subset of the PBD containing only hours with at least ten
presences (this reduced the PBD dataset to 19 hours of PAM recordings);
and then implemented a balanced random forest classifier (Lemaître et
al., 2017) in place of the model used for the previous trials.