Discussion
Single call annotation, whether manual or via recognisers, is a viable
alternative to acoustic indices for monitoring ecological restoration
(Linke and Deretic, 2020). While recognisers are commonly treated as one
analysis class, there is a gradient in both effort and performance of
auto-detectors. This ranges from largely automated recognisers – typically those built into software packages such as ‘Kaleidoscope’ (Wildlife Acoustics, 2017) – to completely custom-built software (Towsey et al., 2012).
2012). In all cases, various parameters alter recogniser performance;
these may be left as defaults in software or manipulated by the
end-user. Differences in recogniser construction alter performance, and this can manifest as poor agreement among recognisers built using
different software (Lemen et al., 2015). Relying on recognisers without
properly understanding how they operate can be problematic (Russo and
Voigt, 2016). In this study, we took a semi-custom approach; we used a
pre-programmed matching algorithm (Towsey et al., 2012; Ulloa et al.,
2016) within the R package monitoR (Katz et al., 2016b), but actively
investigated three important parameters that are often overlooked – or, at least, are rarely reported on – in recogniser construction. These
parameters were call template selection and representativeness, template
construction (including amplitude cut-off) and the threshold of
similarity at which a detection is returned (score cut-off). We argue
that there is a need to establish thorough construction and evaluation
mechanisms for building recognisers, and for these to be properly
reported in the literature.
First, choices pertaining to call template selection are crucial (Katz
et al., 2016a; Teixeira et al., 2022). Studies typically report the
source of call templates (e.g. whether calls were collected from wild or
captive animals), but usually fail to explain the decisions underlying
the selection of the exact calls used. For example, were calls free of
background noise – and how did this affect recogniser performance?
Animal calls exist not in isolation, but within an overall soundscape.
As such, representing calls within the context of the soundscapes that
we seek to monitor may be important. While our call recognisers perform
well overall (Table 2), they are also prone to species-specific errors.
For example, L. tasmaniensis recognisers produce false positives during rain events, whereas erroneous detections of C. parinsignifera are mainly of birds and insects (Table 3). In this study, we attempted to represent common background noises, such as other species’ calls and non-biological sounds (e.g. running water).
Although we selected calls that were relatively clear in their structure, we maintained a ‘buffer’ (or margin) around each selection in both the time and frequency domains. Since any manual selection of
candidate calls will incur a level of human bias, we chose to extract
between 100 and 200 templates per species, from which a minimum of 10
were tested and only two or three were chosen for the final recogniser.
Although for some rare or cryptic species, call templates can be
difficult to acquire, we argue that, as much as possible, recognisers
should be built following the testing of many candidate templates.
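To make this step concrete, the following is a minimal sketch of how a single candidate template might be built in monitoR with a small time and frequency margin around the selected call; the file path, time and frequency limits, amplitude cut-off and template name are hypothetical placeholders, not the settings used in this study.

```r
library(monitoR)

# Hypothetical survey recording containing a clear candidate call
clip <- "recordings/site01_20210115_2100.wav"

# Build a binary point matching template from a section of the clip.
# t.lim and frq.lim bound the call but leave a small margin around it
# in the time (seconds) and frequency (kHz) domains; amp.cutoff (dB
# relative to the maximum) controls which cells become 'on' points.
cand1 <- makeBinTemplate(
  clip,
  t.lim      = c(12.2, 13.4),   # call plus ~0.2 s margin either side
  frq.lim    = c(0.8, 2.6),     # call plus a margin above and below
  amp.cutoff = -24,
  name       = "Lt_site01_c1"
)

# Inspect the candidate template before deciding whether to test it
plot(cand1)
```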
Another important consideration is the representativeness of species’
call types and behaviours (Priyadarshani et al., 2018). For species that
exhibit large vocal repertoires, decisions must be made about the call
types to feature in recognisers. This should be driven by a program’s
objectives or research questions; for example, monitoring breeding may
require only one or two breeding-associated call types to feature in the
recogniser (Teixeira et al., 2019). Further, geographic variation in
call structure (e.g. regional dialects) may also impact recogniser
performance, and should be investigated when recognisers are intended
for use at spatial scales over which call types may vary (Kahl et al.,
2021; Lauha et al., 2022; Priyadarshani et al., 2018). If recognisers are applied across discrete or isolated populations, call templates may need to represent each area. In this study, we attempted to represent
inter-site variability by selecting candidate call templates from every
site where the species was recorded. For several species, the final
recognisers comprised templates from more than one site.
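As a brief sketch of how such multi-site recognisers can be assembled in monitoR, templates built from recordings at different sites can be combined into a single template list; the object names below are hypothetical.

```r
library(monitoR)

# Lt_site01_c1 and Lt_site04_c2 are hypothetical binTemplateList
# objects created with makeBinTemplate() from recordings made at
# two different sites (as sketched above).
Lt_recogniser <- combineBinTemplates(Lt_site01_c1, Lt_site04_c2)

# Confirm which templates the combined recogniser contains
templateNames(Lt_recogniser)
```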
Once call templates are chosen, decisions must be made about their
construction for use in a recogniser. In binary point matching, call
templates are created from a grid of on and off points (i.e. call and
non-call points), which are manipulated by the amplitude cut-off set by
the user (Katz et al., 2016b). In monitoR, the impact of altering
amplitude cut-off can be easily visualised (Figure 1). In this study, we
manipulated the amplitude cut-off so that templates captured both the call structure and some background noise. Since the recogniser ‘matches’ both the on and off points, finding a suitable balance between the two is important. Although
visualising and selecting amplitude cut-off is a manual and somewhat
arbitrary process, we considered that the large sample size of candidate
templates tested in this study would minimise any bias from this
process. However, for studies that test a smaller number of candidate
templates, we recommend that each template be tested at several
different amplitude cut-offs.
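As an illustration, the sketch below rebuilds the same hypothetical candidate call at several amplitude cut-offs and plots the resulting grids of on and off points for comparison; the recording path, limits and cut-off values are placeholders.

```r
library(monitoR)

clip <- "recordings/site01_20210115_2100.wav"   # hypothetical recording

# Rebuild the same candidate call at several amplitude cut-offs and
# compare the on/off point grids visually; lower (more negative)
# cut-offs admit more 'on' points, including background noise.
for (a in c(-18, -24, -30)) {
  tmp <- makeBinTemplate(
    clip,
    t.lim      = c(12.2, 13.4),
    frq.lim    = c(0.8, 2.6),
    amp.cutoff = a,
    name       = paste0("candidate_", abs(a), "dB")
  )
  plot(tmp)
}
```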
Finally, an appropriate score cut-off, which sets the threshold of
similarity at which a detection is returned (Figure 2), must be set for
each call template. Score cut-off alters the template’s sensitivity and therefore greatly affects performance. A higher score cut-off will
reduce false positive detections, but may increase false negatives (Katz
et al., 2016a). Conversely, increasing sensitivity by lowering score
cut-off will reduce false negatives, but it may reduce precision by
returning more false positives. Here, we tested every call template at
score cut-off increments of 0.2 from a low of 3, and measured
performance by ROC value. For most species examined, high ROC values
indicated that call templates were able to sufficiently trade off false
positives and false negatives while maximising true positives. This
rigorous approach to score cut-off testing allowed us to set highly
specific cut-offs in the final recognisers. However, for species that
are rarer or more cryptic, returning sufficient true positives may
require a lower score cut-off with a poorer ROC value. Where detecting
most, if not all, calls is important, other performance metrics like
recall should be given due consideration. Ultimately, decisions about
score cut-off should be driven by a study’s objectives, but we argue
that general metrics like ROC values are a good starting point in most
cases.
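The sketch below indicates how such a score cut-off sweep might be run in monitoR, scoring templates against a test recording and counting the detections returned at each threshold; the survey file, template list and the upper bound of the sweep are hypothetical, and in practice each set of detections would be compared against manual annotations to derive ROC (or recall and precision) values.

```r
library(monitoR)

# Hypothetical test recording and a previously built template list
survey <- "recordings/site01_test_survey.wav"
# Lt_recogniser is a binTemplateList, e.g. built as sketched above

# Score the templates against the survey and locate score peaks
scores <- binMatch(survey, Lt_recogniser)
peaks  <- findPeaks(scores)

# Sweep score cut-offs in increments of 0.2 from a low of 3 and count
# the detections returned at each threshold (upper bound illustrative)
for (cutoff in seq(3, 8, by = 0.2)) {
  templateCutoff(peaks) <- c(default = cutoff)
  dets <- getDetections(peaks)
  cat(sprintf("cut-off %.1f: %d detections\n", cutoff, nrow(dets)))
}
```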
We argue that ecoacoustic researchers and practitioners need to stop
treating recognisers as a black box and instead actively develop, improve and test processes that support their evaluation. From the literature, it is currently unclear how reliable recognisers are. Many studies report poor performance, but this may be more a function of inappropriate construction than of the recognition methods per se. In particular, recogniser testing is often neglected, and performance is frequently reported only as the number of detections in a larger dataset. Even when
performance is reported, it is often unclear what the source of low
recogniser accuracy is. We demonstrated that low accuracy can have multiple causes, from poorly selected templates to a lack of template calibration, for example of amplitude or detection (score) cut-offs. We recommend that recognisers not be treated as a static product; they can be refined and
adapted as more monitoring data become available. Using this study as an
example, we are currently working on a refinement of the recogniser for L. tasmaniensis based on better template recordings. A complete recommended workflow could start with a recogniser built for a particular species in a particular region, then be enhanced with data from other environments, followed by performance evaluation and refinement
as necessary.