INTRODUCTION
Abrupt changes in the ocean environment are increasing in frequency as
climate change accelerates (Ainsworth et al., 2020), resulting in the
loss of key ecosystems (Sully et al., 2019) and shifts in endangered
species’ distributions (Plourde et al., 2019). Detecting such changes
requires both historical and real-time (or near-real-time) data made
readily available to managers and decision-makers. Scientists and
practitioners are being tasked with finding efficient solutions for
monitoring environmental health and detecting incipient change (Gibb et
al., 2019; Kowarski & Moors-Murphy, 2020). This challenge includes
monitoring for changes in species’ presence, abundance, distribution,
and behaviour (Durette-Morin et al., 2019; Fleming et al., 2018;
Root-Gutteridge et al., 2018), monitoring anthropogenic activity and
disturbance levels (Gómez et al., 2018), monitoring the physical
environment (Almeira & Guecha, 2019), and detecting harmful events
(Rycyk et al., 2020), among others.
Environmental sounds provide a proxy for investigating ecological processes
(Gibb et al., 2019; Rycyk et al., 2020), including exploring complex
interactions between anthropogenic activity and biota (Erbe et al.,
2019; Kunc et al., 2016). Sound provides useful information on
environmental conditions and ecosystem health, allowing, for example,
the rapid identification of disturbed coral reefs (Elise et al., 2019).
Moreover, numerous species (e.g., birds, mammals, fish, and
invertebrates) rely on acoustic communication for foraging, mating and
reproduction, habitat use, and other ecological functions (Eftestøl et
al., 2019; Kunc & Schmidt, 2019; Luo et al., 2015; Schmidt et al.,
2014). Noise produced by anthropogenic activities (e.g., vehicles,
stationary machinery, explosions) can interfere with such processes,
affecting the health and reproductive success of multiple marine taxa
(Kunc & Schmidt, 2019). In response to concerns about noise pollution,
increasing effort is being invested in developing, testing, and
implementing noise management measures in both terrestrial and marine
environments. Consequently, Passive Acoustic Monitoring (PAM) has become
a mainstream tool in biological monitoring (Gibb et al., 2019). PAM
comprises a set of techniques for the systematic collection of acoustic
recordings for environmental monitoring, enabling large amounts of
environmental information to be gathered at multiple locations over
extended periods of time.
One of PAM’s most common applications is in marine mammal monitoring and
conservation. Marine mammals produce complex vocalizations that are
species-specific (if not individually unique), and such vocalizations
can be used to estimate species’ distributions and habitat use
(Durette-Morin et al., 2019; Kowarski & Moors-Murphy, 2020). PAM
applications in marine mammal research span from the study of their
vocalizations and behaviours (Madhusudhana et al., 2019; Vester et al.,
2017) to assessing anthropogenic disturbance (Nguyen Hong Duc et al.,
2021). PAM datasets can reach considerable sizes, particularly when
recorded at high sampling rates, and projects often rely on experts to
manually inspect the acoustic recordings for the identification of
sounds of interest (Nguyen Hong Duc et al., 2021). For projects
involving recordings collected over multiple months at different
locations, conducting a manual analysis of the entire dataset can be
prohibitive, and often only a relatively small portion of the acoustic
recordings is subsampled for analysis.
At its core, studying acoustic environments is a signal detection and
classification problem in which a large number of spatially and
temporally overlapping acoustic energy sources need to be differentiated
to better understand their relative contributions to the soundscape.
Such an analytical process, termed acoustic scene classification (Geiger
et al., 2013), is a key step in analysing environmental information
collected by PAM recorders. Acoustic scenes can contain multiple
overlapping sound sources, which generate complex combinations of
acoustic events (Geiger et al., 2013). This definition overlaps with the
ecoacoustics definition of a soundscape (Farina & Gage, 2017), providing
a bridge between the two fields. A soundscape represents the total
acoustic energy contained within an environment and comprises three
intersecting sound sources: geophony (i.e., geological sounds), biophony
(i.e., biological sounds), and anthrophony (i.e., anthropogenic sounds).
A goal of ecoacoustics is to understand how these sources interact and
influence each other, with a particular focus on interactions between
biological and anthropogenic sounds.
Automated acoustic analysis can overcome some of the limitations
encountered in manual PAM analysis, allowing ecoacoustics researchers to
explore full datasets (Houegnigan et al., 2017). Deep learning
is a family of artificial intelligence approaches that has profoundly
changed research in biology and ecology
(Christin et al., 2019). Among the deep learning approaches,
Convolutional Neural Networks (CNNs) have demonstrated high accuracy in
performing image classification tasks, including the classification of
spectrograms (i.e., visual representations of sound intensity across
time and frequency) (Hershey et al., 2017; LeBien et al., 2020).
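To make the spectrogram representation concrete, the short sketch below
computes a log-scaled spectrogram from a mono recording using standard
Python tools; it is a minimal illustration, and the file name and STFT
parameters are assumptions rather than settings used in this study.

    # Minimal sketch (not code from this study): computing the kind of
    # log-scaled spectrogram a CNN would take as input. The file name
    # and STFT parameters below are illustrative assumptions.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import spectrogram

    rate, audio = wavfile.read("pam_recording.wav")  # hypothetical mono file
    freqs, times, sxx = spectrogram(
        audio.astype(np.float32),
        fs=rate,
        nperseg=1024,  # window length: trades time vs. frequency resolution
        noverlap=512,  # 50% overlap between successive windows
    )
    log_sxx = 10 * np.log10(sxx + 1e-10)  # dB scale; offset avoids log(0)
    # log_sxx is a 2-D (frequency x time) array, i.e., an image a CNN
    # can classify like any other image.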
CNNs have been applied successfully to several ecological problems, and
their use in ecology is growing (Christin et al., 2019); examples
include the processing of camera trap images to identify species, age
classes, and numbers of animals and to classify behaviour patterns
(Lumini et al., 2019; Norouzzadeh et al., 2018; Tabak et al., 2019).
CNN algorithms also perform well in acoustic classification (Hershey et
al., 2017), including the identification of a growing number of
species’ vocalizations, such as those of crickets and cicadas (Dong et
al., 2018), birds and frogs (LeBien et al., 2020), fish (Mishachandar &
Vairamuthu, 2021), and, lately, marine mammals (Usman et al., 2020).
The latter include the training of neural networks to detect North
Atlantic right whale calls using a mix of real and synthetic data
(Padovese et al., 2021) and the classification of sperm whale clicks
(Bermant et al., 2019). Most CNN applications
focus on species detection rather than a broader characterization of the
acoustic environment. Furthermore, automated acoustic analysis often
relies on supervised classifiers trained on large datasets of known
sounds (i.e., training datasets); creating such datasets is
time-consuming and requires expert-driven manual classification of the
acoustic data (Bittle & Duncan, 2013).
Recent developments in acoustic scene analysis demonstrate how acoustic
feature sets derived from CNNs, combined with dimensionality reduction,
can improve our ability to understand ecoacoustics datasets while
providing a common ground for analysing recordings collected across
multiple environments and temporal scales (Clink & Klinck, 2020;
Mishachandar & Vairamuthu, 2021; Sethi et al., 2020).
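As a hedged illustration of this kind of workflow (a sketch, not the
pipeline used in this study), the code below passes spectrograms through
a stand-in feature extractor and projects the features into two
dimensions; extract_cnn_features is a hypothetical placeholder for any
pretrained audio CNN embedder, and PCA stands in for whichever
dimensionality-reduction method a study might adopt.

    # Hedged sketch of CNN feature extraction + dimensionality reduction.
    # extract_cnn_features is a hypothetical placeholder, not a real API.
    import numpy as np
    from sklearn.decomposition import PCA

    def extract_cnn_features(spectrograms: np.ndarray) -> np.ndarray:
        """Placeholder: map (n_clips, n_freq, n_time) spectrograms to
        (n_clips, n_features) embeddings; a pretrained CNN would go here."""
        n_clips = spectrograms.shape[0]
        return spectrograms.reshape(n_clips, -1)  # trivial stand-in

    spectrograms = np.random.rand(100, 64, 96)     # 100 synthetic clips
    features = extract_cnn_features(spectrograms)  # (n_clips, n_features)

    pca = PCA(n_components=2)                      # reduce to 2-D
    embedding = pca.fit_transform(features)        # (100, 2) low-dim map

The resulting low-dimensional embedding can then be visualized or
clustered to compare recordings across sites and time periods.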
Our study explores the application of acoustic scene analysis to two
sets of PAM recordings containing marine mammal vocalizations (Fig. 1).
The first dataset, the Watkins Marine Mammal Sound Database (Woods Hole
Oceanographic Institution and the New Bedford Whaling Museum; WMD
hereafter), allowed us to test whether acoustic features can be used to
classify marine mammal vocalizations according to multiple levels of
taxonomic organization. The second dataset, consisting of approximately
72 hours of recordings collected by Fisheries and Oceans Canada at two
locations within Placentia Bay (Newfoundland, Canada; PBD hereafter),
allowed us to test whether the acoustic features can be used to identify
different sound sources, namely ships, seismic airguns, and humpback
whales (Megaptera novaeangliae).