Plain Language Summary
Surface waves generated by earthquakes carry valuable information about
Earth’s subsurface and the sources that generate them. To reliably and
robustly extract information from a suite of surface waveforms, the
signals require quality control screening. This process has typically
been done by experts labeling each data sample visually, which is
time-consuming and tedious for large datasets. To speed up signal
quality assessment, we trained machine learning methods using a large
set of human-labeled waveforms. We compared five techniques: logistic
regression, support vector machines, k-nearest neighbors, random
forests, and artificial neural networks. The artificial neural networks
performed the best and achieved an accuracy of 92%. Once trained, the
neural network model matched human performance but reduced the time cost
by 99.5% when applied to data it had never seen. Our analyses
demonstrate that automated processing can improve the quality of
surface-wave measurements without human quality control screening.
1 Introduction
Surface waves have long been used for subsurface imaging (e.g., Ekström,
2011) and earthquake source studies (e.g., Ammon, 2005). Recently,
double-difference seismic source location, derived using surface-wave
cross-correlations at globally distributed stations, has proven
successful in various geological settings (Chai et al., 2019; Cleveland
et al., 2015, 2018; Cleveland & Ammon, 2013; Howe et al., 2019; Kintner
et al., 2018, 2019, 2020, 2021). These techniques require reliable
surface-wave measurements, which are usually ensured through careful
visual inspection of seismograms. With seismic network deployments
increasing in frequency and size, the volume of available surface
waveforms is also increasing. More data are unequivocally a good
thing, but quality control of the ever-growing data volumes requires
substantial time and effort. The complexity of surface-wave signals and
the spatially and temporally varying character of seismic background
noise make reliable automation of the quality control process a
challenge. In some cases, data quality control becomes the most
time-consuming part of a seismological analysis.
Machine learning (ML) has shown promise when applied to a variety of
seismological research problems. This includes body-wave detection and
arrival-time picking (e.g., Chai et al., 2020; Mousavi et al., 2020;
Perol et al., 2018; Ross et al., 2018; Yoon et al., 2015; L. Zhu et al.,
2019; W. Zhu & Beroza, 2018) and signal association (e.g., McBrearty et
al., 2019; Ross et al., 2019). ML has also been used for seismic source
studies that include earthquake location (e.g., X. Zhang et al., 2020),
earthquake magnitude estimation (e.g., Mousavi & Beroza, 2020),
earthquake focal mechanism determination (e.g., Kuang et al., 2021), and
seismic signal discrimination (e.g., Li et al., 2018; Meier et al.,
2019; Seydoux et al., 2020). ML algorithms have also been developed for
seismic tomography (e.g., Bianco & Gerstoft, 2018; Z. Zhang & Lin,
2020), and laboratory earthquake prediction (e.g., Rouet-Leduc et al.,
2017). Most existing work has focused on body-wave analysis; few studies
have applied ML to the quality control of regional and teleseismic
intermediate-period surface waveforms.
An important application of ML in geophysics is to reduce the burden of
seismic processing to a level that allows more observations (more
earthquakes, more seismograms, etc.) to be included in seismic analyses.
We develop automated quality control processes that decrease the
data-quality assessment burden and increase overall data quality. Such
processes are applicable to research on Earth structure (Herrmann et
al., 2021) and seismic source analysis (e.g., Lay et al., 2018), and can
also supply data to long-standing projects that quantify earthquake
sources from regional to global scales (e.g., Ekström et al., 2012). No
automated process is perfect, but application of ML approaches can
effectively and efficiently identify the best and worst data and allow
human attention to focus on marginal-quality and unexpected observations
that require more understanding and experience to assess.
In this work, we explore how ML can aid the analysis of
intermediate-period regional and teleseismic surface waves.
We compiled roughly 400,000 surface-wave signals and associated quality
labels from stations around the globe. The quality labels are from past
studies that focused on events in various tectonic settings. We trained
five ML models including logistic regression (LR, Hosmer Jr et al.,
2013), support vector machine (SVM, Suykens & Vandewalle, 1999),
K-nearest neighbors (KNN, Keller et al., 1985), random forests (RF,
Breiman, 2001), and artificial neural networks (ANN, Jain et al., 1996)
to perform automated quality control processing of intermediate-period
surface-wave seismograms. We compared the performance, speed, and disk
usage of these ML techniques. We also tested the general applicability
of the best-performing model to events from other geographic regions.
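As an illustration of this comparison, a minimal sketch using
scikit-learn is given below. The feature matrix X, the binary quality
labels y, and all hyperparameters are placeholders for illustration only
and do not reproduce the exact configuration used in this study.

    # Minimal sketch of the five-classifier comparison (parameters illustrative).
    # X: waveform-derived feature matrix (n_samples, n_features); y: 0/1 quality labels.
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    def compare_models(X, y, seed=0):
        """Train the five candidate classifiers and report held-out accuracy."""
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y)
        models = {
            "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
            "SVM": make_pipeline(StandardScaler(), SVC()),
            "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
            "RF": RandomForestClassifier(n_estimators=200, random_state=seed),
            "ANN": make_pipeline(StandardScaler(),
                                 MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)),
        }
        return {name: accuracy_score(y_test, m.fit(X_train, y_train).predict(X_test))
                for name, m in models.items()}

In practice, each model would be tuned separately (e.g., via
cross-validation) before comparing held-out accuracy, speed, and model
size.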
2 Data
The data consist of seismic waveforms (along with metadata) and quality
labels. The seismograms were downloaded from the Incorporated Research
Institutions for Seismology (IRIS) Data Management Center (DMC) archive.
Each waveform is associated with a particular seismic event that has
known location and origin time information. The seismograms start six
minutes before the origin time and end 200 minutes after the origin
time. We removed the instrument response from the seismograms and
rotated the horizontal components from the original north-south and
east-west orientations to the radial and transverse coordinate system.
To isolate intermediate-period Love and Rayleigh waves, seismograms were
bandpass filtered to retain signals with periods between 30 and 60 s.
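As a rough illustration, these preprocessing steps could be scripted
with ObsPy along the following lines; the station and channel codes,
event parameters, and filter settings shown are hypothetical
placeholders, not the exact configuration used here.

    # Sketch of the waveform preparation described above (ObsPy; values illustrative).
    from obspy import UTCDateTime
    from obspy.clients.fdsn import Client
    from obspy.geodetics import gps2dist_azimuth

    client = Client("IRIS")
    origin = UTCDateTime("2015-01-01T00:00:00")   # hypothetical event origin time
    ev_lat, ev_lon = 0.0, 0.0                     # hypothetical epicenter
    st_lat, st_lon = 34.9, -106.5                 # hypothetical station coordinates

    # Window: six minutes before the origin to 200 minutes after the origin.
    st = client.get_waveforms("IU", "ANMO", "00", "LH?",
                              origin - 6 * 60, origin + 200 * 60,
                              attach_response=True)

    # Remove the instrument response after detrending.
    st.detrend("linear")
    st.remove_response(output="DISP")

    # Rotate the north/east horizontals to radial/transverse using the back azimuth.
    _, _, baz = gps2dist_azimuth(ev_lat, ev_lon, st_lat, st_lon)
    st.rotate(method="NE->RT", back_azimuth=baz)

    # Band-pass filter to retain periods between 30 and 60 s.
    st.filter("bandpass", freqmin=1.0 / 60.0, freqmax=1.0 / 30.0,
              corners=4, zerophase=True)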
2.1 Seismic data
During the model construction stage, we used observations from 759
seismic events and 4,502 seismic stations (Figure 1). The seismograms
were analyzed for previous earthquake relocation efforts (Cleveland et
al., 2018; Cleveland & Ammon, 2013, 2015; Kintner et al., 2018, 2019).
The origin times of these seismic events range from May 1989 to October
2016 (Figure S1a). The magnitudes of the events range from roughly 4.5
to 7.8 (Figure S1b). Event-station distances span a wide range, from 10
to 180 degrees (Figure S1c). Using a group-velocity range of 5.0 to
2.5 km/s, the expected surface-wave window lengths range from 222 s to
3979 s (Figure S1d). We refer to these seismograms as dataset DA.
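For reference, the window length for a given epicentral distance follows
directly from the fast and slow group-velocity bounds. The short sketch
below illustrates the calculation using an approximate
degrees-to-kilometers conversion.

    # Approximate surface-wave window from the group-velocity bounds (sketch).
    DEG_TO_KM = 111.19  # approximate great-circle kilometers per degree of arc

    def surface_wave_window(dist_deg, v_fast=5.0, v_slow=2.5):
        """Return (start, end, length) in seconds after the origin time for the
        expected surface-wave window at a given epicentral distance (degrees)."""
        dist_km = dist_deg * DEG_TO_KM
        t_start = dist_km / v_fast   # earliest arrival (fast group velocity)
        t_end = dist_km / v_slow     # latest arrival (slow group velocity)
        return t_start, t_end, t_end - t_start

    # A 10-degree path gives a window of roughly 222 s, and a 180-degree path
    # roughly 4,000 s, close to the range quoted above (small differences arise
    # from the exact distance conversion used).
    print(surface_wave_window(10.0))
    print(surface_wave_window(180.0))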