Figure 1. Maps of the (a) earthquakes and (b) seismic stations
used. The size of each circle in (a) is proportional to an event’s
earthquake magnitude. The gray circles and triangles are used for
training (dataset DA), whereas red symbols are used to evaluate the ML
model after training is completed (dataset DB). Thick lines are tectonic
plate boundaries (Bird, 2003). (For interpretation of the references to
color in this figure, the reader is referred to the web version of this
article.)
During the model construction stage, we selected 40 seismic events from
the United States Geological Survey (USGS) ComCat catalog (between
January 1990 and January 2019) with the following criteria. (a) The
events were located at least one arc degree away from any seismic event
of dataset DA. (b) The events were randomly selected from four magnitude
bins (between magnitudes 5 and 6; 6 and 7; 7 and 8; 8 and 9) with 10
events in each bin. We selected stations with long-period, high-gain seismometers (LH channels, 1 Hz sampling rate) located at epicentral distances between 10 and 180 degrees; data from temporary network deployments were excluded. These seismograms are referred to as
dataset DB. After the ML models were trained, we also downloaded
seismograms from 184 seismic events (Figure S2) with magnitude 6.0 and
larger between January 2018 and May 2020 recorded at the station SSPA
located near Standing Stone, Pennsylvania, USA. These seismograms
comprise dataset DC.
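For readers wishing to reproduce this style of event selection, the sketch below shows one way to query ComCat with ObsPy’s FDSN client and draw ten random events per magnitude bin. The time span and magnitude limits mirror the text; the variable names and sampling logic are illustrative, and the one-arc-degree separation criterion is omitted for brevity.

```python
import random

from obspy import UTCDateTime
from obspy.clients.fdsn import Client

# Query the USGS ComCat catalog for candidate events.
client = Client("USGS")
catalog = client.get_events(
    starttime=UTCDateTime("1990-01-01"),
    endtime=UTCDateTime("2019-01-01"),
    minmagnitude=5.0,
    maxmagnitude=9.0,
)

def event_magnitude(event):
    """Preferred magnitude if set, otherwise the first reported one."""
    magnitude = event.preferred_magnitude() or event.magnitudes[0]
    return magnitude.mag

# Draw 10 random events from each of the four magnitude bins.
selected = []
for low, high in [(5, 6), (6, 7), (7, 8), (8, 9)]:
    in_bin = [ev for ev in catalog if low <= event_magnitude(ev) < high]
    selected.extend(random.sample(in_bin, 10))
```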
2.2 Waveform quality labels
We compiled the quality labels for dataset DA from several earthquake
relocation studies (Cleveland et al., 2018; Cleveland & Ammon, 2013,
2015; Kintner et al., 2018, 2019). Reflecting the preferences of individual analysts, the original quality labels comprise either five or four categories. When using
five categories, the three highest categories were considered acceptable
(Figure S3a). For four categories, the two highest were considered
acceptable (Figure S3b). To combine the datasets and maximize the number
of labels, we mapped the quality labels into two categories, either
accepted or rejected (see Figure S4 for waveform examples). The spatial
distributions of quality labels show significant variations for
different seismic events (Figure S5) due to earthquake source
differences and background noise variations.
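In code, the mapping is a one-line rule. The sketch below assumes the original labels are integer ranks with 1 denoting the highest quality; this encoding is our convention for illustration, not necessarily that of the source studies.

```python
def to_binary_label(rank: int, n_categories: int) -> str:
    """Collapse an analyst quality rank (1 = best) into two classes.

    With five categories the three highest ranks are accepted; with
    four categories the two highest ranks are accepted.
    """
    n_accepted = 3 if n_categories == 5 else 2
    return "accepted" if rank <= n_accepted else "rejected"

# Example: rank 3 of 5 is accepted, but rank 3 of 4 is rejected.
assert to_binary_label(3, 5) == "accepted"
assert to_binary_label(3, 4) == "rejected"
```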
In addition, after the ML models were trained, three human analysts
re-labeled 1,000 seismograms randomly selected from dataset DA. Half of
them were assigned the same quality label by both a human analyst and
the ANN model and are referred to as Dataset 1. The other half were assigned different quality labels by a human analyst and the ANN model and are referred to as Dataset 2. The human analysts also labeled 2,000
seismograms from dataset DB after we applied the ANN model to all the
seismograms in dataset DB. These seismograms were randomly selected such
that (1) each of the 40 magnitude-binned earthquakes contributed 50 waveforms and (2) 1,000 seismograms were accepted by the ANN model
whereas the other 1,000 seismograms were rejected. We consider the
majority vote of the three analysts as ground truth, which is more
reliable (but costly) than the labels used in the model construction
stage.
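A minimal sketch of the majority vote over the three analyst labels (the label strings are illustrative):

```python
from collections import Counter

def majority_vote(labels):
    """Ground-truth label as the most common of the analyst labels."""
    return Counter(labels).most_common(1)[0][0]

# With three analysts and two classes, a majority always exists.
assert majority_vote(["accepted", "rejected", "accepted"]) == "accepted"
```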
3 Methods
Our analyses consisted of two stages (Figure S6): model construction and deployment. During the model construction stage, we computed statistical features from the surface waveforms and linked them to manually assigned quality labels. These features and labels were then used to train an ML model. During the deployment stage, we obtained and compared ML-derived quality labels by applying the ML model directly to a test set of surface waveforms not used in model construction.
3.1 Feature engineering
Surface waveforms are one of the most recognizable parts of a seismic
event’s wavefield, but also one of the most variable. The character of
the signal changes with source-to-station distance, geology along the
wave’s path, as well as the earthquake rupture characteristics and
faulting geometry. To capture these complexities in a reasonable number
of parameters, we employed a total of 301 features for each waveform
(data sample). The features were computed from waveform segments that
include: (1) the expected surface-wave arrival window (defined using a
group velocity window from 5.0 to 2.5 km/s); (2) a time window of common duration immediately before the surface waves; (3) ten evenly divided time
windows spanning the entire data sample. For each time window, we
calculated absolute energy (sum of all time samples squared), the sum of
absolute derivatives, kurtosis, skewness, maximum, minimum, mean,
standard deviation, nine quantiles (10%-90%), and number of time
samples. In addition to the absolute values, we also included ratios of these statistical features (excluding the number of time samples and the sum of absolute derivatives) for two time-window pairs. The first pair comprises the surface-wave window and the time window preceding the surface waves. The second pair comprises the two of the ten evenly divided windows that have the maximum and minimum absolute energy. We also included the
magnitude of the earthquake, event depth, azimuth, and distance between
the station and seismic event as signal-related features. As with all
signal classification studies, we explored these features guided by our
experience with surface wave analysis as well as numerical experiments
using the training and validation sets.
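To make the feature set concrete, the sketch below computes the per-window statistics named above with NumPy and SciPy and slices the expected surface-wave window from the 5.0-2.5 km/s group-velocity range. The function names and the assumption that the trace starts at the event origin time are ours; the full 301-feature vector additionally includes the window ratios and the source-station metadata described in the text.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def window_features(x):
    """Per-window statistics (a subset of the 301 features)."""
    feats = {
        "abs_energy": float(np.sum(x ** 2)),  # sum of squared samples
        "abs_sum_of_derivatives": float(np.sum(np.abs(np.diff(x)))),
        "kurtosis": float(kurtosis(x)),
        "skewness": float(skew(x)),
        "maximum": float(np.max(x)),
        "minimum": float(np.min(x)),
        "mean": float(np.mean(x)),
        "std": float(np.std(x)),
        "n_samples": len(x),
    }
    # Nine quantiles from 10% to 90%.
    for q in range(10, 100, 10):
        feats[f"quantile_{q}"] = float(np.quantile(x, q / 100))
    return feats

def surface_wave_window(trace, dist_km, dt=1.0):
    """Slice the expected surface-wave arrival window (5.0-2.5 km/s),
    assuming the trace starts at the event origin time and is sampled
    every dt seconds (1 s for LH channels)."""
    i0 = int(dist_km / 5.0 / dt)
    i1 = int(dist_km / 2.5 / dt)
    return trace[i0:i1]
```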
3.2 Machine learning
In the model construction stage, we used data from dataset DA and
randomly split it into three sets. We used 277,213 samples (waveforms)
for training (70% of total), 39,601 samples for validation (10% of
total), and 79,205 samples for testing (20% of total). The validation
set was used to choose training parameters and features, and the test set was used to evaluate the performance of the ML models. We used
scikit-learn’s implementations of LR, SVM, KNN, and RF; the ANN was implemented with Keras. For the SVM algorithm, we used both a linear
kernel (SVM-Linear) and a nonlinear kernel (SVM-Gaussian). The KNN model
used the five closest neighbors. The RF model contains 100 trees. The ANN model has six fully connected hidden layers (256 neurons each) with the rectified linear unit (ReLU) activation function, each followed by a dropout layer (10% dropout rate) to reduce overfitting. We set the batch size to 20 and the learning rate to 0.00001.
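A minimal Keras sketch consistent with this description follows. The Adam optimizer and the choice to apply dropout after every hidden layer (rather than once after the stack) are our assumptions, as the text leaves both unspecified.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.optimizers import Adam

N_FEATURES = 301  # one input per engineered feature

model = Sequential([Input(shape=(N_FEATURES,))])
for _ in range(6):  # six fully connected hidden layers
    model.add(Dense(256, activation="relu"))
    model.add(Dropout(0.10))  # 10% dropout to reduce overfitting
model.add(Dense(1, activation="sigmoid"))  # accepted vs. rejected

model.compile(optimizer=Adam(learning_rate=1e-5),
              loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=20,
#           validation_data=(X_val, y_val), epochs=...)
```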
3.3 ML model generality
We tested the generality of the ANN model using a collection of seismic
events located in different regions than the events used in the model
construction stage. The qualities of a subset of seismograms were
visually assessed by three analysts and compared against the ANN model
results. The original research objective for assigning a signal’s
quality label was to decide whether it had the bandwidth and
signal-to-noise ratio to perform well in a cross-correlation analysis,
as well as to recognize interference from other arrivals, instrument issues, nodal signals, and similar problems. We also tested the ANN’s generality by using it as
a screening procedure for an automated measurement of surface-wave group
velocities. The model was first applied to surface-wave seismograms in dataset DB (see next section); surface-wave group velocities were then automatically estimated from seismograms in dataset DC. Many of the group velocities
estimated from seismograms rejected by the ANN model were clear
outliers.
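In sketch form, this screening step amounts to thresholding the ANN’s acceptance probability and keeping only the measurements from accepted seismograms. The variable names and the 0.5 threshold below are illustrative; `model` is the trained network from the earlier sketch.

```python
import numpy as np

# feature_matrix: engineered features, one row per seismogram
# group_velocities: the corresponding automated measurements
probabilities = model.predict(feature_matrix).ravel()
accepted = probabilities >= 0.5  # keep likely-good seismograms only
screened_velocities = np.asarray(group_velocities)[accepted]
```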
4 Results
4.1 ML model construction and assessment
The performance of a classifier can be measured in a number of different ways, but the most essential metrics are constructed from the classifier’s counts of true and false positives and negatives. When
trained using all the training samples, the RF and ANN models outperformed LR, SVM-Linear, KNN, and SVM-Gaussian when applied to the test dataset in
terms of accuracy score, F1 score, and area under the receiver operating characteristic curve (AUC; see Text S1 for detailed definitions), as shown in Figure 2a. The receiver
operating characteristic curves show the same pattern as AUC (Figure
S7). The accuracy score, F1 score, and AUC for the ANN model are 0.92,
0.89, and 0.97, respectively. The performance of LR and SVM-Linear was
the poorest. The confusion matrices also show that the RF and ANN models
performed better than others (Figure S8).
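The accuracy and F1 scores and AUC are available directly in scikit-learn. A minimal sketch with placeholder arrays: `y_test` holds the ground-truth labels, `y_pred` the predicted labels, and `y_score` the predicted acceptance probabilities.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

accuracy = accuracy_score(y_test, y_pred)  # fraction of correct labels
f1 = f1_score(y_test, y_pred)              # harmonic mean of precision and recall
auc = roc_auc_score(y_test, y_score)       # area under the ROC curve
print(f"accuracy={accuracy:.2f}, F1={f1:.2f}, AUC={auc:.2f}")
```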
We visually checked waveforms to which the ANN model and a human analyst assigned different labels, using interactive visualization tools similar to Chai et al. (2018). We observed both human quality assignment errors and errors by the ANN model (see Figures S9 and S10 for
examples). The results indicate that the ANN was working at least as
accurately as human analysts. Mislabeling by human analysts is not
surprising given the tediousness of the task and the natural inclination
for humans to tire during the process. Mislabeling by the ANN reflects signals with characteristics that are absent from the training set, or combinations of features that contradict the general patterns in the training data.
The runtimes (which include loading the trained model and computing quality labels) of the LR, RF, and ANN models are among the fastest for 100,000 seismograms (using six 2.9 GHz CPU cores; Figure 2b). The SVM models are the slowest because the algorithm used was not parallelized. The trained KNN model uses the most disk space (1.4 GB), whereas the LR model requires the least (3 KB; Figure 3d). SVM-Linear, RF, and SVM-Gaussian require comparable storage. The ANN model requires 5 MB of
storage. Considering performance, runtime, and disk space, we prefer the
ANN model and the RF model for assigning a quality control value to
surface-wave seismograms.