Figure 3. A comparison of performance (a and b), training time (c), and disk space usage (d) for different algorithms. The legends of (b) and (c) are the same as in (a).
4.2 Model Applications
We compared the performance of the ANN and RF models against three human
analysts using datasets 1, 2, and 3. The results shown in Figure 4
indicate that the ANN and RF models performed similarly to human
analysts for all three datasets, while requiring only 0.5% of the average human processing time (Figure 4b). In some cases, the ANN and RF models identified usable data that had been rejected by one of the human analysts (see Figure 4e for an example). The direct
outputs of the ANN and RF models are probability scores (ranging from 0 to 1), which are then converted into two categories using a default threshold of 0.5: accepted (score larger than or equal to 0.5) or rejected (score smaller than 0.5). The probability threshold can be adjusted for stricter screening; increasing the threshold can improve performance, as shown in Figures 4c and 4d. When the threshold is larger than 0.5,
three categories can be assigned to a seismogram instead of two. For
example, a signal can be rejected if its probability score is smaller
than 0.4, accepted if the probability is larger than or equal to 0.6, or
considered marginal if its probability is between 0.4 and 0.6. The
marginal seismograms can then be inspected further by human analysts. As expected, a higher threshold leads to fewer nonmarginal (accepted or rejected) labels (Figures 4c and 4d), that is, more waveforms for human analysts to inspect. Like human analysts, the ANN and RF models sometimes agree and sometimes disagree. For dataset 3, the ANN and RF models combined mislabeled 540 seismograms out of a total of 2000. Both models incorrectly labeled a common subset of 186
seismograms (9% of the total); the ANN model mislabeled an additional
207 seismograms (393 total, overall 80% correct); the RF model
mislabeled another 147 seismograms (333 total, overall 83% correct).
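The three-category screening described above can be sketched in a few lines of Python; the 0.4 and 0.6 cutoffs follow the example in the text, and the function name is hypothetical.

```python
def screen(prob, reject_below=0.4, accept_above=0.6):
    """Map a model probability score (0-1) to a screening label.

    Scores below reject_below are rejected, scores at or above
    accept_above are accepted, and scores in between are flagged
    as marginal for human review.
    """
    if prob < reject_below:
        return "rejected"
    if prob >= accept_above:
        return "accepted"
    return "marginal"

labels = [screen(p) for p in (0.15, 0.55, 0.92)]
# -> ['rejected', 'marginal', 'accepted']
```

Raising the acceptance cutoff (or lowering the rejection cutoff) widens the marginal band and sends more waveforms to human review, matching the trade-off shown in Figures 4c and 4d.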
Though the ANN model was not directly trained for the quality control of group-velocity estimation, we tested whether it would reduce outliers in automated group-velocity measurements. The ANN-based quality control performed reasonably well for dataset DC, reducing the number of unrealistic group-velocity values (Figure S11). The result is not perfect, but the operational burden of inspecting outlier observations is substantially reduced. Transfer
learning (e.g., Chai et al., 2020) may further improve the performance
of the ANN model for the quality control of group velocities.
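As an illustration of this kind of quality-control screening, the sketch below drops group-velocity measurements whose waveforms scored below the acceptance threshold; the velocities and scores are invented for the example.

```python
# Illustrative values only: two of the velocities (9.7 and 0.4 km/s)
# are unrealistic outliers, and their waveforms receive low scores.
velocities = [2.9, 3.1, 9.7, 3.0, 0.4]   # group velocities, km/s
scores = [0.91, 0.88, 0.12, 0.75, 0.05]  # model probability scores

threshold = 0.5  # default acceptance threshold from the text
kept = [v for v, s in zip(velocities, scores) if s >= threshold]
# -> [2.9, 3.1, 3.0]
```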
5 Conclusions and Discussion
Using nearly 400,000 waveforms and corresponding quality labels, we
applied and compared five ML algorithms (LR, SVM, KNN, RF, and ANN)
intended to improve the efficiency of the quality control of
surface-wave seismograms. The ANN achieved an accuracy of 0.92, an F1 score of 0.89, and an AUC of 0.97. The RF model follows the ANN closely, with slightly lower performance and higher storage requirements but faster processing times. Considering performance, processing speed, and storage requirements, we prefer the ANN and RF models over the other algorithms tested. The performance of both the ANN and RF models matches that of human analysts on data they have never seen, while reducing the time invested in surface-wave quality control by 99.5%. We also show
that quality labels from the ANN model help reduce outliers in group velocity measurements, even though the training labels were originally generated for signal cross-correlation analysis. The improved processing speed of the ANN model relative to human analysts, together with the demonstrated application of the method to independent surface-wave measurements, shows that this technique can reduce the burden of quality control screening for large volumes of seismic data.
The trained ANN and RF models can be incorporated into an existing
workflow that uses intermediate-period surface wave seismograms for
earthquake and/or earth-structure studies. For fast-response
applications, these two trained ML models can be applied automatically
to identify good-quality data rapidly without human intervention. The
execution speed of the two ML models can be easily increased with more
computing resources. For more comprehensive studies, the trained models
can be used to pre-screen a large amount of data and allow researchers
to focus on a subset of data ranked by ML labels. The numeric quality scores from the RF and ANN models could also be used as initial quality weights in seismological analysis.
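One way the numeric scores could serve as initial quality weights, sketched here with invented numbers, is a probability-weighted average of measurements, so that low-confidence waveforms contribute less:

```python
# Hypothetical example: measurements weighted by ML probability scores.
measurements = [3.05, 3.10, 2.40, 3.00]  # illustrative measured values
weights = [0.95, 0.90, 0.20, 0.85]       # model probability scores

weighted_mean = sum(w * m for w, m in zip(weights, measurements)) / sum(weights)
```

Here the low-scoring value (2.40) barely shifts the weighted mean (about 3.01), whereas an unweighted mean would be pulled down to about 2.89.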