Figure 3. A comparison of performance (a and b), training time (c), and disk space usage (d) for different algorithms. The legends of (b) and (c) are the same as in (a).
4.2 Model Applications
We compared the performance of the ANN and RF models against three human analysts using datasets 1, 2, and 3. The results in Figure 4 indicate that the ANN and RF models performed similarly to human analysts for all three datasets, while using only 0.5% of the average human processing time (Figure 4b). In some cases, the ANN and RF models identified usable data that had been rejected by one of the human analysts (see Figure 4e for an example).

The direct outputs of the ANN and RF models are probability scores (ranging from 0 to 1), which are converted into two categories using a default threshold of 0.5: accepted (greater than or equal to 0.5) or rejected (less than 0.5). The probability threshold can be adjusted for stricter screening, and increasing it improves performance, as shown in Figures 4c and 4d. When the threshold is greater than 0.5, a seismogram can be assigned one of three categories instead of two. For example, a signal can be rejected if its probability score is less than 0.4, accepted if the score is greater than or equal to 0.6, or considered marginal if the score falls between 0.4 and 0.6. Marginal seismograms can then be further inspected by human analysts. As expected, a higher threshold yields fewer nonmarginal (accepted or rejected) labels (Figures 4c and 4d), that is, more waveforms for human analysts to inspect.

Similar to human analysts, the ANN and RF models sometimes agree and other times disagree. For dataset 3, the ANN and RF models together mislabeled 540 out of 2000 seismograms. Both models incorrectly labeled a common subset of 186 seismograms (9% of the total); the ANN model mislabeled an additional 207 seismograms (393 in total, 80% correct overall), while the RF model mislabeled another 147 seismograms (333 in total, 83% correct overall).
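The three-category screening described above is a simple thresholding rule on the models' probability scores. A minimal sketch is given below, assuming hypothetical cutoffs of 0.4 and 0.6 as in the example; the function name and the example scores are illustrative, not part of the published workflow.

```python
import numpy as np

def label_scores(scores, low=0.4, high=0.6):
    """Convert probability scores into three quality categories.

    Scores >= high are accepted, scores < low are rejected, and
    scores in [low, high) are flagged as marginal for human review.
    """
    scores = np.asarray(scores, dtype=float)
    return np.where(scores >= high, "accepted",
                    np.where(scores < low, "rejected", "marginal"))

# Example: scores for four seismograms from a trained model
print(label_scores([0.95, 0.55, 0.40, 0.10]))
# -> ['accepted' 'marginal' 'marginal' 'rejected']
```

Raising `low` and `high` widens the marginal band, trading a larger human-inspection workload for fewer automated misclassifications, which is the trend seen in Figures 4c and 4d.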
Though not directly trained for the quality control of group velocity estimation, we tested whether the ANN model would reduce outliers in automated group velocity measurements. The ANN-based quality control performed reasonably well for dataset DC, reducing the number of unrealistic group velocity values (Figure S11). The result is not perfect, but the operational burden of inspecting outlier observations is substantially reduced. Transfer learning (e.g., Chai et al., 2020) may further improve the performance of the ANN model for the quality control of group velocities.
5 Conclusions and Discussion
Using nearly 400,000 waveforms and corresponding quality labels, we applied and compared five ML algorithms (LR, SVM, KNN, RF, and ANN) intended to improve the efficiency of quality control for surface-wave seismograms. The ANN model achieved an accuracy of 0.92, an F1 score of 0.89, and an AUC of 0.97. The RF model followed closely, with slightly lower performance and higher storage requirements but faster processing times. Considering performance, processing speed, and storage requirements together, we prefer the ANN and RF models over the other algorithms tested. The performance of both the ANN and RF models matches that of human analysts on data they have never seen, while reducing the time invested in surface-wave quality control by 99.5%. We also show that quality labels from the ANN model help reduce outliers in group velocity measurements, even though the training labels were originally generated for signal cross-correlation analysis. The improved processing speed of the ANN model relative to human analysts, together with its demonstrated transfer to independent surface-wave measurements, shows that this technique can reduce the burden of quality control screening for large volumes of seismic data.
The trained ANN and RF models can be incorporated into an existing workflow that uses intermediate-period surface wave seismograms for earthquake and/or earth-structure studies. For fast-response applications, these two trained ML models can be applied automatically to identify good-quality data rapidly without human intervention. The execution speed of the two ML models can be easily increased with more computing resources. For more comprehensive studies, the trained models can be used to pre-screen a large amount of data and allow researchers to focus on a subset of data ranked by ML labels. The numeric quality scores from the RF and ANN ML models could also be used as initial quality weights in seismological analysis.
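The pre-screening and quality-weighting uses described above can be sketched as follows. This is an illustrative example only: the score array is hypothetical, and in practice the scores would come from the trained ANN or RF model (e.g., via a `predict_proba`-style call).

```python
import numpy as np

# Hypothetical probability scores for a batch of seismograms,
# as would be produced by the trained ANN or RF model.
scores = np.array([0.91, 0.12, 0.67, 0.48, 0.85])

# Pre-screening: rank seismograms so analysts inspect the most
# ambiguous waveforms first (scores nearest 0.5 are least certain).
review_order = np.argsort(np.abs(scores - 0.5))
print(review_order)  # -> [3 2 4 1 0]

# Quality weighting: use the scores directly as initial weights in a
# downstream analysis, e.g., a weighted average of some measurement.
measurements = np.array([3.1, 5.9, 3.3, 3.0, 3.2])  # illustrative values
weighted_mean = np.average(measurements, weights=scores)
```

Here the weighting simply down-weights low-score waveforms; a real workflow might instead pass the scores to an inversion code as a priori data weights.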