Figure 1. Maps of the (a) earthquakes and (b) seismic stations used. The size of each circle in (a) is proportional to the event's magnitude. The gray circles and triangles are used for training (dataset DA), whereas red symbols are used to evaluate the ML model after training is completed (dataset DB). Thick lines are tectonic plate boundaries (Bird, 2003). (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)
During the model construction stage, we selected 40 seismic events from the United States Geological Survey (USGS) ComCat catalog (between January 1990 and January 2019) with the following criteria. (a) The events were located at least one arc degree away from any seismic event in dataset DA. (b) The events were randomly selected from four magnitude bins (magnitudes 5–6, 6–7, 7–8, and 8–9), with 10 events in each bin. We selected seismic stations equipped with long-period, high-gain seismometers (LH channels, 1 Hz sampling rate) at epicentral distances between 10 and 180 degrees; data from temporary network deployments were excluded. These seismograms will be referred to as dataset DB. After the ML models were trained, we also downloaded seismograms from 184 seismic events (Figure S2) with magnitude 6.0 and larger that occurred between January 2018 and May 2020 and were recorded at station SSPA near Standing Stone, Pennsylvania, USA. These seismograms comprise dataset DC.
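As a concrete illustration of these selection criteria, the sketch below queries ComCat by magnitude bin using ObsPy's FDSN client; the client name, time bounds, and sampling logic follow the description above, but this is our assumed reconstruction, not the code used in the study.

```python
# Sketch of the event selection criteria, assuming ObsPy's FDSN client
# for the USGS ComCat catalog (not the exact code used in this study).
import random
from obspy import UTCDateTime
from obspy.clients.fdsn import Client

client = Client("USGS")
t0, t1 = UTCDateTime("1990-01-01"), UTCDateTime("2019-01-01")

selected = []
for mmin, mmax in [(5, 6), (6, 7), (7, 8), (8, 9)]:
    catalog = client.get_events(starttime=t0, endtime=t1,
                                minmagnitude=mmin, maxmagnitude=mmax)
    # Additional screening (e.g., at least one arc degree away from any
    # dataset-DA event) would be applied here before sampling.
    selected.extend(random.sample(catalog.events, 10))
```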
2.2 Waveform quality labels
We compiled the quality labels for dataset DA from several earthquake relocation studies (Cleveland et al., 2018; Cleveland & Ammon, 2013, 2015; Kintner et al., 2018, 2019). Because the analysts followed different labeling conventions, the original quality labels have either five or four categories. In the five-category scheme, the three highest categories were considered acceptable (Figure S3a); in the four-category scheme, the two highest were considered acceptable (Figure S3b). To combine the datasets and maximize the number of labels, we mapped the quality labels into two categories, either accepted or rejected (see Figure S4 for waveform examples). The spatial distributions of quality labels show significant variations between seismic events (Figure S5) due to earthquake source differences and background noise variations.
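A minimal sketch of this binary mapping, assuming the original labels are integers from 1 (worst) to 4 or 5 (best):

```python
# Collapse the original quality categories into accepted/rejected,
# assuming 1-based integer labels where larger means better quality.
def to_binary(label, n_categories):
    if n_categories == 5:
        return "accepted" if label >= 3 else "rejected"  # top three of five
    if n_categories == 4:
        return "accepted" if label >= 3 else "rejected"  # top two of four
    raise ValueError("expected 4 or 5 categories")
```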
In addition, after the ML models were trained, three human analysts re-labeled 1,000 seismograms randomly selected from dataset DA. Half of these were assigned the same quality label by a human analyst and the ANN model and are referred to as Dataset 1. The other half were assigned different quality labels by a human analyst and the ANN model and are referred to as Dataset 2. The human analysts also labeled 2,000 seismograms from dataset DB after we applied the ANN model to all the seismograms in dataset DB. These seismograms were randomly selected such that (1) each of the 40 earthquakes contributed 50 waveforms and (2) 1,000 seismograms were accepted by the ANN model whereas the other 1,000 were rejected. We consider the majority vote of the three analysts as ground truth, which is more reliable (but costlier) than the labels used in the model construction stage.
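For reference, the majority vote over three binary labels can be computed directly (a sketch; variable names are illustrative):

```python
# Ground-truth label as the majority vote of three analysts; with an odd
# number of voters a unique winner always exists for binary labels.
from collections import Counter

def majority_vote(labels):
    return Counter(labels).most_common(1)[0][0]

print(majority_vote(["accepted", "rejected", "accepted"]))  # accepted
```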
3 Methods
Our analyses consisted of two stages (Figure S6): model construction and deployment. During the model construction stage, we compute statistical features from the surface waveforms and link them with manually assigned quality labels. These features and labels are then used to train an ML model. During the deployment stage, we obtain and compare ML-derived quality labels by applying the trained model directly to a test set of surface waveforms not used in model construction.
3.1 Feature engineering
Surface waveforms are one of the most recognizable parts of a seismic event's wavefield, but also one of the most variable. The character of the signal changes with source-to-station distance, the geology along the wave's path, and the earthquake's rupture characteristics and faulting geometry. To capture these complexities in a reasonable number of parameters, we employed a total of 301 features for each waveform (data sample). The features were computed from waveform segments that include: (1) the expected surface-wave arrival window (defined using a group velocity window from 5.0 to 2.5 km/s); (2) a time window of common duration before the surface waves; (3) ten evenly divided time windows spanning the entire data sample. For each time window, we calculated the absolute energy (sum of squared time samples), the sum of absolute derivatives, kurtosis, skewness, maximum, minimum, mean, standard deviation, nine quantiles (10%–90%), and the number of time samples. In addition to absolute values, we also included ratios of these statistical features (excluding the number of samples and the sum of absolute derivatives) for two time-window pairs. The first pair comprises the surface-wave window and the time window ahead of the surface wave. The second pair comprises the two of the ten evenly divided windows that have the maximum and minimum absolute energy. We also included the earthquake magnitude, event depth, azimuth, and station-to-event distance as additional features. As with all signal classification studies, we explored these features guided by our experience with surface-wave analysis as well as numerical experiments using the training and validation sets.
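To make the per-window statistics concrete, the sketch below computes the listed features for a single time window with NumPy and SciPy; the function and feature names are ours, not taken from the study's code.

```python
# Per-window statistical features described above; `x` is a 1-D NumPy
# array holding one time window of the seismogram.
import numpy as np
from scipy.stats import kurtosis, skew

def window_features(x):
    feats = {
        "absolute_energy": np.sum(x ** 2),           # sum of squared samples
        "sum_abs_derivatives": np.sum(np.abs(np.diff(x))),
        "kurtosis": kurtosis(x),
        "skewness": skew(x),
        "maximum": np.max(x),
        "minimum": np.min(x),
        "mean": np.mean(x),
        "std": np.std(x),
        "n_samples": x.size,
    }
    for q in range(10, 100, 10):                      # 10% ... 90% quantiles
        feats[f"quantile_{q}"] = np.quantile(x, q / 100.0)
    return feats
```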
3.2 Machine learning
In the model construction stage, we used data from dataset DA and randomly split it into three sets: 277,213 samples (waveforms) for training (70% of the total), 39,601 samples for validation (10%), and 79,205 samples for testing (20%). The validation set was used to choose training parameters and features; the test set was used to evaluate the performance of the ML models. We used scikit-learn's implementations of LR, SVM, KNN, and RF; the ANN was implemented with Keras. For the SVM algorithm, we used both a linear kernel (SVM-Linear) and a nonlinear kernel (SVM-Gaussian). The KNN model used the five closest neighbors. The RF model contains 100 trees. The ANN model has six fully connected hidden layers (256 neurons each), each using the rectified linear unit (ReLU) activation function and followed by a dropout layer (10% dropout rate) to reduce overfitting. We set the batch size to 20 and the learning rate to 0.00001.
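A minimal Keras sketch consistent with this description; the sigmoid output layer, the Adam optimizer, and the binary cross-entropy loss are our assumptions for a binary classifier and are not stated in the text.

```python
# Six fully connected hidden layers (256 ReLU units each), each followed
# by a 10% dropout layer; sigmoid output for the binary accept/reject label.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Dense(256, activation="relu", input_shape=(301,)))  # 301 features
model.add(Dropout(0.1))
for _ in range(5):
    model.add(Dense(256, activation="relu"))
    model.add(Dropout(0.1))
model.add(Dense(1, activation="sigmoid"))

model.compile(optimizer=Adam(learning_rate=0.00001),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=20, validation_data=(X_val, y_val))
```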
3.3 ML model generality
We tested the generality of the ANN model using a collection of seismic events located in different regions than the events used in the model construction stage. The quality of a subset of seismograms was visually assessed by three analysts and compared against the ANN model results. The original research objective in assigning a signal's quality label was to decide whether it had the bandwidth and signal-to-noise ratio to perform well in a cross-correlation analysis, as well as to recognize interference from other arrivals, instrument issues, nodal signals, etc. We also tested the ANN's generality by using it as a screening procedure for an automated measurement of surface-wave group velocities. The model was applied to surface-wave seismograms in dataset DB (see next section), and surface-wave group velocities were automatically estimated from seismograms in dataset DC. Many of the group velocities estimated from seismograms rejected by the ANN model were clear outliers.
4 Results
4.1 ML model construction and assessment
The performance of a classifier can be measured in a number of different ways, but the most essential metrics are constructed from the classifier's counts of true and false positives and negatives. When trained using all the training samples, the RF and ANN models outperformed LR, SVM-Linear, KNN, and SVM-Gaussian when applied to the test dataset in terms of accuracy score, F1 score, and area under the receiver operating characteristic curve (AUC) (see Text S1 for detailed definitions), as shown in Figure 2a. The receiver operating characteristic curves show the same pattern as the AUC values (Figure S7). The accuracy score, F1 score, and AUC for the ANN model are 0.92, 0.89, and 0.97, respectively. The performance of LR and SVM-Linear was the poorest. The confusion matrices also show that the RF and ANN models performed better than the others (Figure S8).
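All three metrics are available directly in scikit-learn; a toy example with placeholder labels standing in for the test-set predictions:

```python
# Accuracy, F1, and AUC as used to compare the models; y_score holds the
# model's predicted probability of the "accepted" class.
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0]                     # placeholder ground truth
y_pred = [1, 0, 1, 0, 0]                     # placeholder binary predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1]          # placeholder probabilities

accuracy = accuracy_score(y_true, y_pred)    # fraction labeled correctly
f1 = f1_score(y_true, y_pred)                # balances precision and recall
auc = roc_auc_score(y_true, y_score)         # area under the ROC curve
```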
We visually checked waveforms to which the ANN model and a human analyst assigned different labels, using interactive visualization tools similar to those of Chai et al. (2018). We observed both human quality assignment errors and errors by the ANN model (see Figures S9 and S10 for examples). The results indicate that the ANN performs at least as accurately as human analysts. Mislabeling by human analysts is not surprising given the tediousness of the task and the natural tendency for humans to tire during the process. Mislabeling by the ANN reflects signals with characteristics that are not represented in the training set, or combinations of features that contradict the general patterns in the training data.
The runtime (which includes loading the trained model and computing quality labels) of the LR, RF, and ANN models was among the shortest for 100,000 seismograms (using six 2.9 GHz CPU cores) (Figure 2b). The SVM models were the slowest since the implementation used was not parallelized. The trained KNN model uses the most disk space (1.4 GB), whereas the LR model requires the least (3 KB) (Figure 3d). SVM-Linear, RF, and SVM-Gaussian require comparable storage, and the ANN model requires 5 MB. Considering performance, runtime, and disk space, we prefer the ANN and RF models for assigning quality control values to surface-wave seismograms.