Plain Language Summary
Surface waves generated by earthquakes carry valuable information about Earth’s subsurface and the sources that generate them. To reliably and robustly extract information from a suite of surface waveforms, the signals require quality control screening. This process has typically been done by experts labeling each data sample visually, which is time-consuming and tedious for large datasets. To speed up signal quality assessment, we trained machine learning methods using a large set of human-labeled waveforms. We compared five techniques: logistic regression, support vector machines, k-nearest neighbors, random forests, and artificial neural networks. The artificial neural networks performed the best and achieved an accuracy of 92%. Once trained, the neural network model matched human performance but reduced the time cost by 99.5% when applied to data it had never seen. Our analyses demonstrate that automated processing can improve the quality of surface-wave measurements without human quality control screening.
1 Introduction
Surface waves have long been used for subsurface imaging (e.g., Ekström, 2011) and earthquake source studies (e.g., Ammon, 2005). Recently, double-difference seismic source location derived using surface-wave cross-correlations at globally distributed stations has proven successful in various geological settings (Chai et al., 2019; Cleveland et al., 2015, 2018; Cleveland & Ammon, 2013; Howe et al., 2019; Kintner et al., 2018, 2019, 2020, 2021). These techniques require reliable surface-wave measurements, which is usually ensured through careful visual inspection of seismograms. With seismic network deployments increasing in frequency and size, the number of available surface waveforms is also increasing. More data are unequivocally a good thing, but quality control of the ever-growing data volumes requires substantial time and effort. The complexity of surface-wave signals and the spatially and temporally varying character of seismic background noise make reliable automation of the quality control process a challenge. In some cases, data quality control becomes the most time-consuming part of a seismological analysis.
Machine learning (ML) has shown promise when applied to a variety of seismological research problems. This includes body-wave detection and arrival-time picking (e.g., Chai et al., 2020; Mousavi et al., 2020; Perol et al., 2018; Ross et al., 2018; Yoon et al., 2015; L. Zhu et al., 2019; W. Zhu & Beroza, 2018) and signal association (e.g., McBrearty et al., 2019; Ross et al., 2019). ML has also been used for seismic source studies that include earthquake location (e.g., X. Zhang et al., 2020), earthquake magnitude estimation (e.g., Mousavi & Beroza, 2020), earthquake focal mechanism determination (e.g., Kuang et al., 2021), and seismic signal discrimination (e.g., Li et al., 2018; Meier et al., 2019; Seydoux et al., 2020). ML algorithms have also been developed for seismic tomography (e.g., Bianco & Gerstoft, 2018; Z. Zhang & Lin, 2020) and laboratory earthquake prediction (e.g., Rouet-Leduc et al., 2017). Most existing work has focused on body-wave analysis; few studies have applied ML to the quality control of regional and teleseismic intermediate-period surface waveforms.
An important application of ML in geophysics is to reduce the burden of seismic processing to a level that allows more observations (more earthquakes, more seismograms, etc.) to be included in seismic analyses. We develop automated quality control processes that decrease the data quality assessment burden and increase overall data quality. These processes are applicable to research efforts on Earth structure (Herrmann et al., 2021) and seismic source analysis (e.g., Lay et al., 2018), and they can also supply data for long-standing projects that quantify earthquake sources from regional to global scales (e.g., Ekström et al., 2012). No automated process is perfect, but application of ML approaches can effectively and efficiently identify the best and worst data and allow human attention to focus on marginal-quality and unexpected observations that require more understanding and experience to assess.
In this work, we explore the potential of ML to aid in the analysis of intermediate-period regional and teleseismic surface waves. We compiled roughly 400,000 surface-wave signals and associated quality labels from stations around the globe. The quality labels are from past studies that focused on events in various tectonic settings. We trained five ML models, logistic regression (LR; Hosmer Jr et al., 2013), support vector machines (SVM; Suykens & Vandewalle, 1999), k-nearest neighbors (KNN; Keller et al., 1985), random forests (RF; Breiman, 2001), and artificial neural networks (ANN; Jain et al., 1996), to perform automated quality control of intermediate-period surface-wave seismograms. We compared the performance, speed, and disk usage of these ML techniques. We also tested the general applicability of the best-performing model to events from other geographic regions.
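For illustration only, the sketch below shows one way such a five-model comparison could be set up with the scikit-learn library. The feature arrays, file names, and hyperparameters are hypothetical placeholders, not the configuration used in this study.

```python
# Minimal sketch of a five-classifier comparison with scikit-learn.
# Feature extraction, file names, and hyperparameters are illustrative
# placeholders, not the configuration used in this study.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# features: (n_waveforms, n_features) array derived from the seismograms
# labels:   (n_waveforms,) array of human-assigned quality labels (0 = bad, 1 = good)
features = np.load("features.npy")  # placeholder file names
labels = np.load("labels.npy")

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "ANN": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
}

for name, model in models.items():
    clf = make_pipeline(StandardScaler(), model)  # scale features before fitting
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```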
2 Data
The data consist of seismic waveforms (along with metadata) and quality labels. The seismograms were downloaded from the Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) archive. Each waveform is associated with a particular seismic event that has known location and origin time information. The seismograms start six minutes before the origin time and end 200 minutes after the origin time. We removed the instrument response from the seismograms and rotated the horizontal components from the original north-south and east-west orientations to the radial and transverse coordinate system. To isolate intermediate-period Love and Rayleigh waves, we bandpass filtered the seismograms to retain signals with periods between 30 and 60 s.
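As an illustration of these preprocessing steps, the following sketch uses the ObsPy library; the event parameters, station codes, and channel selection are hypothetical placeholders rather than values from our datasets.

```python
# Illustrative preprocessing sketch using ObsPy; event parameters, station
# codes, and channel selection are hypothetical placeholders.
from obspy import UTCDateTime
from obspy.clients.fdsn import Client
from obspy.geodetics import gps2dist_azimuth

client = Client("IRIS")
origin_time = UTCDateTime("2016-10-01T00:00:00")  # hypothetical origin time
ev_lat, ev_lon = 10.0, -85.0                      # hypothetical epicenter
st_lat, st_lon = 34.95, -106.46                   # hypothetical station coordinates

# Request the 6-minutes-before to 200-minutes-after window, plus the metadata
# needed for response removal and component rotation.
t1, t2 = origin_time - 6 * 60, origin_time + 200 * 60
inv = client.get_stations(network="IU", station="ANMO", location="00",
                          channel="LH?", starttime=t1, endtime=t2,
                          level="response")
st = client.get_waveforms("IU", "ANMO", "00", "LH?", t1, t2)

# Remove the instrument response to obtain ground displacement.
st.remove_response(inventory=inv, output="DISP")

# Rotate the horizontals to radial and transverse using the station-to-event
# back azimuth (an intermediate ZNE step handles 1/2-oriented channels).
_, _, back_azimuth = gps2dist_azimuth(ev_lat, ev_lon, st_lat, st_lon)
st.rotate(method="->ZNE", inventory=inv)
st.rotate(method="NE->RT", back_azimuth=back_azimuth)

# Bandpass filter to retain periods between 30 and 60 s.
st.filter("bandpass", freqmin=1.0 / 60.0, freqmax=1.0 / 30.0,
          corners=4, zerophase=True)
```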
2.1 Seismic data
During the model construction stage, we used observations from 759 seismic events and 4,502 seismic stations (Figure 1). The seismograms were analyzed for previous earthquake relocation efforts (Cleveland et al., 2018; Cleveland & Ammon, 2013, 2015; Kintner et al., 2018, 2019). The origin times of these seismic events range from May 1989 to October 2016 (Figure S1a). The magnitudes of the events range from roughly 4.5 to 7.8 (Figure S1b). Event-station distances span a wide range, from 10 to 180 degrees (Figure S1c). For a group-velocity window from 5.0 to 2.5 km/s, the expected surface-wave window lengths range from 222 s to 3,979 s (Figure S1d). We refer to these seismograms as dataset DA.
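The expected surface-wave window follows directly from the epicentral distance and the group-velocity range. The short sketch below illustrates the calculation, assuming an approximate spherical-Earth conversion of 111.19 km per degree; small differences from the values quoted above can arise from the exact distances and conversion factor used.

```python
# Sketch of the expected surface-wave window for a given epicentral distance,
# using the 5.0-2.5 km/s group-velocity range quoted above. The
# degrees-to-kilometers factor assumes a spherical Earth.
KM_PER_DEGREE = 111.19  # approximate great-circle kilometers per degree

def surface_wave_window(distance_deg, v_fast=5.0, v_slow=2.5):
    """Return (start, end, length) of the expected window, in seconds
    after the origin time."""
    distance_km = distance_deg * KM_PER_DEGREE
    t_start = distance_km / v_fast  # earliest arrival (fast group velocity)
    t_end = distance_km / v_slow    # latest arrival (slow group velocity)
    return t_start, t_end, t_end - t_start

# At a 10-degree epicentral distance the window is roughly 222 s long,
# consistent with the lower bound quoted for dataset DA; the upper bound
# depends on the largest event-station distance in the dataset.
print(surface_wave_window(10.0))
```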