Figure 2. A comparison of (a) performance and (b) runtime for the test set from dataset DA. The performance analysis uses ML models trained on all training samples in the dataset. The runtime is measured by recording the time each ML algorithm takes to load the trained model and compute quality labels for 100,000 seismograms.
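As an illustration of how such a runtime benchmark can be set up (this is a minimal sketch, not the authors' code), the script below times loading a serialized scikit-learn model and labeling 100,000 feature vectors. The model, the feature dimensionality, and the file name are placeholders standing in for the study's trained quality classifier and real seismogram features.

```python
import time

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in model and features; in the study these would be the trained
# quality classifier and features extracted from real seismograms.
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(1000, 20))
y_fit = rng.integers(0, 2, size=1000)
joblib.dump(RandomForestClassifier(n_estimators=100).fit(X_fit, y_fit),
            "quality_model.joblib")

X = rng.normal(size=(100_000, 20))  # features for 100,000 seismograms

start = time.perf_counter()
model = joblib.load("quality_model.joblib")  # load the trained model...
labels = model.predict(X)                    # ...and label all waveforms
elapsed = time.perf_counter() - start
print(f"Load + inference for {len(X):,} seismograms: {elapsed:.2f} s")
```

Timing the load and the prediction together, as in the caption, reflects the deployment cost of applying a saved model to a new batch of waveforms rather than raw prediction speed alone.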
We also constructed ML models using subsets of the complete training set to investigate model performance as a function of the number of training samples. This analysis used training sets of 100, 200, 500, 1,000, 2,000, 5,000, 10,000, 20,000, 50,000, and 100,000 waveforms. As expected, the F1-score of every algorithm improved with an increasing number of training samples (Figures 3a and 3b). However, the returns diminish: initial improvement occurs rapidly, but as the training set grows and accuracy increases, substantially more data are needed to achieve the same relative gain. The RF algorithm had the best accuracy and F1-score when the number of training samples was 20,000 or fewer, whereas the ANN algorithm surpassed the RF method once the training set exceeded 20,000 samples. As shown in Figure 3c, the training times (using thirty-two 2.1-GHz Intel Xeon cores) for the LR, KNN, and RF algorithms are shorter than those of the other ML techniques. The training time of the SVM models increases rapidly with the number of training samples. The ANN model took longer to train, but its training time increases more slowly with the number of training samples.
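For readers who want to reproduce this kind of learning-curve analysis, the sketch below illustrates the procedure with scikit-learn: train on progressively larger subsets, evaluate the F1-score on a fixed test set, and record the training time. The synthetic data and the RF configuration are stand-ins for the study's actual features and models; the other classifiers (LR, KNN, SVM, ANN) can be swapped in at the marked line.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the labeled seismogram features;
# in the actual study these would come from dataset DA.
X, y = make_classification(n_samples=120_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=20_000, random_state=0
)

subset_sizes = [100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000]
for n in subset_sizes:
    # Swap in LogisticRegression, KNeighborsClassifier, SVC, or
    # MLPClassifier here to compare the other algorithms.
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    t0 = time.perf_counter()
    clf.fit(X_train[:n], y_train[:n])  # train on the first n samples
    train_time = time.perf_counter() - t0
    f1 = f1_score(y_test, clf.predict(X_test))  # fixed held-out test set
    print(f"n={n:>6d}  F1={f1:.3f}  train_time={train_time:.1f} s")
```

Evaluating every subset against the same held-out test set, as done here, is what makes the F1-scores comparable across training-set sizes.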