Figure 2. A comparison of (a) performance and (b) runtime for the test set from dataset DA. The performance analysis includes all training samples in the dataset. The runtime was measured by recording the time each ML algorithm takes to load its trained model and compute quality labels for 100,000 seismograms.
We also constructed ML models using subsets of the complete training set to investigate model performance as a function of the number of training samples. This analysis used training sets of 100, 200, 500, 1,000, 2,000, 5,000, 10,000, 20,000, 50,000, and 100,000 waveforms. As expected, the F1-score of every algorithm improved as the number of training samples increased (Figures 3a and 3b). However, the returns diminish: improvement is rapid at first, but as the training set grows and accuracy rises, progressively more data are needed to achieve the same relative gain. The RF algorithm has the best accuracy and F1-score when the number of training samples is 20,000 or fewer; the ANN algorithm surpasses RF once the training set exceeds 20,000 samples. As shown in Figure 3c, the training times (using thirty-two 2.1-GHz Intel Xeon cores) of the LR, KNN, and RF algorithms are shorter than those of the other ML techniques. The training time of the SVM models increases rapidly with the number of training samples. The ANN model takes longer to train, but its training time grows more slowly with the number of training samples.
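The subset experiment described above can be sketched as a simple learning-curve loop: train each classifier on progressively larger slices of the training set and record the F1-score on a fixed held-out test set. The sketch below uses synthetic data and scikit-learn defaults as stand-ins for the waveform features and the paper's tuned models (both are assumptions, not the authors' actual pipeline), and truncates the subset schedule to sizes that fit the synthetic set.

```python
# Hedged sketch of the learning-curve experiment: synthetic features
# stand in for the seismogram-derived inputs, and LR/RF stand in for
# the full set of algorithms compared in the paper.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic binary-labeled data (placeholder for quality-labeled waveforms).
X, y = make_classification(n_samples=6000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1000, random_state=0)

# Truncated version of the paper's subset schedule.
subset_sizes = [100, 200, 500, 1000, 2000, 5000]
models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

# For each subset size, refit every model and score it on the same test set.
scores = {name: [] for name in models}
for n in subset_sizes:
    for name, model in models.items():
        model.fit(X_train[:n], y_train[:n])
        scores[name].append(f1_score(y_test, model.predict(X_test)))
```

Plotting `scores` against `subset_sizes` (log x-axis) reproduces the qualitative shape described in the text: steep early gains that flatten as the training set grows.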