Fig. 2: Random forest model quality assessed with increasing number of
specimens per species. For each number of specimens, 100 data sets were
created by random sampling. The OOB error (y-axis) decreases as the
number of specimens (x-axis) increases and then begins to level off.
Around 10 specimens per species are therefore generally recommended to
obtain a high-quality model.
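The resampling scheme described in the caption can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the function names, the toy species labels and the fixed seed are assumptions made for the example. Each returned data set would then be used to train an RF model and record its OOB error.

```python
import random
from collections import defaultdict


def subsample_per_species(labels, n_per_species, rng):
    """Return sorted indices of n_per_species random specimens per species."""
    by_species = defaultdict(list)
    for idx, species in enumerate(labels):
        by_species[species].append(idx)

    chosen = []
    for species in sorted(by_species):
        members = by_species[species]
        if len(members) < n_per_species:
            raise ValueError(f"{species} has only {len(members)} specimens")
        # Sampling without replacement within one species.
        chosen.extend(rng.sample(members, n_per_species))
    return sorted(chosen)


def repeated_subsamples(labels, n_per_species, n_repeats=100, seed=0):
    """Create n_repeats random data sets (as index lists) for one sample size."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    return [subsample_per_species(labels, n_per_species, rng)
            for _ in range(n_repeats)]


# Toy input (hypothetical): three species with 12 specimens each.
labels = [s for s in ("sp_a", "sp_b", "sp_c") for _ in range(12)]
datasets = repeated_subsamples(labels, n_per_species=10)
print(len(datasets), len(datasets[0]))
```

Returning index lists rather than copies of the spectra keeps the sketch independent of how the measurements themselves are stored.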
Standardization of data processing
Individual data processing steps can strongly affect classification
results. The effect of varying these steps was evaluated using the RF
OOB error as an indicator. For each data set, an RF model was trained
and the OOB error recorded (supplementary figure 1). Whereas altering
the number of baseline subtraction iterations generally had little
impact on the RF OOB error, changing HWS and SNR had greater effects
(supplementary figure 1). The GAM shows that the OOB error is
significantly influenced by altering the HWS (Table 1, p-value: 0.007)
and the SNR (Table 1, p-value: 0.001). A
combination of 22 baseline estimation iterations, HWS of 7 and SNR of 3
resulted in the lowest OOB error of 0.032. These settings were used for
further analyses.
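The parameter selection described above amounts to an exhaustive search over combinations of processing settings, scored by OOB error. The following Python sketch illustrates that idea under stated assumptions: the grid values, the function names and the toy scoring function are hypothetical and do not reproduce the study's data or results. In a real pipeline the scoring function would reprocess the spectra with the given settings, train an RF model and return its OOB error (for example, 1 - oob_score_ of a scikit-learn RandomForestClassifier fitted with oob_score=True).

```python
from itertools import product


def grid_search(oob_error, iterations_grid, hws_grid, snr_grid):
    """Return (lowest error, settings) over all parameter combinations."""
    best = None
    for iterations, hws, snr in product(iterations_grid, hws_grid, snr_grid):
        err = oob_error(iterations, hws, snr)
        if best is None or err < best[0]:
            best = (err, {"iterations": iterations, "hws": hws, "snr": snr})
    return best


def toy_oob_error(iterations, hws, snr):
    """Placeholder score with an arbitrary minimum; NOT measured data."""
    return (0.001 * abs(iterations - 20)
            + 0.01 * abs(hws - 5)
            + 0.02 * abs(snr - 4))


# Hypothetical grids chosen only for illustration.
best_error, best_params = grid_search(
    toy_oob_error,
    iterations_grid=range(10, 31, 5),
    hws_grid=range(3, 12, 2),
    snr_grid=range(2, 7),
)
print(best_error, best_params)
```

Passing the scoring function as an argument keeps the search loop separate from the preprocessing and model training, so the same loop can be reused when other processing steps are varied.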
Table 1: Results of the GAM analyses to detect the most important
variable for data processing optimization.