Fig. 1: Results of the sample preparation test. All graphs show peak intensity on the y-axis. In A, the x-axis shows the tissue:matrix ratio in µg per µl. In B, C, and D, the x-axis shows m/z values (mass-to-charge ratio). A) Maximum intensities as a measure of quality for the different sample to HCCA matrix ratios assessed for four species. Additionally, a dilution series (brown) was carried out for Cancer pagurus. B) Good-quality spectrum at a ratio of 3.12 µg tissue per µl matrix. C) Lower-quality spectrum at 0.39 µg µl⁻¹ showing a high baseline. D) Lower-quality spectrum at 25 µg µl⁻¹ showing stronger noise.
Optimizing the Random Forest model for classification
To apply Random Forest (RF) as a classification method, we evaluated how strongly the number of specimens per species influences model error. For species with at least 11 specimens in the data set (n = 20), two to eleven specimens per species were randomly sampled, and this sampling was repeated 100 times. These data were then used to create RF models, and the out-of-bag (OOB) error was assessed as a quality criterion. Increasing the number of specimens per species decreased the OOB error (Fig. 2). With only two specimens per species, the OOB error ranged from 0 to 0.375 with a mean of 0.18 (SD = 0.073). With eleven specimens per species, the error ranged from 0.005 to 0.036 with a mean of 0.019 (SD = 0.008). The decrease in OOB error nearly saturated for n > 10. For further analyses, we chose n = 6 because the results show a strong decrease in both OOB-error variability and maximum OOB error at this point.
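The resampling scheme described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names (`eligible_species`, `draw_subsample`, `oob_error_curve`) and the `error_fn` callback are hypothetical, and the step that fits an RF model and returns its OOB error is left to the caller so that the sketch stays independent of any particular RF implementation.

```python
import random
import statistics
from collections import defaultdict


def eligible_species(labels, min_specimens=11):
    """Map each species with at least `min_specimens` specimens to its specimen indices."""
    by_species = defaultdict(list)
    for idx, species in enumerate(labels):
        by_species[species].append(idx)
    return {s: idxs for s, idxs in by_species.items() if len(idxs) >= min_specimens}


def draw_subsample(by_species, k, rng):
    """Randomly pick k specimen indices (without replacement) from every species."""
    chosen = []
    for species in sorted(by_species):
        chosen.extend(rng.sample(by_species[species], k))
    return chosen


def oob_error_curve(labels, error_fn, sizes=range(2, 12), repeats=100,
                    min_specimens=11, seed=0):
    """Summarise the error returned by `error_fn` for each number of specimens per species.

    `error_fn` is a hypothetical callback: it receives the selected specimen
    indices, fits an RF model on them, and returns the OOB error.
    Every value in `sizes` must not exceed `min_specimens`.
    """
    rng = random.Random(seed)
    by_species = eligible_species(labels, min_specimens)
    summary = {}
    for k in sizes:
        errors = [error_fn(draw_subsample(by_species, k, rng)) for _ in range(repeats)]
        summary[k] = {
            "min": min(errors),
            "max": max(errors),
            "mean": statistics.mean(errors),
            "sd": statistics.stdev(errors),
        }
    return summary
```

With scikit-learn, for example, `error_fn` could fit `RandomForestClassifier(oob_score=True)` on the selected spectra and return `1 - model.oob_score_`. The per-size minimum, maximum, mean, and SD returned here correspond to the quantities reported in the paragraph above.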