Fig. 4: A The 30 most important peaks for differentiation of the starfish A. irregularis groups within the random forest model. Species according to COI delimitation are given on top. Molecule masses sorted by size are given on the left-hand side. B Hierarchical clustering depicts differentiation of the copepod E. acutifrons specimens at the sex level. Nodal bootstrap support is displayed at the nodes of the tree. The heatmap below the clustering results depicts the 30 most important mass peaks for sex differentiation using a random forest model with color-coded peak intensities. Data from the marine copepod Microarthridion littorale (Poppe, 1881) from the same study were used here as an outgroup species. Relative intensities are color coded.
Case study - sex determination
Previous research showed that sex determination may be possible in some species by analyzing the proteomic fingerprint (Rossel and Martínez Arbizu, 2019); however, the data were not analyzed any further therein. In-depth analyses support these findings and show sex-specific protein patterns in the crustacean copepod Euterpina acutifrons (Fig. 4B). Mass peaks such as m/z 2523, 2929 and 7417 are female-specific and were not found in any of the male specimens. Others, however, occur predominantly in male specimens (m/z 3638, 3719). Further mass peaks are observed equally often in measurements from both sexes but differ in their intensity patterns.
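The distinction between sex-specific, predominantly sex-associated and shared peaks can be illustrated with a minimal Python sketch. It assumes that each spectrum has already been reduced to a set of binned m/z values. The function names and the toy data are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def peak_occurrence_by_group(spectra, groups):
    """Return {peak: {group: fraction of specimens in which the peak occurs}}."""
    group_sizes = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for peaks, group in zip(spectra, groups):
        group_sizes[group] += 1
        for peak in set(peaks):
            counts[peak][group] += 1
    return {
        peak: {g: counts[peak][g] / n for g, n in group_sizes.items()}
        for peak in counts
    }


def group_specific_peaks(occurrence, group):
    """Peaks found in `group` and in no specimen of any other group."""
    return sorted(
        peak
        for peak, fractions in occurrence.items()
        if fractions[group] > 0
        and all(f == 0 for g, f in fractions.items() if g != group)
    )


# Toy data, invented for illustration only (not measured values).
toy_spectra = [{1000, 2000}, {1000, 2000, 3000}, {1000, 3000}, {1000, 3000}]
toy_sexes = ["female", "female", "male", "male"]

occurrence = peak_occurrence_by_group(toy_spectra, toy_sexes)
female_only = group_specific_peaks(occurrence, "female")
```

In the toy data, peak 2000 occurs only in females, peak 3000 occurs in all males but only half of the females, and peak 1000 occurs in every specimen. Intensity differences in shared peaks would need the intensity values themselves, which this presence/absence sketch ignores.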
Phyla and class models for identification
If a species is not part of a reference library, it may be desirable to obtain a higher-level classification. To test whether this is possible based on MALDI-TOF mass spectra of metazoans, each species was removed in turn from the RF training data set and classified with an RF model that was trained at a higher taxonomic level and did not include any information on the species to be classified. Across all phyla, a classification success of 81% (77% true positive rate (tpr)) was achieved, with phylum-wise success rates ranging from 73% (64% tpr) in Echinodermata to 95% (92% tpr) in Chordata (Fig. 3B). At class level, the combined success rate was 72% (66% tpr), ranging from 7% (0% tpr) in Polyplacophora, for which only two species were included in the data set, to 96% (94% tpr) in Teleostei.
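The hold-out procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the classifier is a simple nearest-centroid stand-in rather than a random forest, and all function names, feature vectors and labels are hypothetical toy values.

```python
import math


def nearest_centroid_predict(train_x, train_labels, query):
    """Stand-in classifier (NOT a random forest): label of the closest centroid."""
    sums, counts = {}, {}
    for x, label in zip(train_x, train_labels):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1

    def distance(label):
        centroid = [s / counts[label] for s in sums[label]]
        return math.dist(centroid, query)

    return min(sorted(sums), key=distance)


def leave_one_species_out(features, species, higher_taxon,
                          predict=nearest_centroid_predict):
    """Hold out each species in turn and predict the higher taxon of its specimens.

    Returns the fraction of held-out specimens assigned to the correct taxon.
    """
    correct = total = 0
    for held_out in sorted(set(species)):
        train_idx = [i for i, s in enumerate(species) if s != held_out]
        test_idx = [i for i, s in enumerate(species) if s == held_out]
        train_x = [features[i] for i in train_idx]
        train_y = [higher_taxon[i] for i in train_idx]
        for i in test_idx:
            total += 1
            correct += predict(train_x, train_y, features[i]) == higher_taxon[i]
    return correct / total


# Toy data, invented for illustration: four species in two phyla.
toy_features = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.1, 0.2],
                [1.0, 0.9], [0.9, 1.0], [1.1, 1.0], [1.0, 1.1]]
toy_species = ["sp_a", "sp_a", "sp_b", "sp_b", "sp_c", "sp_c", "sp_d", "sp_d"]
toy_phylum = ["P1", "P1", "P1", "P1", "P2", "P2", "P2", "P2"]

success = leave_one_species_out(toy_features, toy_species, toy_phylum)
```

The `predict` argument can be replaced by any classifier with the same call signature, for example a wrapper around a random forest implementation. Note that a taxon represented by only a few species retains very little training data once one of them is held out.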
For 31 taxa (n = 324), a congeneric species was included in the data set. It was therefore tested whether specimens tend to be classified as a congeneric species when their own species is removed from the training data. For these 31 taxa, 30% of specimens were classified as a congeneric species.
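The share of specimens assigned to a congeneric species can be computed as in the following minimal Python sketch. It assumes that species labels are binomial names whose first word is the genus. The function names and the placeholder taxon names are hypothetical.

```python
def genus_of(binomial):
    """Return the genus, assumed to be the first word of a binomial name."""
    return binomial.split()[0]


def congeneric_rate(true_species, predicted_species):
    """Fraction of specimens predicted as a different species of the same genus."""
    hits = sum(
        genus_of(t) == genus_of(p) and t != p
        for t, p in zip(true_species, predicted_species)
    )
    return hits / len(true_species)


# Placeholder names, invented for illustration only.
toy_true = ["Aus bus", "Aus bus", "Aus bus", "Xus yus"]
toy_predicted = ["Aus cus", "Dus eus", "Aus cus", "Xus zus"]

rate = congeneric_rate(toy_true, toy_predicted)
```

In this toy example, three of the four specimens are assigned to a congener, giving a rate of 0.75.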