Discussion
The aims of this study were (1) to evaluate the wide applicability of proteomic fingerprinting for species identification in marine science across different metazoan phyla and classes, (2) to identify critical steps in sample preparation and data processing, and (3) to contribute to the development of standard procedures and best practices for MALDI-TOF MS-based metazoan classification. The general applicability to metazoans has been demonstrated before (Mazzeo et al., 2008; Dieme et al., 2014; Yssouf et al., 2014; Flaudrops et al., 2015; Mazzeo and Siciliano, 2016; Maász et al., 2017; Rossel and Martínez Arbizu, 2019; Rossel et al., 2020a). However, here we show for the first time the applicability of this method to a large taxonomic range using a comprehensive data set, with an overall species identification success rate of 93%.
Similarly high identification success rates at species level were observed for individual metazoan groups (Hynek et al., 2018; Vega-Rúa et al., 2018; Holst et al., 2019; Loaiza et al., 2019; Rakotonirina et al., 2020; Rossel et al., 2020a). Additionally, our results show that specimens absent from the reference library will be assigned to the correct phylum or class with a high probability, implying some kind of phylogenetic signal at higher taxonomic levels, as was already reported for congeneric Drosophila (Feltens et al., 2010). Testing whether specimens would be classified as a congeneric species in the absence of the actual species was less promising in our study, with only 30% of specimens being assigned to a congeneric species. This agrees with other studies that show only occasional similarity of congeneric species, e.g. in cluster analyses, but without consistency across all congeneric species (Laakmann et al., 2013; Chavy et al., 2019; Rossel and Martínez Arbizu, 2019).
In closely related species, morphological identification can often be complicated. Using proteomic fingerprinting, however, these problems can be resolved, as indicated by the analysis of the A. irregularis complex. Even though mass spectra show high similarities, distinct patterns of peak presence and absence as well as pronounced differences in relative peak intensities serve as good markers for species identification. Beyond mere species identification, the example of E. acutifrons shows the power of the method to differentiate specimens even by sex. This has been shown before, e.g. for the fish species Alburnus alburnus (Linnaeus, 1758) (Maász et al., 2017). Whereas those authors focused on presence and absence of peaks, we were able to show that relative intensities of certain mass peaks also play an important role in the differentiation of sexes. Prior studies on larger planktonic copepods have also shown great potential for differentiation of developmental stages based on a proteomic fingerprint (Rossel et al., 2022).
Finally, we have shown the necessity of comprehensive reference libraries. Low numbers of specimens per species in reference libraries fail to provide sufficient information on species-specific mass spectra features and intraspecific variability. Only with around nine to ten reference specimens per species does the identification error stabilize at a consistently low level. This supports findings by Rakotonirina et al. (2020), who found an increase in identification score with increasing numbers of available main spectrum patterns. In general, we recommend using more than three specimens per species and preferably including around ten specimens for every species in a reference library.
MALDI-TOF MS can be used as a universal method for species identification of metazoan species. Due to the short preparation time, low costs (Tran et al., 2015; Rossel et al., 2019) and high identification success, it can be a valuable tool in biodiversity assessments, replacing time-intensive morphological identification or costly DNA barcoding. Especially in cases of closely related or very similar species, it can foster rapid identification. The applicability of proteome fingerprinting for the differentiation of cryptic species has already been shown, and even in cases of morphologically very similar species, differences were still found (Müller et al., 2013; Paulus et al., 2022).
Tissue samples used in this work were obtained from specimens stored for between seven and 12 years under partly unknown storage conditions. We assume that working with fresh or recently fixed material would have resulted in even higher identification success rates. This is supported by the high mass spectra quality obtained from fish species, which were extracted and put into freezer storage almost immediately after sampling (personal communication Knebelsberger). The adverse effect of fixation and storage on resulting mass spectra quality in metazoans has been investigated several times and supports this assumption (Rossel and Martínez Arbizu, 2018b; Rakotonirina et al., 2020). We obtained good results for storage at -20°C and also for long-term storage at -80°C; thus we recommend cold storage of samples at -20°C until further systematic analyses specify threshold temperatures for short-term (months) or long-term (years) storage.
Our tests have shown that sample concentration is pivotal to obtaining good quality mass spectra. While sample/matrix ratios that are too low will result in lower intensities and a higher baseline, too much tissue will increase the noise in the data and result in unsuccessful measurements. For all investigated taxa, the same sample preparation method was used; however, attention must be paid to the correct ratio of matrix to the compound to be analyzed. This allows the wide application of this method without adaptation of the protocol to a certain species, as would be necessary for methods such as COI barcoding, where certain groups would need highly specific sets of amplification primers (Lohman et al., 2009; Toumi et al., 2013) and adjustment of PCR settings.
Much effort is put into optimizing mass spectra quality by adjusting different preparation protocols (Jeverica et al., 2018; Wang et al., 2021) or developing methods for steps such as baseline correction, smoothing or peak picking (Ressom et al., 2007; Shin et al., 2010). Methods are adjusted either to increase classification success or to obtain better mass spectra reproducibility. Here, we tested the influence of certain steps during data processing on classification success, focusing on the important steps for peak detection. Whereas baseline subtraction and adjustment of the SNR value both aim at reducing noise within the data, adjusting the HWS influences the peak picking resolution. Thus, by decreasing the HWS during peak detection, the number of peaks will increase, as only the highest peak within the HWS will be detected. This will result in peaks of very similar size being recognized as distinct peaks, rather than being put together in a single bin. This also explains the strong effect of both parameters, SNR and HWS, compared to baseline subtraction. Baseline subtraction is constrained to reducing instrument-dependent noise. Adjustment of the SNR value will, however, like HWS alteration, affect the number of more dominant peaks and thus the general resolution of the mass spectra. Hence, more species-specific information is retained and more information is available for classification. Based on our results, rather than testing all variables, adjusting SNR and HWS should be adequate to optimize the data pipeline. However, it needs to be emphasized that this pipeline aims at optimizing species identification and may not be adequate for investigation of intraspecific variability, as was shown elsewhere (ref. 16).
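The interplay of HWS and SNR during peak detection can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the R pipeline used in this study. The function names (`mad_noise`, `detect_peaks`), the toy intensity vector and the choice of a scaled median absolute deviation as a constant noise estimate are assumptions made for this example. A point is accepted as a peak if it is the highest value within HWS points on either side and its intensity exceeds SNR times the estimated noise.

```python
from statistics import median


def mad_noise(intensities):
    """Constant noise estimate: scaled median absolute deviation (assumed here)."""
    med = median(intensities)
    return 1.4826 * median([abs(x - med) for x in intensities])


def detect_peaks(intensities, half_window_size, snr):
    """Return indices of local maxima whose intensity exceeds snr * noise.

    A point is a local maximum if it is the highest value within
    half_window_size points on either side (ties go to the first index).
    """
    noise = mad_noise(intensities)
    n = len(intensities)
    peaks = []
    for i, value in enumerate(intensities):
        lo = max(0, i - half_window_size)
        hi = min(n, i + half_window_size + 1)
        window = intensities[lo:hi]
        is_local_max = value == max(window) and window.index(value) == i - lo
        if is_local_max and value > snr * noise:
            peaks.append(i)
    return peaks


# Hypothetical toy spectrum: two neighbouring signals on a low background.
spectrum = [0, 1, 0, 1, 10, 1, 8, 1, 0, 1, 0, 1, 0]

narrow = detect_peaks(spectrum, half_window_size=1, snr=3)
wide = detect_peaks(spectrum, half_window_size=3, snr=3)
strict = detect_peaks(spectrum, half_window_size=1, snr=6)
```

In this toy example, the narrow window keeps both neighbouring signals as separate peaks, whereas the wider window retains only the higher of the two, and a stricter SNR removes the weaker signal even with the narrow window. This mirrors the behaviour described above, under the stated assumptions.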
In summary, we propose a workflow applicable to any metazoan species or tissue sample to be identified: A comprehensive reference library is needed, with species-level identification by morphological or molecular approaches (Fig. 5a). In the lab, a small tissue fragment (up to 1 mm³) is retrieved and incubated for at least 5 minutes in the HCCA-matrix solution. Of the resulting extract, 1 to 1.5 µl are transferred to a target plate for measurement. Data processing is carried out in R (Fig. 5b). Mass spectra quality control is done by eye and supported by R packages such as MALDIrppa (Palarea-Albaladejo et al., 2017). Finally, based on previously assessed species identification, data processing can be optimized to obtain ideal settings for classification. Based on our results, this can be narrowed to adjustment of the HWS and SNR values. Based on the reference library, an RF model can be calculated for specimen identification (Fig. 5c). Applying a post-hoc test will provide further support for the identification. If classification is not well supported, an RF model at class or phylum level can be applied to obtain higher-level classification.