Dave Vieglais - Authorea

Material samples are vital across multiple scientific disciplines, with samples collected for one project often proving valuable for additional studies. The Internet of Samples (iSamples) project aims to integrate large, diverse, cross-discipline sample repositories and enable access to and discovery of material samples as FAIR data (Findable, Accessible, Interoperable, and Reusable). Here we report our recent progress in controlled vocabulary development and mapping. In addition to a core metadata schema to integrate SESAR, GEOME, Open Context, and Smithsonian natural history collections, three small but important controlled vocabularies (CVs) describing specimen type, material type, and sampled feature were created. The new CVs provide consistent semantics for high-level integration of existing vocabularies used in the source collections. Two methods were used to map source record properties to terms in the new CVs. First, keyword-based heuristic rules were manually created where existing terminologies were similar to the new CVs, as in records from SESAR, GEOME, and Open Context and in some aspects of Smithsonian Darwin Core records. For example, specimen type = liquid>aqueous in SESAR records was mapped to specimen type = liquid or gas sample and material type = liquid water. Second, a machine learning approach was applied to Smithsonian Darwin Core records to infer sampled feature terms from record text in the habitat, locality, higher geography, and higher classification fields. Applying fastText with a 600-billion-token general-domain corpus gave the model a degree of “understanding” of English words. Training sets of 200 and 995 records yielded 87% and 94% precision and 85% and 92% recall, respectively, a performance sufficient for production use.
Applying these approaches, more than 3 × 10^6 records from the four large collections have been mapped successfully to a common core data model, facilitating cross-domain discovery and retrieval of the sample records.
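
The keyword-based mapping described above can be illustrated with a minimal sketch. It is not the iSamples implementation: the rule table layout, the field names, and the function name map_record are hypothetical, and only the single SESAR rule (liquid>aqueous) is taken from the text.

```python
# Hypothetical rule table. Each rule looks for a keyword in one source field
# and, on a match, assigns terms from the new controlled vocabularies.
# Only this one rule comes from the abstract; the structure is an assumption.
RULES = [
    {
        "field": "specimen_type",
        "keyword": "liquid>aqueous",
        "maps_to": {
            "specimen_type": "liquid or gas sample",
            "material_type": "liquid water",
        },
    },
]


def map_record(record: dict) -> dict:
    """Return the CV terms assigned to a source record by the first matching rules."""
    mapped = {}
    for rule in RULES:
        # Compare case-insensitively; a missing field is treated as empty text.
        value = str(record.get(rule["field"], "")).lower()
        if rule["keyword"] in value:
            for cv_name, term in rule["maps_to"].items():
                # Earlier rules take precedence over later ones.
                mapped.setdefault(cv_name, term)
    return mapped


if __name__ == "__main__":
    print(map_record({"specimen_type": "Liquid>aqueous"}))
```

Records that match no rule return an empty mapping, which in this sketch would mark them as candidates for manual review or for the machine learning approach.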