Shelley Stall

and 9 more

Research data are a vital component of the scientific record, yet discovering and assessing data for possible reuse in future research remains challenging. The Belmont Forum has recently awarded funds to three international teams as part of a four-year Collaborative Research Action (CRA) on Science-driven e-Infrastructure Innovation (SEI) for the Enhancement of Transnational, Interdisciplinary and Transdisciplinary Data Use, with the goal of improving data management practices that increase data reuse. One of these awardees, PARSEC, comprises two interwoven strands: one focused on improving data practices for reuse and credit, and one on synthesis science. The data specialists work alongside the synthesis science researchers as they determine the influence of natural protected areas on socioeconomic outcomes for local communities. They collaborate with the researchers to better understand their motivations and work practices, and to guide them through the data-related steps that need to be taken during the research lifecycle. This ensures that their data and code are FAIR-compliant, making the data more likely to be reused and the analyses more likely to be reproducible. The PARSEC team is working with the Research Data Alliance (RDA), the Earth Science Information Partners (ESIP), DataCite, and ORCID to build awareness of the elements required for data creators to receive credit and automated attribution for their data contributions, and of the tools that make it easier to observe usage. Credit for data is an important incentive for researchers to make their data reusable, and when data are FAIR and cited, their related publications gain higher visibility. We shall discuss various ways in which we are working across the science-data interface, in our multi-country and multi-disciplinary working environment, to improve data (and code) reuse through better management and crediting. Make your Data FAIR, Cite your Data, Get Credit, Increase Reuse, and reap the rewards!
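For example, DataCite's public REST API exposes citation and usage metrics alongside a dataset's metadata record, which is one way data usage can be observed programmatically. The Python sketch below is a minimal illustration only: the DOI is a placeholder, and it assumes the `citationCount`, `viewCount`, and `downloadCount` attributes are populated for the record in question.

```python
# Minimal sketch: look up citation and usage counts for a dataset DOI
# via the DataCite REST API. The DOI below is a placeholder, and the
# count attributes are assumed to be populated for this record.
import requests

DOI = "10.5066/example"  # placeholder, not a real dataset

resp = requests.get(f"https://api.datacite.org/dois/{DOI}")
resp.raise_for_status()
attrs = resp.json()["data"]["attributes"]  # JSON:API envelope

print("Title:    ", attrs["titles"][0]["title"])
print("Citations:", attrs.get("citationCount", 0))
print("Views:    ", attrs.get("viewCount", 0))
print("Downloads:", attrs.get("downloadCount", 0))
```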

Margaret O'Brien

and 3 more

Essential Biodiversity Variables (EBVs) are state variables that lie between primary measurements and high-level indicators, and they are necessary for assessing the health and prognosis of Earth’s biosphere. EBVs represent the complete spectrum of biological diversity, from genes to ecosystems, and so are based on observations that are themselves highly diverse and typically human-collected or human-analyzed. What is now sorely needed is a set of structured dictionaries of biological measurements that data collectors, curators, and nascent biodiversity programs can reference at all stages of planning and data organization. Similarly, analysts working with data defined according to these measurement dictionaries require assurance that their results are comparable across scales and institutions. Full understanding of primary measurements ideally requires machine-readable, interpretable, and interoperable descriptions of the measurement contents, collection methods, data typing, dimensions and associated units for physical quantities, and specification of appropriate temporal and spatial scales, plus the relationships among those attributes and facets of the ecosystem. Formal ontologies, i.e., vocabularies built using modern Semantic Web technologies, now provide the ideal tools and protocols for structuring and operationalizing EBV primary measurements. Here we illustrate an approach that applies these to existing data sets (both primary and harmonized intermediates) using community-accepted measurement ontologies under development. Such techniques can streamline the discovery and integration of observations, assist with the calibration/validation checks required for automated or remote data collection, and enable rigorous structured definitions for modeled or remotely sensed EBVs as these are developed.
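To make the idea concrete, the sketch below uses the Python rdflib library to annotate a single primary measurement (a species abundance count) with a measurement type, taxon, value, unit, and collection protocol. All namespaces and term names are illustrative placeholders standing in for the community measurement ontologies under development, not terms from any published vocabulary.

```python
# Illustrative sketch: a machine-readable description of one primary
# biodiversity measurement, using placeholder ontology terms.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/data/")           # placeholder dataset namespace
MEAS = Namespace("http://example.org/measurement#")  # placeholder measurement ontology
UNIT = Namespace("http://example.org/units#")        # placeholder unit vocabulary

g = Graph()
g.bind("ex", EX)
g.bind("meas", MEAS)
g.bind("unit", UNIT)

obs = EX["obs-0001"]
g.add((obs, RDF.type, MEAS.AbundanceMeasurement))               # kind of measurement
g.add((obs, MEAS.ofTaxon, EX["Quercus_agrifolia"]))             # entity observed
g.add((obs, MEAS.hasValue, Literal(42, datatype=XSD.integer)))  # typed value
g.add((obs, MEAS.hasUnit, UNIT.individualsPerPlot))             # explicit unit
g.add((obs, MEAS.usedProtocol, EX["plot-survey-v2"]))           # collection method

print(g.serialize(format="turtle"))
```

Once measurements carry annotations like these, comparability checks across datasets reduce to queries over shared ontology terms rather than manual inspection of column headers.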

Margaret O'Brien

and 2 more

Data repositories and research networks worldwide are publishing a diverse array of long-term and experimental data for meaningful reuse, repurposing, and integration. However, in synthesis research the largest time investment is still in discovering, cleaning, and combining primary datasets until all are completely understood and converted to a usable format. To accelerate this process, we have developed an approach for defining flexible, domain-specific data models and converting primary data to those models using a lightweight, distributed workflow framework. The approach is based on extensive experience with synthesis research workflows; it takes into account the distributed nature of original data curation, satisfies the requirement for regular additions to the original data, and is not determined by a single synthesis research question. Furthermore, all data describing the sampling context are preserved, and the harmonization may be performed by data scientists who are not specialists in each specific research domain. Our harmonization process has three phases. First, a Design Phase captures essential attributes and considers existing standardization efforts and external vocabularies that disambiguate meaning. Second, an Implementation Phase publishes the data model and best-practice guides for reference, followed by conversion of relevant repository contents by data managers and creation of software for data discovery and exploration. Third, a Maintenance Phase implements programmatic workflows that run automatically, via event notification services, whenever parent data are revisioned. In this presentation we demonstrate the harmonization process for ecological community survey data and highlight the unique challenges and lessons learned. Additionally, we demonstrate the maintenance workflow and the data exploration and aggregation tools that plug in to this data model.
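The maintenance workflow can be pictured as a small event-driven loop: watch the repository's event notification service for revisions to parent data packages, then re-run the corresponding conversion and republish the derived product. The Python sketch below illustrates the pattern only; the endpoint URL, event fields, and `harmonize` function are hypothetical placeholders, not a specific repository's API.

```python
# Illustrative sketch of a revision-triggered harmonization loop.
# All endpoints, event fields, and the harmonize() body are placeholders.
import time
import requests

EVENTS_URL = "https://repository.example.org/events"  # hypothetical endpoint

def harmonize(package_id: str) -> str:
    """Stand-in for the dataset-specific script that converts a primary
    data package into the shared community-survey data model."""
    print(f"Converting {package_id} to the harmonized model ...")
    return f"{package_id}-harmonized"

def poll_for_revisions(interval_s: int = 3600) -> None:
    """Poll the (hypothetical) event service; re-harmonize on each new revision."""
    seen = set()
    while True:
        events = requests.get(EVENTS_URL, params={"type": "revision"}).json()
        for ev in events:
            key = (ev["package_id"], ev["revision"])
            if key not in seen:
                seen.add(key)
                derived = harmonize(ev["package_id"])
                print(f"Republished {derived} for revision {ev['revision']}")
        time.sleep(interval_s)

if __name__ == "__main__":
    poll_for_revisions()
```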