Digital Object Identifiers for Astronomical Data Archives and Journals: Some Initial Principles Developed at MAST

Introduction

It is understood in the scientific community that being able to easily identify the data used in research allows scientists to either reproduce and verify the results reported, or build on the findings to produce new results. The Mikulski Archive for Space Telescopes (MAST) team at the Space Telescope Science Institute (STScI) began an exploration of Digital Object Identifiers (DOIs) as a way to directly attach MAST data to scientific results in the literature. A digital object identifier provides an "actionable, interoperable, persistent link" [1] that relies on a structure standardized and used globally since 2000. MAST staff identified the following issues they considered resolvable through the use of DOIs:

  • MAST data used in journal articles are referred to in an inconsistent fashion, making it at times difficult for later authors to reproduce or expand on the work.
  • Links to data provided in articles tend to decay (rot) over time (see Pepe et al. 2014).
  • It is difficult and time consuming for MAST/STScI staff to track precisely what data are being used for attribution and telescope bibliometrics.

A proposal was put forward and implemented for a service with which authors can find or generate Digital Object Identifiers that are associated with the data they analyzed in their publication. Early in the process STScI staff recognized that an alliance with an established journal would be critical to the success of the project. American Astronomical Society (AAS) Journals were an obvious partner, due to their large market share of MAST publications (~50%) and ongoing collaborations with MAST astronomers. MAST staff call this The DOI Project internally and the MARC project with our partners.

[1] https://www.doi.org/factsheets/DOIKeyFacts.html

Some Principles

In the process of putting together the tools and workflows for our users, we developed a number of shared principles between STScI and AAS Journals for the DOI/MARC project.

Incentives

Users are more likely to use a system that provides a clear benefit to them. To this end, we present the higher citation rates of articles with linked data (Henneken & Accomazzi, 2012) [2] and the fast rate of decay of standard URLs (Pepe 2014) to explain the benefit of the DOI/MARC system for users. Similarly, users are more likely to comply with requirements when presented with a well-built tool. This has led MAST to do extensive user testing with the DOI system. Lastly, astronomers are slow to take up new tools without encouragement. Thus in the DOI/MARC program AAS actively solicits DOIs from authors upon submission, rather than simply accepting MAST DOIs if provided.

Partnership

Our approach can only work if the correct parties are providing incentives and soliciting authors. While STScI can ask that MAST users follow various guidelines, and even use instrument and time allocation policies to influence authors, it is hard to get authors to comply with more detailed requirements. The journal publisher, who can directly set standards for publication, can also set expectations for data citation. To solve this problem, MAST formed a partnership with AAS Journals. The conditions of the partnership were that MAST would provide an easy, elegant way for users to point to the data they used in their papers if the journal publisher was willing to solicit users for a DOI identifying the MAST data they discuss in their paper. AAS does not deny publication to non-compliant users. AAS does not check if an article has MAST data -- the users are expected to self-report.

Minimal Integration

DOIs are passed between MAST and AAS simply by cut-and-paste into a webform. There is no handoff of data happening behind the scenes. This has the major benefit of easy federation, and no additional standards. The expectation is that in the future AAS Journals will work with other archives to integrate DOIs into articles; similarly MAST will partner with other journals. We could not see a motivation for a more complex system.
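Because the hand-off is a pasted string, the receiving form may want a light sanity check before storing it. The following is a minimal sketch of such a check, not the validation that MAST or AAS actually performs. The function names, the list of accepted prefixes, the regular expression, and the example DOI are all illustrative assumptions; the only facts relied on are that a DOI name starts with "10.", followed by a registrant code, a slash, and a suffix, and that https://doi.org/ resolves DOI names.

```python
import re

# Assumed, deliberately permissive pattern: "10." + registrant code + "/" + suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Assumed list of decorations an author might paste along with the DOI.
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(pasted: str) -> str:
    """Strip whitespace and a leading resolver/scheme prefix, then check the shape."""
    text = pasted.strip()
    lowered = text.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            text = text[len(prefix):]
            break
    if not DOI_PATTERN.match(text):
        raise ValueError(f"not a recognizable DOI: {pasted!r}")
    return text


def resolver_url(pasted: str) -> str:
    """Build the persistent link for a pasted DOI."""
    return "https://doi.org/" + normalize_doi(pasted)


if __name__ == "__main__":
    # "10.12345/example-1" is a made-up placeholder, not a real MAST DOI.
    print(resolver_url("  doi:10.12345/example-1 "))
```

A check like this only confirms that the string looks like a DOI; it does not confirm that the DOI exists or that it points to MAST data, which is consistent with the self-reporting approach described above.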

Fixed DOIs and Custom DOIs

In some situations, fixed existing DOIs are very useful. In many cases, specific large data sets are used as a whole in a manuscript. For example, MAST High Level Science Products (HLSPs) like CLASH, CANDELS, or GOODS [3] are often referred to in papers. Providing a consistent, single, persistent link to these data from manuscripts provides value to readers, authors, and librarians. Having a set of fixed DOIs for popular products available for author selection eliminates the problem of minting multiple data set DOIs to refer back to the same observations.

In contrast, an author may also need a mechanism for referring to sets of previously unrelated observations. The MAST Portal DOI tool [4], built by Weissman & Donaldson, allows users to concatenate observations and make a new DOI at will. Authors are given flexibility to create one DOI for all observations in an article or multiple DOIs for different sets of observations. For larger data sets (e.g. source catalogs, entire quarters of Kepler observations) MAST currently only provides a single DOI to the data set.

Data DOIs are Not First-Class Citeable Objects

While DOIs are often used for first-class citeable objects (e.g. journal articles), MAST and AAS are in agreement that data DOIs are not first-class citeable objects on their own, and thus do not show up in the bibliography. The logic behind this decision is that an author wishing to return to the data set for future study should cite the paper in which the data set was first identified and analyzed, along with the full context of the data use. In practice, citeable objects hold a specific meaning to astronomers and academics in general, and the idea of being able to mint an unlimited number of them does not work with the way we currently consider citations. Data DOIs are instead thought of as "permalinks", and are used much the same way that simple URLs are used to point to journal articles today.

DOIs Refer to the Described Data Set

Because the intention of the DOI is to describe the data set analyzed in the paper, we do not force the DOI to be forever fixed in the case where the author made some kind of mistake in DOI generation. Our protocol is for MAST to check with the AAS before changing the content of the DOI, but we all generally believe that it makes sense to allow users to edit their DOIs to match the content of their papers.

Cross-Archival Data

The question posed to authors, asking if they used MAST data, had to be revised a number of times because data for some telescopes such as XMM-OM, Swift UVOT, and Gaia can be found in archives other than MAST. Since the problem can be politically thorny, we came to the agreement that in the future users should select the archive where they retrieved the data, as they will be most familiar with the interfaces provided by that archive. It is not reasonable to expect users to know all the places where data from an instrument or survey are stored.

[2] https://ui.adsabs.harvard.edu/#abs/2012ASPC..461..763H/abstract

[3] http://archive.stsci.edu/hlsp/

[4] https://mast.stsci.edu/portal/Mashup/Clients/DOI/DOIPortal.html

Technical Process/Workflow

In an effort to encourage more observatories and journal publishers to adopt DOI implementation, we are outlining the simple workflow we established.

  1. Author begins paper submission on the EJ Press website to submit to one of the AAS titles: Astrophysical Journal, Astrophysical Journal Letters, or Astrophysical Journal Supplement. The EJ Press submission form asks whether data from the MAST archive were used in the publication. At this time, only submitting authors whose email domain ends in @stsci.edu are prompted. At the time of this article’s publication, STScI is beginning to expand the set of domains and invited institutions.

  2. Author specifies whether MAST data were used. "No" reroutes the author back to finish the paper submission process on EJ Press. "Yes" routes the author to an STScI DOI landing page [4].

  3. From the landing page, authors are asked if they used 1) a collection of specific, curated observations (custom DOI); 2) data from a High-Level Science Product; 3) a catalog, e.g., Kepler/GALEX; or 4) a large, clear sub-section of a catalog, e.g. a quarter of Kepler long cadence data. Authors who select options 2, 3, or 4 are prompted to select one or more existing, fixed DOIs from a list. Authors who select option 1 are directed to a custom version of the MAST portal that allows them to select the observations used in their research. As noted, authors have the liberty to mint a single DOI for all observations or several DOIs for different sets of data.

  4. When creating a custom DOI, the author submits basic metadata such as their name. Other metadata, such as date and data set IDs, are auto-assigned.

  5. Once the custom DOI(s) is created, the author is taken back to the EJ Press submission form where they are asked to cut and paste the DOI(s) and complete the paper submission. The author also receives an automated email with their information and a summary of the DOI metadata.

At this point, the author has completed their end of the process.
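The routing in the steps above can be summarized in a short sketch. This is an illustration of the decision logic only, not the code running on EJ Press or at MAST: the class names, field names, and step labels are hypothetical, and the metadata record simply mirrors the split described above between author-supplied fields (name) and auto-assigned fields (date, data set IDs).

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List, Optional


class DataKind(Enum):
    """The four choices offered on the DOI landing page (labels are hypothetical)."""
    CUSTOM_OBSERVATIONS = 1  # a collection of specific, curated observations
    HLSP = 2                 # a High-Level Science Product
    CATALOG = 3              # a whole catalog
    CATALOG_SUBSET = 4       # a large, clear sub-section of a catalog


@dataclass
class CustomDoiRequest:
    """Metadata for a custom DOI: the author supplies a name, the rest is auto-assigned."""
    author_name: str
    dataset_ids: List[str]
    created: date = field(default_factory=date.today)


def next_step(used_mast_data: bool, kind: Optional[DataKind] = None) -> str:
    """Return a label for where the author is sent next."""
    if not used_mast_data:
        return "return_to_submission"   # "No": finish the submission on EJ Press
    if kind is None:
        return "doi_landing_page"       # "Yes": go to the landing page to choose
    if kind is DataKind.CUSTOM_OBSERVATIONS:
        return "custom_doi_portal"      # option 1: select observations, mint a DOI
    return "select_fixed_doi"           # options 2-4: pick from the fixed DOI list


if __name__ == "__main__":
    print(next_step(True, DataKind.HLSP))
    print(CustomDoiRequest("A. Author", ["obs-1", "obs-2"]))
```

Whichever branch is taken, the author ends up back on the submission form with one or more DOI strings to paste, which is what keeps the integration between the two sites minimal.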

When the DOI concept for data identification was first discussed, there were conflicting ideas on when the DOI should be minted. Should users who are exploring the MAST portal for the first time and gathering their preliminary data sets be permitted to mint a DOI? Are there reasonable concerns that the author may have used a more limited set by the time they began their data analysis in earnest? Is asking the author to mint a DOI at the end of the research process wise? What if they did not keep meticulous track of the data used in the actual analysis? In the end, we decided to opt for an end-of-pipeline process, with the assumption that the authors kept records of the data set used in their analysis. Moreover, DOI minting is an at-cost service that your institution must arrange with an outside vendor, so we felt it wise to minimize DOI creation when not tied to an official publication. The Space Telescope Science Institute integrates the EZID service through the California Digital Library. Other well-known service providers include DataCite.

Another consideration in developing workflows between archives and publishers is the role of the journal publisher and/or observatory in reporting non-compliance. STScI and AAS came to an agreement that AAS staff would initially track instances of non-compliance so they could report back to STScI. STScI is then responsible for following up with authors to find out why they elected not to mint a custom DOI or select a fixed DOI, or were unable to do so. Tracking non-compliance allows you to contact authors who were eligible to mint a DOI but chose not to, and to find out what about the process prevented or deterred them from using the service. Was the initial question about data use unclear? Were instructions confusing? What technical challenges did the submitting author run into when trying to create a DOI(s)? Starting with your own institute allows you to use local staff as a test bed for the DOI service, but you should keep in mind that internal staff may be more savvy at navigating your data archive than the general community. This level of tracking is impossible for a journal to institute wholesale, and thus detailed tracking must wind down as the service is optimized.

It is important to be able to have these conversations with the researchers early on, but keep in mind that tracking these statistics can be onerous, and MAST took the approach that only a limited burden should be placed on the publishers.

[4] http://archive.stsci.edu/doi/search/

A workflow diagram for an author interacting with the eJournal Press AAS Journals submission page. Green indicates eJournal Press site functions, yellow indicates author actions on the eJournal Press site, and blue indicates actions on the MAST site.

Future Developments for STScI and AAS

STScI is considering the next steps and other potential uses for digital object identifiers. A pilot program to encourage authors to submit high-level science products (HLSPs) back to MAST based on data obtained and reprocessed is the next step. HLSPs are observations, catalogs, or models that complement, or are derived from, MAST-supported missions. They can include observations from other telescopes, or data that have been processed in a way that differs from what’s available in the archive. Authors will be able to mint an output DOI in addition to the input DOI described above in this document and link their unique high-level science product back to their publication. Data contributed as HLSPs at MAST will receive a permanent archive on the web and will be integrated with the other data at MAST, increasing discoverability.

STScI has also begun assessing the potential value of minting DOIs for existing data sets identified in the literature. There is a question of what the ultimate purpose would be of such a massive undertaking and whether the institute has the manpower to commit to retrospective data identification and DOI minting. There may be a place for citizen science if it is determined to be a priority.

Finally, STScI has investigated mechanisms for providing DOIs for large subsets of catalogs, replacing, for instance, laborious descriptions or publishing SQL queries.

While the above efforts are worthwhile, the major focus will continue to be facilitating output DOIs. It is STScI’s goal to make digital object identifiers for astronomical data archives and journals a standard in the field by the time the first scientific observations are reported from the James Webb Space Telescope.
