\section{To Archive or Not to Archive?}

The most critical question to be answered in setting policies for publications is “what data should be archived?” The answer is essential in providing clear expectations for authors, editors, and reviewers, as well as in determining the size of the data repositories needed. Other disciplines have already embarked on this journey and have devised a variety of approaches that suit the data needs of their communities at their stage of “digital maturity.” Two ends of the spectrum in addressing this question are presented here. The first assumes that all data supporting a publication are worthy of archiving. This criterion is found most often in peer-reviewed journals that have a narrow technical scope and generally deal with very limited data types. For example, journals in crystallography and fluid thermodynamics have very stringent data archiving policies that prescribe formats and specific repositories for the data submitted.\cite{actacryst,Koga_2013} Journals that cover a broader technical scope, and therefore deal with more heterogeneous data, have implemented more subjective criteria for data archiving and a distributed repository philosophy; earth sciences and evolutionary biology have typically taken this approach. The approaches adopted by MSE publications will likely span a similar spectrum, depending on the scope of each publication.

The MRS-TMS “Big Data” survey provided insight into the community’s perspective on the relative value of access to various types of materials data, shown in Figure \ref{fig:COMPLEX}. It is interesting to note that as the complexity of the data and metadata (generally) increases toward the right-hand side of the chart, the community’s perceived need for access to those data decreases. This could be due to many factors, including the difficulty of assuring the quality of such data as well as unfamiliarity with the tools needed to handle the data complexity. However, with complexity comes a richness of information that, if properly tapped, could be extraordinarily valuable. In astronomy, for example, the Sloan Digital Sky Survey (SDSS) created a very complex database of attributes of stars, galaxies, and quasars. The wealth of information and immense discovery potential led many in the research community to become expert users of SQL and has enabled the survey to yield nearly 6,000 peer-reviewed publications.\footnote{This count is based on a query to the Astrophysics Data System, http://adsabs.harvard.edu/, for peer-reviewed papers mentioning either “SDSS” or “Sloan” in the title or abstract of the paper. A query executed on April 9, 2014 returned 5,825 papers.}
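As a purely illustrative sketch of the kind of attribute-based selection that survey users express in SQL, the following self-contained Python snippet builds a toy in-memory catalog and queries it for quasar candidates. The table name, column names, and values are invented for this example and do not reflect the actual SDSS schema.

\begin{verbatim}
import sqlite3

# A tiny in-memory stand-in for a sky-survey catalog.  The table and
# column names are illustrative only; they are not the real SDSS schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE photo_objects (obj_id INTEGER, obj_class TEXT, "
    "redshift REAL, r_magnitude REAL)"
)
conn.executemany(
    "INSERT INTO photo_objects VALUES (?, ?, ?, ?)",
    [
        (1, "GALAXY", 0.12, 17.8),
        (2, "QSO", 2.31, 19.4),
        (3, "STAR", 0.00, 15.2),
    ],
)

# The sort of selection routinely expressed in SQL by survey users:
# "all quasar candidates brighter than magnitude 20 with redshift above 2".
query = (
    "SELECT obj_id, redshift, r_magnitude FROM photo_objects "
    "WHERE obj_class = 'QSO' AND r_magnitude < 20 AND redshift > 2 "
    "ORDER BY redshift DESC"
)
for row in conn.execute(query):
    print(row)
\end{verbatim}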

For those publications with wide technical scope, it will be difficult to provide a universal answer to “what data should be archived?” In these cases, the decision may best be left to the judgment of the authors, peer reviewers, and editors. A particularly useful metric might be the cost and effort required to produce the data. For example, the “exquisite” experimental data associated with a high energy diffraction microscopy experiment constitute unique, expensive, and rich datasets with great potential use to other researchers; based on these factors, such datasets clearly should be archived. On the other hand, the results from a model run on commercial software that takes five minutes of desktop computation time may not be worth archiving, provided the input data, boundary conditions, and software version are well defined in the manuscript. Of course, one must account for the perishable nature of code, particularly old versions of commercial code. However, even the data from a simple tensile test may be worth archiving, because publications do not typically provide the entire curve; while a paper may report only yield strength, another researcher may be interested in work-hardening behavior. Having the complete dataset in hand allows another researcher to explore alternative facets of the material’s behavior. The basic criteria for determining which data should be archived could include the following (a minimal illustration of how such criteria might be combined is sketched after the list):

\begin{itemize}
  \item Are the data central to the main scientific conclusions of the paper?
  \item Are the data likely to be usable by other scientists working in the field?
  \item Are the data described with sufficient pedigree and provenance that other scientists can reuse them in their proper context?
  \item Is the cost of reproducing the dataset substantially larger than the cost of archiving the fully curated dataset?
  \item Is the dataset reproducible at all, or does it stem from a unique event or experiment?
\end{itemize}
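As a minimal sketch of how these criteria might be combined, consider the following Python checklist. The field names, costs, and decision rule are illustrative assumptions for the example, not a proposed policy.

\begin{verbatim}
from dataclasses import dataclass

@dataclass
class DatasetAssessment:
    """Answers to the archiving criteria above (all fields illustrative)."""
    central_to_conclusions: bool   # central to the paper's main conclusions?
    usable_by_others: bool         # likely usable by others in the field?
    pedigree_documented: bool      # pedigree/provenance sufficient for reuse?
    reproduction_cost_usd: float   # estimated cost to reproduce the dataset
    curation_cost_usd: float       # estimated cost to archive the curated set
    reproducible: bool             # can the dataset be reproduced at all?

def should_archive(a: DatasetAssessment) -> bool:
    """One possible combination of the criteria: archive if the data are
    irreproducible, or if they are central, reusable, well documented, and
    cheaper to archive than to regenerate."""
    if not a.reproducible:
        return True  # unique events or experiments cannot be rerun
    return (
        a.central_to_conclusions
        and a.usable_by_others
        and a.pedigree_documented
        and a.reproduction_cost_usd > a.curation_cost_usd
    )

# Example: an expensive, well-documented diffraction-microscopy dataset.
hedm = DatasetAssessment(True, True, True, 250_000.0, 5_000.0, True)
print(should_archive(hedm))  # True
\end{verbatim}

In this sketch, irreproducibility alone is sufficient reason to archive; the remaining criteria are weighed against the relative costs of regenerating versus curating the dataset.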

Data can come at a variety of processing levels, including “raw”, “cleaned”, and “analyzed”. Such characterizations are subjective, though some disciplines have adopted quite rigorous definitions. Nonetheless, given the diversity of materials data, care will need to be taken in determining the appropriate amount of processing to perform on a dataset before it is archived. While raw or cleaned data are much preferred for their relative simplicity of reuse, it is probably more important at this stage of our digital maturity that the metadata accompanying a dataset provide sufficient pedigree and provenance to make the data useful to others, including a definition of the post-acquisition (experiment or computation) processing performed.
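The sketch below illustrates what such pedigree and provenance metadata might look like for a simple tensile-test dataset. The field names and values are assumptions chosen for the example and do not represent an established MSE metadata schema.

\begin{verbatim}
# A minimal, purely illustrative metadata record for an archived dataset.
# The field names are assumptions for the sake of the example; they do not
# represent an agreed MSE metadata standard.
metadata = {
    "title": "Tensile test of Ti-6Al-4V plate, longitudinal orientation",
    "authors": ["A. Researcher", "B. Colleague"],
    "related_publication_doi": "10.xxxx/example-doi",  # placeholder value
    "data_level": "cleaned",            # raw / cleaned / analyzed
    "instrument": "screw-driven load frame, 100 kN load cell",
    "acquisition_date": "2014-03-15",
    "processing_history": [
        "converted load-displacement to engineering stress-strain",
        "removed toe region below 0.02% strain",
    ],
    "file_format": "CSV, one header row, SI units (MPa, mm/mm)",
    "license": "CC-BY-4.0",
}

# Pedigree and provenance travel with the numbers, so a later user can judge
# whether the data fit their intended reuse.
for key, value in metadata.items():
    print(f"{key}: {value}")
\end{verbatim}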

Another factor to consider in setting guidelines for which data need to be archived is the expected annual and continuing storage capacity required. A very informal survey of 15 peer-reviewed journal article authors at NIST and AFRL found that most articles in the survey had less than 2 GB of supporting data per paper. Currently, the time and resources required to upload (by authors) and download (by users) data files smaller than 2 GB are quite reasonable. However, papers reporting on emerging characterization techniques such as 3-D serial sectioning and high energy diffraction microscopy depended on considerably larger datasets, approximately 500 GB per paper. Other disciplines have established data repositories to support their technical journals. Experience to date indicates that datasets of up to approximately 10 GB can be efficiently and cost-effectively curated.\cite{tvision} Repositories such as www.datadryad.org show that datasets of this magnitude can be stored indefinitely at a cost of \$80 or less.\cite{datadryad} However, datasets approaching 500 GB will very likely require a different approach for storage and access; thus, a data repository strategy needs to accommodate this wide distribution of dataset sizes. An additional factor when considering long-term storage requirements is the high global rate of growth in materials science and engineering publications. Figure \ref{fig:GROWTH} shows the dramatic growth in the number of MSE journal articles published over the past two decades, implying a commensurate growth in accompanying data.
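A rough, illustrative estimate of the annual storage these figures imply can be made as follows. The annual paper count and the fraction of papers carrying very large datasets are invented assumptions; the 2 GB and 500 GB figures come from the informal survey described above.

\begin{verbatim}
# Back-of-the-envelope storage estimate using the figures quoted above.
# The paper count and the fraction of "large dataset" papers are invented
# assumptions for illustration only.
papers_per_year = 50_000      # hypothetical annual MSE article count
typical_dataset_gb = 2        # most surveyed papers: < 2 GB of supporting data
large_dataset_gb = 500        # serial sectioning / HEDM-class datasets
large_fraction = 0.01         # assumed share of papers with ~500 GB datasets

typical_tb = papers_per_year * (1 - large_fraction) * typical_dataset_gb / 1024
large_tb = papers_per_year * large_fraction * large_dataset_gb / 1024

print(f"Typical datasets: {typical_tb:,.0f} TB/year")
print(f"Large datasets:   {large_tb:,.0f} TB/year")
# Even a small fraction of 500 GB datasets dominates the annual storage need.
\end{verbatim}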

