To Archive or Not to Archive
The most critical question to be answered in setting policies for publications is “what data should be archived?” The answer is essential in providing clear expectations for authors, editors, and reviewers, as well as in determining the size of the data repositories needed. Other disciplines have already embarked on this journey and have devised a variety of approaches that suit the data needs of their communities at their stage of “digital maturity.” Two ends of the spectrum in addressing this question are presented here. The first assumes that all data supporting a publication are worthy of archiving. This criterion is found most often in peer-reviewed journals that have narrow technical scope and generally deal with very limited data types. For example, journals in crystallography and fluid thermodynamics have very stringent data archiving policies that prescribe formats and specific repositories for the data submitted. Other journals that cover broader technical scope, and therefore deal with more heterogeneous data, have implemented more subjective criteria for data archiving and a distributed repository philosophy. Earth sciences and evolutionary biology have typically taken this approach. The approach adopted by MSE publications will likely span a similar spectrum, depending on the scope of the publication.

The MRS-TMS “Big Data” survey provided insight into the community’s perspective on the relative value of access to various types of materials data (Figure 2). It is interesting to note that, generally, as the complexity of the data and metadata increases, the community’s perceived need for access to those data decreases. This could be due to many factors, including the difficulty of assuring the quality of such data as well as a lack of familiarity with tools for handling the complexity. However, with complexity comes a richness of information that, if properly tapped, could be extraordinarily valuable.

For those publications with wide technical scope, it will be difficult to provide a universal answer to “what data should be archived?” In these cases, the decision about which data to archive may best be left to the judgment of the authors, peer reviewers, and editors. A particularly useful metric might be the cost and effort required to produce the data. For example, the “exquisite” experimental data associated with a high energy diffraction microscopy experiment constitute unique, expensive, and rich datasets with great potential use to other researchers; based on these factors, such a dataset clearly should be archived. On the other hand, the results from a model run on commercial software that takes five minutes of desktop computation time would likely not be worth archiving, as long as the input data, boundary conditions, and software version are well defined in the manuscript. However, even the data from a simple tensile test may be worth archiving, as publications do not typically provide the entire curve; while the paper may report only yield strength, another researcher may be interested in work-hardening behavior. Having the complete dataset in hand allows another researcher to explore alternative facets of the material’s behavior.
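As a rough illustration of how such a cost-and-effort heuristic might be applied, the following sketch compares an estimated cost to reproduce a dataset against an estimated cost to curate and archive it. The threshold and the example cost figures are assumptions chosen for illustration only; they are not values drawn from this report.

\begin{verbatim}
def should_archive(cost_to_reproduce_usd, cost_to_archive_usd,
                   ratio_threshold=10.0):
    """Suggest archiving when reproducing the data would cost substantially
    more than curating and archiving it; the threshold is an assumed policy
    choice, not a value from this report."""
    return cost_to_reproduce_usd >= ratio_threshold * cost_to_archive_usd

# Hypothetical, illustrative costs (not survey figures): an expensive
# synchrotron-based HEDM dataset versus a five-minute desktop model run
# that is easy to regenerate from well-documented inputs.
print(should_archive(cost_to_reproduce_usd=50000, cost_to_archive_usd=500))  # True
print(should_archive(cost_to_reproduce_usd=5, cost_to_archive_usd=80))       # False
\end{verbatim}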
The basic criteria for determining which data should be archived could include whether the data:
\begin{itemize}
  \item Are central to the main scientific conclusions of the paper
  \item Are likely to be usable by other scientists working in the field
  \item Are described with sufficient pedigree and provenance that other scientists can reuse them
  \item Cost substantively more to reproduce than to archive as a fully curated dataset
\end{itemize}

Data can come at a variety of ‘processed’ levels, including ‘raw’, ‘cleaned’, and ‘analyzed’. Such characterizations are, of course, subjective. Nonetheless, given the diversity of materials data, care will need to be taken in determining the appropriate amount of processing to perform on a dataset before it is archived. At this stage of our digital maturity, it is probably far more important that the metadata accompanying the dataset provide sufficient pedigree and provenance to make the data useful to others, by defining the post-test processing performed.

Another factor to consider in setting guidelines for which data need to be archived is the annual and continuing storage capacity that will be required. A very informal survey of 15 peer-reviewed journal article authors at NIST and AFRL found that most articles in the survey had less than 2 GB of supporting data per paper. However, those papers reporting on emerging characterization techniques, such as 3-D serial sectioning and high energy diffraction microscopy, depended on considerably larger datasets of approximately 500 GB per paper. The time and resources required to upload (by authors) and download (by users) data files smaller than 2 GB are reasonable. Other disciplines have established data repositories to support their technical journals, and experience to date indicates that datasets of up to approximately 10 GB can be curated efficiently and cost effectively. Repositories such as www.datadryad.org show that datasets of this magnitude can be stored indefinitely at a cost of \$80 or less. However, datasets approaching 500 GB will very likely require a different strategy for storage and access. Thus, a data repository strategy needs to account for this bimodal distribution of dataset sizes. An additional factor when considering storage requirements is the high global rate of growth in materials science and engineering publications; Figure 3 shows the dramatic growth in the number of MSE journal articles published over the past two decades.
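To make the storage implications concrete, the following back-of-envelope sketch estimates annual repository growth under the bimodal distribution described above. Only the roughly 2 GB and 500 GB per-paper figures come from the informal survey; the paper count and the fraction of papers carrying very large datasets are assumed placeholders.

\begin{verbatim}
# Back-of-envelope estimate of annual repository growth (sizes in GB).
papers_per_year = 10000          # assumed number of MSE papers archiving data
large_dataset_fraction = 0.02    # assumed share using techniques like HEDM

typical_gb = 2.0                 # most papers: < 2 GB of supporting data
large_gb = 500.0                 # emerging techniques: ~500 GB per paper

annual_growth_gb = papers_per_year * (
    (1 - large_dataset_fraction) * typical_gb
    + large_dataset_fraction * large_gb
)
print(f"Estimated annual growth: {annual_growth_gb / 1000:.1f} TB")
# With these assumptions, the small minority of very large datasets dominates
# the total, which is why they may warrant a separate storage strategy.
\end{verbatim}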