Charles H. Ward edited To Archive or Not to Archive .tex  about 10 years ago

Commit id: 498829c6b4cb65d04710e8c76d5a4bbf7cb77553

\item Is the dataset reproducible at all, or does it stem from a unique event or experiment?
\end{itemize}

Data itself can come in a variety of ``processed'' levels, including ``raw'', ``cleaned'', and ``analyzed''. Such characterizations are subjective, though some disciplines have adopted quite rigorous definitions. Nonetheless, given the diversity of materials data, care will need to be taken in determining the appropriate amount of processing to perform on a dataset before it is archived. At this stage of our digital maturity, it is probably far more important that the metadata accompanying the dataset provide sufficient pedigree and provenance to make the data useful to others, including a definition of any post-experiment or post-computation processing performed.

Another factor to consider in setting guidelines for which data need to be archived is the expected annual and continuing storage capacity required. A very informal survey of 15 peer-reviewed journal article authors at NIST and AFRL found that most articles in the survey had less than 2 GB of supporting data per paper. Currently, the time and resources required for authors to upload and for users to download data files of less than 2 GB are quite reasonable. However, those papers reporting on emerging characterization techniques, such as 3-D serial sectioning and high-energy diffraction microscopy, depended on considerably larger datasets, approximately 500 GB per paper. Other disciplines have established data repositories to support their technical journals.
Experience to date indicates that datasets of up to approximately 10 GB can be efficiently and cost-effectively curated.\cite{tvision} Repositories such as www.datadryad.org show that datasets of this magnitude can be stored indefinitely at a cost of \$80 or less.\cite{datadryad} However, datasets approaching 500 GB will very likely require a different strategy for storage and access. Thus, a data repository strategy needs to account for this bimodal distribution of dataset sizes. An additional factor when considering storage requirements is the high global rate of growth in materials science and engineering publications. Figure 3 shows the dramatic growth in the number of MSE journal articles published over the past two decades, indicating a commensurate amount of accompanying data.
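The storage implications of this bimodal distribution can be sketched with a back-of-envelope calculation. The sketch below assumes hypothetical publication counts and a hypothetical fraction of large-dataset papers; only the per-paper sizes (under 2 GB typical, roughly 500 GB for emerging techniques) come from the survey discussed above.

```python
# Back-of-envelope estimate of annual repository storage growth under the
# bimodal dataset-size distribution described in the text. The publication
# counts used in the example call are illustrative placeholders, not survey
# results.

def annual_storage_gb(n_papers, large_fraction, typical_gb=2, large_gb=500):
    """Estimate total storage (GB) for one year of publications.

    n_papers       -- hypothetical number of data-bearing papers per year
    large_fraction -- hypothetical fraction relying on large datasets
                      (e.g., 3-D serial sectioning, high-energy
                      diffraction microscopy)
    typical_gb     -- supporting data per typical paper (survey: < 2 GB)
    large_gb       -- supporting data per large-dataset paper (~500 GB)
    """
    n_large = n_papers * large_fraction
    n_typical = n_papers - n_large
    return n_typical * typical_gb + n_large * large_gb

# Example: 10,000 papers/year with 1% large-dataset papers (both assumed).
total = annual_storage_gb(10_000, 0.01)
print(f"{total:,.0f} GB/year")  # prints "69,800 GB/year"
```

Even with only 1\% of papers in the large-dataset mode, those few papers contribute 50{,}000 of the 69{,}800 GB in this illustration, which is why a single storage strategy sized for sub-2 GB uploads cannot simply be scaled up to cover both modes.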