\item Is the dataset reproducible at all, or does it stem from a unique event or experiment?
\end{itemize}

Data itself can come at a variety of ``processed'' levels, including ``raw'', ``cleaned'', and ``analyzed''. Such characterizations are subjective, though some disciplines have adopted quite rigorous definitions. Nonetheless, given the diversity of materials data, care will need to be taken in determining the appropriate amount of processing performed on a dataset to be archived. While raw or cleaned data is much preferred for its relative simplicity in reuse, at this stage of our digital maturity it is probably far more important that the metadata accompanying the dataset provide sufficient pedigree and provenance to make the data useful to others, including a definition of the post-acquisition (experimental or computational) processing performed; a minimal sketch of such a record is given at the end of this section.

Another factor to consider in setting guidelines for which data need to be archived is the expected annual and continuing storage capacity required. A very informal survey of 15 peer-reviewed journal article authors at NIST and AFRL found that most articles in the survey had less than 2 GB of supporting data per paper. Currently, the time and resources required to upload (by authors) and download (by users) data files smaller than 2 GB are quite reasonable. However, the papers reporting on emerging characterization techniques, such as 3-D serial sectioning and high energy diffraction microscopy, depended on considerably larger datasets of approximately 500 GB per paper. Other disciplines have established data repositories to support their technical journals. Experience to date indicates that datasets of up to approximately 10 GB can be efficiently and cost-effectively curated.\cite{tvision} Repositories such as www.datadryad.org show that datasets of this magnitude can be stored indefinitely at a cost of \$80 or less.\cite{datadryad} However, datasets approaching 500 GB will very likely require a different approach to storage and access. Thus a data repository strategy needs to accommodate this wide range of dataset sizes; a back-of-envelope estimate also follows at the end of this section. An additional factor when considering long-term storage requirements is the high global rate of growth in materials science and engineering publications. Figure 3 shows the dramatic growth in the number of MSE journal articles published over the past two decades, implying a commensurate growth in accompanying data.
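As a concrete illustration of the pedigree and provenance metadata discussed above, the following sketch shows a minimal machine-readable record capturing a dataset's processing level and the post-acquisition steps applied to it. This is an illustrative sketch only; the field names and the three-level vocabulary are assumptions made for this example, not an established materials-data schema.

\begin{verbatim}
from dataclasses import dataclass, field
from typing import List

# Hypothetical processing levels; real repositories may adopt
# stricter, discipline-specific definitions.
PROCESSING_LEVELS = ("raw", "cleaned", "analyzed")

@dataclass
class ProvenanceRecord:
    """Minimal provenance metadata for an archived dataset
    (illustrative only)."""
    dataset_id: str
    source: str            # instrument or code that produced the data
    processing_level: str  # one of PROCESSING_LEVELS
    post_acquisition_steps: List[str] = field(default_factory=list)

    def __post_init__(self):
        if self.processing_level not in PROCESSING_LEVELS:
            raise ValueError(
                f"unknown processing level: {self.processing_level}")

# Example: a serial-sectioning dataset cleaned after acquisition.
record = ProvenanceRecord(
    dataset_id="example-serial-section-042",
    source="FIB-SEM serial sectioning",
    processing_level="cleaned",
    post_acquisition_steps=["image alignment", "noise filtering"],
)
\end{verbatim}

Even a record this small answers the two questions a prospective reuser needs most: how processed is the data, and what was done to it after acquisition.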
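The survey figures above also lend themselves to a back-of-envelope capacity estimate. The sketch below combines the roughly 2 GB typical and roughly 500 GB large dataset sizes reported in the survey with an assumed annual paper count and an assumed fraction of large-dataset papers; those last two numbers are hypothetical placeholders, since the survey does not report them.

\begin{verbatim}
# Back-of-envelope annual storage estimate for a journal data
# repository. The 2 GB and 500 GB figures come from the informal
# survey above; papers_per_year and large_fraction are assumptions.
TYPICAL_GB = 2    # typical supporting data per paper (survey)
LARGE_GB = 500    # e.g., serial sectioning or HEDM papers (survey)

papers_per_year = 1000  # assumed journal volume (placeholder)
large_fraction = 0.02   # assumed share of ~500 GB papers (placeholder)

annual_gb = papers_per_year * (
    (1 - large_fraction) * TYPICAL_GB + large_fraction * LARGE_GB
)
print(f"Estimated annual storage: {annual_gb / 1024:.1f} TB")
# With these assumptions: 1000 * (0.98*2 + 0.02*500)
# = 11,960 GB, or roughly 11.7 TB per year.
\end{verbatim}

Under these assumptions, the 2\% of papers with very large datasets account for over 80\% of the total volume, reinforcing the point that such datasets will likely require a different storage and access approach.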