Charles H. Ward edited To Archive or Not to Archive .tex  about 10 years ago

Commit id: bf1811b6e88ec4421fd6b6a93c6ec2a26bb7e501


Data can come at a variety of processing levels, including "raw", "cleaned", and "analyzed". Such characterizations are subjective, though some disciplines have adopted quite rigorous definitions. Given the diversity of materials data, care will need to be taken in determining the appropriate amount of processing performed on a dataset to be archived. While raw or cleaned data is generally preferred for its relative simplicity in reuse, at this stage of our digital maturity it is probably more important that the metadata accompanying the dataset provide sufficient pedigree and provenance to make the data useful to others, including a definition of the processing performed after acquisition (by experiment or computation).

Another factor to consider in setting guidelines for which data need to be archived is the expected annual and continuing storage capacity required. A very informal survey of 15 authors of peer-reviewed journal articles at NIST and AFRL found that most articles in the survey had less than 2 GB of supporting data per paper. Currently, the time and resources required to upload (by authors) and download (by users) data files smaller than 2 GB are quite reasonable. However, papers reporting on emerging characterization techniques such as 3-D serial sectioning and high energy diffraction microscopy depended on considerably larger datasets, approximately 500 GB per paper.

Other disciplines have established data repositories to support their technical journals. Experience to date indicates that datasets of up to approximately 10 GB can be efficiently and cost-effectively curated.\cite{tvision} Repositories such as www.datadryad.org show that datasets of this magnitude can be stored indefinitely at a cost of \$80 or less.\cite{datadryad} However, datasets approaching 500 GB will very likely require a different approach to storage and access. Thus a data repository strategy needs to accommodate this range of dataset sizes.
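
As a rough illustration of why this range matters, suppose a repository receives $N$ papers per year, of which a fraction $p$ carry datasets of about 500 GB and the remainder carry about 2 GB each (taking 2 GB as an upper figure for the smaller datasets). Both $N$ and $p$ are hypothetical parameters introduced here for illustration; they are not results of the survey. The storage added each year, $S$, is then approximately
\[
S \approx N \left[ (1-p) \cdot 2~\mathrm{GB} + p \cdot 500~\mathrm{GB} \right].
\]
With the assumed values $N = 1000$ and $p = 0.05$, this gives $S \approx 1000 \times (1.9 + 25)~\mathrm{GB} = 26.9$ TB per year (using 1 TB = 1000 GB), of which 25 TB, or about 93\%, comes from the 5\% of papers with large datasets. Under these assumptions, the few very large datasets, rather than the many small ones, would dominate capacity planning.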
An additional factor when considering long-term storage requirements is the high global rate of growth in materials science and engineering (MSE) publications. Figure \ref{fig:GROWTH} shows the dramatic growth in the number of MSE journal articles published over the past two decades, which implies a commensurate growth in the volume of accompanying data.