Data Quality

A key concern in linking datasets to publications is the provision of quality metrics, that is, can the data's ultimate reliability be assessed in a meaningful manner? Materials data come in two basic types, experimental and computational, and both rest on underlying models. For data and their associated models to be usable, their quality must be ascertained. In this context, it is useful to define the following for data and models:

  • Pedigree – Where did the information come from?

  • Provenance – How was the information generated (protocols and equipment)? This metadata should be sufficient to reproduce the provided data.

In addition to these qualitative descriptors of the data, there are any number of meaningful quantitative measures of the data's quality. In general, however, the following metrics form a strong basis for such an assessment:

  • Verification – (Applies to computational data only). How accurately does the computation solve the underlying equations of the model for the quantities of interest?

  • Validation – How well do computational (or, rarely, analytic) realizations of a model agree with experimental results?

  • Uncertainty – What is the quantitative level of confidence in our predictions?

  • Sensitivity – How sensitive are the results to changes in inputs or in assumed boundary conditions?
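The last two metrics lend themselves to simple numerical illustration. The sketch below, which assumes a purely hypothetical materials model (the functional form, input names, and all numerical values are illustrative, not taken from any real dataset), estimates uncertainty by Monte Carlo propagation of assumed input uncertainties and sensitivity by one-at-a-time finite differences:

```python
import random
import statistics

# Hypothetical model of a material property as a function of
# temperature T (K) and porosity p (fraction); illustrative only.
def model(T, p):
    return 100.0 / (1.0 + 0.004 * T) * (1.0 - 1.5 * p)

# Uncertainty: propagate assumed input uncertainties through the
# model via Monte Carlo sampling and report the output spread.
def monte_carlo_uncertainty(n=10000, seed=42):
    rng = random.Random(seed)
    samples = [
        model(rng.gauss(300.0, 5.0),   # T: mean 300 K, std 5 K (assumed)
              rng.gauss(0.05, 0.01))   # p: mean 5%, std 1% (assumed)
        for _ in range(n)
    ]
    return statistics.mean(samples), statistics.stdev(samples)

# Sensitivity: central finite differences, with each input perturbed
# by its assumed uncertainty so the two effects are comparable.
def sensitivity(T=300.0, p=0.05, dT=5.0, dp=0.01):
    s_T = (model(T + dT, p) - model(T - dT, p)) / 2.0
    s_p = (model(T, p + dp) - model(T, p - dp)) / 2.0
    return {"T": s_T, "p": s_p}

mean, std = monte_carlo_uncertainty()
print(f"prediction: {mean:.2f} +/- {std:.2f}")
print(sensitivity())
```

Reporting the output spread alongside the per-input sensitivities tells a downstream user both how confident the prediction is and which input uncertainty dominates it.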

Similar, and perhaps more difficult, problems pertain to simulation data. While such data may be perfectly precise in a numerical sense, simulations typically rely on many parameters, assumptions, and approximations. In principle, if the above are specified, and the quantitative metrics meet user requirements, the data can be used with a high level of confidence. A similar approach to defining data quality was recently proposed within the context of the Nanotechnology Knowledge Infrastructure Signature Initiative within the National Nanotechnology Initiative \cite{DRLs}.

A question often posed in the research community with regard to data associated with peer-reviewed journal articles is whether the data itself should be peer reviewed. Indeed, it has been reported that approximately 50% of data reviewed for submission to The American Mineralogist Crystal Structure Database contained errors \cite{downs}. The elements defined above represent the key criteria by which to judge the quality of the data. General pedigree and provenance information is typically conveyed in most research articles, though often in insufficient detail to reproduce the data. The remaining elements of verification, validation, uncertainty, and sensitivity are relatively loosely defined within materials science and engineering, and best practices have not generally been developed for each element, or, where developed, are not in widespread use.