Data Repositories

Aside from crystallographic data repositories, there are at present perhaps no dedicated materials data repositories that meet the required characteristics defined above. The materials science and engineering community does have numerous publicly accessible data repositories; however, the majority of these are associated with specific projects or research groups, and their persistence therefore depends on individual funding decisions. These repositories are established primarily to house and share the research data generated within a specific project or program. They generally do not follow uniform standards for data and metadata, nor do they provide for data discoverability and citation. Very few repositories have been established with the explicit objective of giving MSE a public home for accessible digital data. In short, publicly accessible, built-for-purpose repositories and the associated infrastructure for access, safe storage, and management still need to be developed and sustainably funded. This is the largest impediment to implementing viable data archiving policies. (See, for example, “Sustaining Domain Repositories for Digital Data: A White Paper”.\cite{Ember_2013})

Evolutionary biology, for example, allows a mix of repositories that meet established criteria. Such criteria may be as simple as requiring that cited data be permanently archived in data repositories that meet the following conditions:

  1. Publicly accessible throughout the world

  2. Committed to archiving data sets indefinitely

  3. Allows bi-directional linking between paper and dataset

  4. Provides a persistent digital identifier
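
As an illustration only, the four conditions above could be encoded as a simple checklist that a journal or editor applies to a candidate repository. The sketch below is a minimal example under stated assumptions: the `RepositoryProfile` schema, its field names, and the helper functions are hypothetical and are not drawn from any existing policy or software.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RepositoryProfile:
    """Self-reported properties of a candidate repository (hypothetical schema)."""
    name: str
    publicly_accessible_worldwide: bool     # condition 1
    commits_to_indefinite_archiving: bool   # condition 2
    supports_bidirectional_linking: bool    # condition 3
    issues_persistent_identifiers: bool     # condition 4


# Field name paired with a human-readable label, in the order listed in the text.
CRITERIA = (
    ("publicly_accessible_worldwide", "publicly accessible throughout the world"),
    ("commits_to_indefinite_archiving", "committed to archiving data sets indefinitely"),
    ("supports_bidirectional_linking", "allows bi-directional linking between paper and dataset"),
    ("issues_persistent_identifiers", "provides a persistent digital identifier"),
)


def unmet_criteria(repo: RepositoryProfile) -> List[str]:
    """Return the labels of the conditions the repository does not satisfy."""
    return [label for field, label in CRITERIA if not getattr(repo, field)]


def is_acceptable(repo: RepositoryProfile) -> bool:
    """A repository qualifies only if all four conditions hold."""
    return not unmet_criteria(repo)


if __name__ == "__main__":
    # A made-up project-hosted archive, not a real repository.
    candidate = RepositoryProfile(
        name="hypothetical project archive",
        publicly_accessible_worldwide=True,
        commits_to_indefinite_archiving=False,
        supports_bidirectional_linking=True,
        issues_persistent_identifiers=False,
    )
    print(candidate.name, "acceptable:", is_acceptable(candidate))
    for label in unmet_criteria(candidate):
        print("  unmet:", label)
```

Reporting which conditions are unmet, rather than returning only a yes or no, reflects the situation described earlier: many project-hosted repositories are already publicly accessible but lack a long-term archiving commitment or persistent identifiers.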

One tempting option might be to take advantage of the online storage that several journals already offer for supplementary materials accompanying journal articles. However, as presently constructed, these facilities are not amenable to best practices for dataset storage: the files are generally not independently discoverable, searchable, or separately citable, nor are they aggregated in one location. In fact, some publishers are reducing or eliminating supplementary file storage because of the haphazard structure and rules associated with its use. Further, new government policies around the world promoting open access to research works have left the publishing industry in a state of flux with regard to its long-standing, subscription-based business model. Publishers have been extremely reluctant to take on a data archiving responsibility given the economic uncertainties in the publishing marketplace.\cite{discussion} There is also a risk that for-profit publishers might restrict access to digital data assets that are co-located with the journal.

As alluded to in the previous section, a fundamental consideration in repository design and/or selection is the degree to which the repository will present structured versus unstructured data. Structured technical databases tend to be more useful to a technical community due to their uniformity, as evidenced by their data reuse rate.\cite{acharya} Ideally, the vast majority of materials data would reside in structured repositories. A disciplined data structure provides enormous advantages to the researcher in terms of both data discoverability and confidence in using the data. However, this structure must be enabled by the application of broader and deeper standards for data and metadata, standards that do not currently exist.

In all likelihood, MSE publications will, like those in biology, depend on a collection of repositories tailored to specific types of materials data. For example, NIST is building and demonstrating a data file repository for CALPHAD and interatomic potentials.\cite{NISTMDR} This repository may be expandable and largely sufficient for thematic publications such as those devoted to thermodynamics and diffusion. However, repositories such as this will fill only a relatively small niche in MSE. The journal Integrating Materials and Manufacturing Innovation is piloting an effort to link articles with their supporting data using the NIST repository according to the criteria outlined above; an example can be found in an article by Shade et al.\cite{Shade_2013}\cite{Shade_data}

Finally, a business model for sustainably archiving materials data is required. Other technical fields, such as earth sciences, can at least partially rely on government-provided repositories for large and complex datasets. Without repositories of this type to build on, MSE will need to establish viable repository solutions of its own. In response to funding agency requirements for data management plans, some universities, Johns Hopkins for example, are beginning to provide centrally hosted data repositories, but these are not yet common.\cite{jhudata} Private fee-for-service repositories, such as LabArchives and figshare, are also evolving to meet growing demand for accessible data storage.\cite{labarchives,figshare} Additionally, ASM International is working to create a prototype materials data repository through its close association with Granta Design. Termed the Computational Materials Data Network (CMDN), this is an interesting option because the repository will provide a structured database specifically for materials data, but the business model for CMDN has not yet been solidified.\cite{cmdn} A key open question is how funding agencies will respond to the OSTP open research policy memo, and how they will fund activities that make data open to the public.