Merce Crosas edited discussion.tex over 10 years ago

Commit id: b92bb373e1e308dfffca8c127348619ac71be24c

The authors of this article are involved in a project, in conjunction with this study, that both educates astronomers on data management practices and provides a technical solution to these problems. The approach on which the project is built differs from that of the Virtual Observatory. Rather than attempting to build a framework and a related set of standards and protocols, we focused on implementing an easy-to-use tool that solves an immediate problem: the storage, citation, and discovery of secondary data in astronomy. In other words, we have found with this study that many astronomers today have derived data that ``does not fit'' in a scholarly paper. Where can they store these data upon publication with certainty that the data can be retrieved, cited, and discovered? The technical solution we developed uses the Dataverse Network (\url{http://thedata.org/}), a web application for sharing, citing, and archiving social science data (\cite{King2007}, \cite{Crosas2011}, \cite{Crosas2013}). The Dataverse Network is an open-source software application developed by the Institute for Quantitative Social Science (IQSS) at Harvard University (\cite{King2014}). The Dataverse software is a multi-tier Java Enterprise Edition (EE) application with an underlying open-source relational database (PostgreSQL) for application data (such as users, roles, and permissions) and metadata, and a file storage component for the actual data files. The application tiers include a user-interface layer, which employs the PrimeFaces suite of JavaServer Faces components; a business-logic layer, implemented with Java Beans and represented by the object model; and a persistence layer. The application runs in a GlassFish application server. This common Java EE architecture is robust and readily supports scalability and extensibility.
The Dataverse enables discoverability by searching across all descriptive metadata or cataloging fields, in addition to information extracted from data files. The metadata is indexed with Lucene (specifically, with the Solr server in Dataverse version 4). The metadata is also mapped to standard metadata schemas, such as Dublin Core and DDI (\url{http://www.ddialliance.org/}), and exported to XML for preservation purposes. All data files and any complementary files, such as code or documentation, are stored in a Network File System. A Dataverse Network consists of dataverses, each of which can be branded or customized for an individual researcher, group, project, or journal. A dataverse owner controls the branding, the metadata, and the sharing and release of the data, and can thus fully manage their own virtual data archive, while all data are stored in a centralized, public research data repository that guarantees proper archival and long-term access. The Dataverse Network follows good practices for scientific data publication: it 1) supports metadata standards and enables the inclusion of accompanying code and other materials for each dataset, 2) provides versioning of a dataset, with easy access to previous versions of the data and metadata, and 3) assigns a persistent identifier (DOI) and generates a full data citation, with attribution to data authors and distributors (\cite{AltmanKing2007}). The generated data citation follows the recently proposed principles for data citation, an international initiative which recognizes that \begin{quote}data should be considered legitimate, citable products of research\end{quote}(\url{http://www.force11.org/datacitationprinciples}).
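To make the third practice concrete, the following minimal sketch shows how a data citation might be assembled from a dataset's authors, year, title, persistent identifier, distributor, and version. It is an illustration only and not the Dataverse implementation: the \texttt{DatasetVersion} record, the \texttt{format\_citation} function, the field order, and the DOI shown are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    """Hypothetical record; field names are illustrative, not the Dataverse object model."""
    authors: tuple
    year: int
    title: str
    doi: str          # placeholder value below, not a registered DOI
    distributor: str
    version: int


def format_citation(ds):
    """Combine attribution, persistent identifier, and version into one string.

    The field order and punctuation are assumptions chosen for this sketch.
    """
    authors = "; ".join(ds.authors)
    return (
        f'{authors}, {ds.year}, "{ds.title}", '
        f"https://doi.org/{ds.doi}, {ds.distributor}, V{ds.version}"
    )


example = DatasetVersion(
    authors=("Doe, Jane", "Roe, Richard"),
    year=2014,
    title="Example derived catalog",
    doi="10.0000/EXAMPLE",
    distributor="Example Dataverse",
    version=2,
)
print(format_citation(example))
```

Including the version number in the citation string means that a reader who follows the identifier can tell which version of the data and metadata a paper actually used.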
Once a dataset is released for publication, it cannot be unreleased. This guarantees that the data citation and its persistent URL can always be resolved to a data page that includes sufficient information about the dataset and access to the data files. In some uncommon cases, a dataset might be deaccessioned because of a retraction or legal issue, but even then the persistent identifier in the data citation still resolves to a page with information about the missing dataset.
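The release rules described above can be read as a small state machine. The sketch below is one possible model, written under our own assumptions rather than taken from the Dataverse code base: the class \texttt{Dataset}, the \texttt{State} names, and the \texttt{resolve} method are hypothetical, as is the choice to treat an unreleased draft as not yet resolvable.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    RELEASED = "released"
    DEACCESSIONED = "deaccessioned"


# Allowed transitions. There is deliberately no path from RELEASED back to DRAFT.
ALLOWED = {
    State.DRAFT: {State.RELEASED},
    State.RELEASED: {State.DEACCESSIONED},
    State.DEACCESSIONED: set(),
}


class Dataset:
    """Hypothetical model of a dataset's publication life cycle."""

    def __init__(self, identifier, title):
        self.identifier = identifier
        self.title = title
        self.state = State.DRAFT
        self.reason = None

    def _move(self, target):
        # Reject any transition that the table above does not list.
        if target not in ALLOWED[self.state]:
            raise ValueError(
                f"cannot go from {self.state.value} to {target.value}"
            )
        self.state = target

    def release(self):
        self._move(State.RELEASED)

    def deaccession(self, reason):
        self._move(State.DEACCESSIONED)
        self.reason = reason

    def resolve(self):
        """Return what the persistent identifier resolves to."""
        if self.state is State.DRAFT:
            # Assumption for this sketch: drafts are not publicly resolvable.
            raise LookupError("identifier is not public before release")
        page = {"identifier": self.identifier, "title": self.title}
        if self.state is State.RELEASED:
            page["files_available"] = True
        else:
            # Deaccessioned: keep a page describing the missing dataset.
            page["files_available"] = False
            page["notice"] = f"Dataset deaccessioned: {self.reason}"
        return page
```

In this model a deaccessioned dataset keeps its identifier and descriptive page and only loses access to its files, which matches the guarantee that a published data citation never leads to a dead link.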