
Alberto Pepe edited discussion.tex almost 10 years ago

Commit id: 7600d4c22ed8058ce4237745415b07d8687e8e90


\section{Discussion}

With this study we found that, overall, astronomers are increasingly willing to reference and share the secondary or processed data sets used to derive the results in their publications. However, these same astronomers have failed to embrace a common infrastructure to share these types of data sets. It is also unclear whether this is because such infrastructure is lacking or because it is unknown to (or untenable as a solution for) most astronomers. Interestingly, astronomy, as a field, has \textbf{pioneered} the creation of international initiatives for the collection, organization, and sharing of data. Large archives that serve primary data sets have embraced the ``virtual observatory'' (VO) concept for more than a decade. Yet astronomy's failure to provide a data sharing solution for smaller derived data sets is worth a deeper discussion in light of our survey results.

\subsection{The virtual observatory}

Focusing on efforts in the United States to facilitate a virtual observatory, we note that the 2000 decadal review by the National Research Council called for the creation of a ``National Virtual Observatory'' as its highest priority among small initiatives. It was enacted through a grant from the National Science Foundation in 2001, entitled ``Building the Framework for the National Virtual Observatory.'' (See \href{http://virtualobservatory.org/whatis/history.aspx}{this link} for a history of the US Virtual Observatory efforts.) The grant essentially implemented a vision for sharing astronomy data online put forward in a \textit{Science} article about ``The WorldWide Telescope'' by Szalay and Gray in 2001 \cite{2001Sci...293.2037S}. The scope of this research was broad, including standards development and professional outreach to scientists (see \cite{vobook}).
In 2010, NASA and NSF reached a cooperative agreement to fund and maintain a US Virtual Astronomical Observatory, implementing the research done under the 2001 Framework grant as a formal structure for tool and standards development, as well as a venue for professional and public outreach about the VO. Unfortunately, NSF announced plans (now being implemented) to de-fund its (80\%) share of the US VAO, leading to a cessation of the US VAO in September 2014. Opinions on why and how this happened are beyond the scope of this paper. What is important for our purposes is to point out that 1) the scope of both the NVO and VAO efforts skewed toward serving large, homogeneous datasets; and 2) the most robust, important, and widely adopted infrastructure-related efforts of the VAO, like the VO ``Registry'' that tools rely on to find data, are not at all secure from funding cuts. We feel that these two facts have undermined the ability of the VO to serve the data sharing needs of astronomers, while also sowing doubt in the minds of astronomers who are thinking about doing extra work to share their data. It is worthwhile to summarize the successes of these VO development efforts. Certainly, large archives of primary data have embraced standards-based data access and sharing, which is a success that will have long-lasting impacts. It is these rich interfaces that enable the creation of the kinds of data aggregation tools envisioned by Szalay \& Gray. Some tools, such as the recently released \href{http://vao.stsci.edu/portal/Mashup/Clients/Portal/DataDiscovery.html}{US VAO Data Discovery tool}, could not exist without VO tools like the ``Registry'' and the data access protocols that have been adopted by the archives. In 2008, Microsoft Research released a free software package named ``WorldWide Telescope'' (WWT), in honor of Szalay and Gray's 2001 vision.
Today, WWT, which uses a large amount of infrastructure established under the NVO and VAO grants and connects to many services developed outside the US (under the ``International Virtual Observatory Alliance'' standards), is probably the best US-origin implementation of the virtual observatory vision of connected datasets. The combination of tools offered by the Centre de Donn\'ees astronomiques de Strasbourg (\href{http://cds.u-strasbg.fr}{CDS}) also offers excellent access to VO services. Many data sets from NASA and other large survey providers are available within WWT and CDS tools, and astronomers can offer their own data in these frameworks as well, but uptake is still slower than one might imagine. One example of a medium-size survey (\href{http://www.cfa.harvard.edu/COMPLETE/data_html_pages/data.html}{COMPLETE}) being served at a research group's web site using an HTML5 WWT client is \href{http://www.worldwidetelescope.org/complete/wwtcoveragetool5.htm}{here}. A summary of the usage and functionality of WWT in research and education is offered in \citet{2012ASPC..461..267G}.

\subsection{The Dataverse Network}

The authors of this article are involved in a project, in conjunction with this study, that both educates astronomers on data management practices (e.g., \cite{citationprinciples, tenrules}) and provides a technical solution to these problems. The approach on which the project is built is rather different from that of the virtual observatory. Rather than attempting to build a framework and a related set of standards and protocols, we focused on the implementation of an easy-to-use tool that can solve an immediate problem: the storage, citation, and discovery of secondary data in astronomy. In other words, we have found with this study that many astronomers today have derived data that ``does not fit'' in a scholarly paper. Where can they store these data upon publication with a certainty that they can be retrieved, cited, and discovered?
The technical solution we developed involves the use of \href{http://thedata.org/}{the Dataverse Network}, a web application for sharing, citing, and archiving social science data \cite{King2007, Crosas2011, Crosas2013}. The Dataverse Network is an open-source software application, developed by the Institute for Quantitative Social Science (IQSS) at Harvard University \cite{King2014}. The Dataverse software is a multi-tier Java Enterprise Edition (EE) application with an underlying open-source relational database (PostgreSQL) for application data (such as users, roles, permissions) and metadata, and a file storage component for the actual data files. The application tiers include a user-interface layer, which employs the latest JavaServer Faces component suite, PrimeFaces; a business-logic layer implemented with Java Beans, which is represented by the object model; and a persistence layer. The application runs in a GlassFish application server. This common Java EE architecture is robust and easily allows scalability and extensibility. The Dataverse enables discoverability by searching across all descriptive metadata or cataloging fields, in addition to information extracted from data files. The metadata is indexed using the Lucene Index Server (Solr, in particular, in Dataverse version 4). The metadata is also mapped to various standard metadata schemas, such as Dublin Core and DDI (\url{http://www.ddialliance.org/}), and exported to XML format for preservation purposes.
% All data files and any complementary files, such as code or documentation, are stored in a Network File System.
A Dataverse Network consists of dataverses, and each dataverse can be branded or customized for an individual researcher, group, project, or journal. A dataverse owner has control over the branding, the metadata, and the sharing and release of the data, and can thus completely manage his or her own virtual data archive, while all data are stored in a centralized, public research data repository that guarantees proper archival and long-term access. The Dataverse Network follows good practices for scientific data publication: 1) it supports metadata standards and enables the inclusion of accompanying code and other materials for each dataset, 2) it provides versioning of a dataset, with easy access to previous versions of the data and metadata, and 3) it assigns a persistent identifier (DOI) and generates a full data citation, with attribution to data authors and distributors (\cite{AltmanKing2007}). The generated data citation follows the recently proposed principles for data citation, an international initiative which recognizes that `data should be considered legitimate, citable products of research' \cite{citationprinciples}. Once a dataset is released for publication, it cannot be unreleased; this guarantees that the data citation, and its persistent URL, can always be resolved to a data page that includes sufficient information about the dataset and access to the data files. In some uncommon cases, a dataset might be deaccessioned due to a retraction or legal issue, but even in these cases, the persistent identifier in the data citation will still resolve to a page with information about the missing dataset.
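
As a rough illustration of the citation elements described above (authors, title, persistent identifier, distributor, and version), the following Python sketch assembles a citation string. The \texttt{Dataset} class, the field order, and the DOI \texttt{10.1234/EXAMPLE} are assumptions made for illustration only; they do not reproduce the Dataverse Network's actual implementation or its exact citation format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dataset:
    """Hypothetical, minimal description of a deposited dataset."""
    authors: List[str]
    year: int
    title: str
    doi: str          # placeholder such as "10.1234/EXAMPLE", not a real DOI
    distributor: str
    version: int


def format_citation(ds: Dataset) -> str:
    """Build a human-readable data citation with a resolvable DOI link."""
    authors = "; ".join(ds.authors)
    return (
        f'{authors}, {ds.year}, "{ds.title}", '
        f"https://doi.org/{ds.doi}, {ds.distributor}, V{ds.version}"
    )


# Entirely made-up example record.
example = Dataset(
    authors=["Doe, Jane", "Roe, Richard"],
    year=2014,
    title="Example derived catalog",
    doi="10.1234/EXAMPLE",
    distributor="TheAstroData",
    version=2,
)
print(format_citation(example))
```

Keeping the version number in the citation string is one simple way to let a reference point at the exact release of the data that a paper used, while the DOI continues to resolve even if the storage location changes.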
\subsection{TheAstroData: an Astronomy Dataverse Network}

After an analysis of existing Dataverse Network repositories (most of which host social science data), we discovered that the Dataverse Network software could be slightly adapted and repurposed to host astronomical data. This adaptation consisted of two main enhancements to the Dataverse software: 1) a flexible, extensible metadata schema that could support fields typically needed to describe a dataset in astronomy, and 2) deep search for FITS files, that is, indexing FITS file header information to facilitate discovery of such files. Both enhancements are under continued development, as the Dataverse team receives feedback from the astronomy community through usability testing and iterations of the software.
%The metadata will be further enhanced in version 4 of the project, following standards from generic VO metadata fields.
The result of this project is \href{http://theastrodata.org/}{TheAstroData.org}, a free open-access database to host astronomy-related derived data. At the moment, the database is open to all scientific data from astronomical institutions worldwide. Administration and support are provided by the Harvard-Smithsonian Center for Astrophysics in collaboration with Harvard Library and the IQSS. Infrastructure is provided by Harvard University Information Technology Services. The hosting architecture consists of multiple load-balanced application servers and database servers, where additional servers can be added when user volume and requests increase. The data storage can also be easily increased on demand by adding space in the Network File System. Data files and metadata are backed up hourly, and the application/system files and databases are backed up daily.
In addition, data files and metadata are archived in multiple locations using LOCKSS (Lots of Copies Keep Stuff Safe).

Our TheAstroData project, following what has been successful with the Dataverse project in social science, which hosts more than 50,000 datasets and 700,000 files, also intends to achieve two main goals, both critical in data sharing:
\begin{enumerate}
\item provide an easy-to-use central repository where (small) astronomy data sets can be deposited and archived for long-term access, and
\item provide a data citation for every dataset uploaded. The citation includes a persistent identifier which links to the data, and can be added to the references sections of any publication.
\end{enumerate}
For the everyday astronomer, TheAstroData flips the equation of data sharing in a virtual observatory context on its head. It trades the interoperability that comes with homogenized data sets for ease of data sharing by astronomers. Search functions focus on descriptive metadata instead of quantified slicing of datasets by physical quantities such as location on the sky. This trade-off is not permanent, and we assert that the kinds of data access envisioned by \citet{2001Sci...293.2037S} for small published datasets can be achieved ex post facto. Our plans are to re-index (or expose the file-level metadata related to) shared data files, extracting additional numerical metadata fields to enable finer-grained search. Further, the audience for TheAstroData is completely transparent: it is focused on individual scientists or projects that have derived (and often heterogeneous) datasets to share or to publish alongside a refereed paper. It is already the case that TheAstroData datasets are linked to literature publication records in two ways.
First, we provide primary publication-to-dataset links to the SAO-NASA Astrophysics Data System (\href{http://adsabs.harvard.edu/}{ADS}), which is the universal literature resource for all of astronomy; an astronomer's TheAstroData datasets appear as ``Data Archive'' links in the primary publication's ADS record. Second, our records are listed in the \href{http://wokinfo.com/products_tools/multidisciplinary/dci}{Thomson-Reuters Data Citation Index}, which makes use of the Dataverse Network's OAI-PMH harvesting interface. Our future plans include translating the rich DDI metadata adopted by the Dataverse Network, enhanced with our astronomy-specific extensions, into VO standards and exporting this version to indexing tools such as the VO Registry (or a similar data publishing registry).

In addition to providing a curation and long-term preservation plan for derived data in astronomy, TheAstroData has two additional benefits for everyday astronomers:
\begin{enumerate}
\item In future iterations of TheAstroData, we plan to reuse the data analysis capabilities of the Dataverse Network software to allow integration of astronomers' FITS data with new visualization or analysis tools, for example, GlueViz (\url{glueviz.org});
\item The stamping of TheAstroData datasets with a standardized data citation will facilitate the adoption of data citation by publishers. It is critical that ``citations to data'' become part of the references sections in publications and are easily traceable, so that their impact can be measured. We are in conversations with relevant astronomy publishers.
\end{enumerate}

% Also as in the case of Social Science, the central repository not only serves as a mere file system to drop and access data files, but instead provides the tools to understand the nature of the data sets and how they can be reused. It accomplishes this by allowing to add descriptive metadata about the data set and complementary files such as documentation and code, and extracting metadata automatically from the data file. In quantitative social science, the most common data formats are R, SPSS and STATA, formats that allow researchers to have rich statistical metadata for data tables. These data files are recognized by the Dataverse software and the rich metadata is extracted not only for searching, but also for providing summary statistics and analysis tools for these data types. Our extension in astronomy is to provide similar rich functionality for FITS files; in the first iteration to support searching of FITS files, but in future iterations, to allow integration with visualization or analysis tools. The Dataverse also provides the infrastructure to export that metadata and make it accessible through the Open Archive Initiative (OAI) protocol, or through data and metadata RESTful APIs, so that it can be easily harvested by other systems and make the datasets more easily discoverable by the astronomy community.
%A formal data citation is the other key piece of data sharing. It provides a persistent link between the publication and the data set, so that if the location of the data set changes in the future, the persistent link can still be resolved to the same data set. It also provides attribution to the various contributors - authors and data producers or providers - properly given credit to the authors that collected and process the data.
% Finally, a formal, standardized data citation is needed to facilitate the adoption of data citation by publishers - it is critical that this type of citations become part of the references sections in publications, and are easily traceable to derive their impact.
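
To make the idea of deep search for FITS files more concrete, the following Python sketch shows one way header keywords could be extracted from a FITS file so that they can be handed to a search index. It is a minimal sketch under stated assumptions, not the Dataverse implementation: the function names \texttt{parse\_value}, \texttt{read\_header}, and \texttt{make\_card} are hypothetical, only the primary header is read, and continued or otherwise unusual cards are ignored. It relies only on the fixed layout of FITS headers (80-character cards in 2880-byte blocks, terminated by an \texttt{END} card), and the header shown is synthetic.

```python
from typing import Dict, Union

CARD = 80      # bytes per header card in the FITS standard
BLOCK = 2880   # FITS headers are padded to multiples of this size

Value = Union[str, bool, int, float]


def parse_value(raw: str) -> Value:
    """Convert the value field of one card to a Python object."""
    raw = raw.strip()
    if raw.startswith("'"):
        # String value: ends at the next quote that is not doubled.
        end = 1
        while end < len(raw):
            if raw[end] == "'":
                if raw[end + 1:end + 2] == "'":
                    end += 2
                    continue
                break
            end += 1
        return raw[1:end].replace("''", "'").rstrip()
    raw = raw.split("/", 1)[0].strip()   # drop the trailing comment
    if raw in ("T", "F"):
        return raw == "T"
    try:
        return int(raw)
    except ValueError:
        try:
            return float(raw.replace("D", "E"))
        except ValueError:
            return raw


def read_header(data: bytes) -> Dict[str, Value]:
    """Read keyword/value pairs from the primary header of a FITS byte string."""
    header: Dict[str, Value] = {}
    for start in range(0, len(data), CARD):
        card = data[start:start + CARD].decode("ascii", errors="replace")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        # Only cards with "= " in columns 9-10 carry a value.
        if card[8:10] == "= ":
            header[keyword] = parse_value(card[10:])
    return header


def make_card(text: str) -> bytes:
    """Pad one card to 80 bytes (helper for the synthetic example)."""
    return text.ljust(CARD).encode("ascii")


# Synthetic header for illustration; the values are made up.
cards = [
    "SIMPLE  =                    T / conforms to FITS standard",
    "BITPIX  =                  -32 / bits per data value",
    "NAXIS   =                    2 / number of axes",
    "OBJECT  = 'NGC 1333'           / hypothetical target name",
    "EQUINOX =               2000.0",
    "COMMENT synthetic header for illustration",
    "END",
]
raw = b"".join(make_card(c) for c in cards).ljust(BLOCK, b" ")

index_fields = read_header(raw)
print(index_fields)
```

The resulting dictionary of keyword/value pairs is the kind of file-level metadata that could be added to a search index next to the descriptive metadata entered by the depositor. In practice, an established FITS library would be a more robust choice than a hand-written parser.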