Alberto Pepe Changed urls to hrefs over 10 years ago

Commit id: 1b17787b5d3642c5fc9c195b23a31c2189729265

\subsection{The virtual observatory}

Focusing on efforts in the United States to facilitate a virtual observatory, we note that the 2000 decadal review by the National Research Council called for the creation of a "National Virtual Observatory" as its highest-priority small initiative. It was enacted with a grant from the National Science Foundation in 2001, entitled "Building the Framework for the National Virtual Observatory." (See \href{http://virtualobservatory.org/whatis/history.aspx}{this link} for a history of the US Virtual Observatory efforts.) The grant essentially implemented a vision for sharing astronomy data online put forward in a \textit{Science} article about "The WorldWide Telescope" by Szalay and Gray in 2001 \cite{2001Sci...293.2037S}. The scope of this research was broad, including standards development and professional outreach to scientists (see \cite{vobook}). In 2010, NASA and NSF reached a cooperative agreement to fund and maintain a US Virtual Astronomical Observatory (VAO), implementing the research done under the 2001 Framework grant as a formal structure for tool and standards development, as well as a venue for professional and public outreach about the VO. Unfortunately, NSF announced plans (now being implemented) to de-fund its (80\%) share of the US VAO, leading to a cessation of the US VAO in September 2014. Opinions on why and how this happened are beyond the scope of this paper. What is important for our purposes is to point out that 1) the scope of both the NVO and VAO efforts skewed toward serving large, homogeneous datasets; and 2) the most robust, important, and widely adopted infrastructure-related efforts of the VAO, such as the VO "Registry" that tools rely on to find data, are not at all secure from funding cuts.
These two facts, we feel, have undermined the ability of the VO to serve the data sharing needs of astronomers, while also raising doubts in the minds of astronomers who are considering doing extra work to share their data.

It is worthwhile to summarize the successes of these VO development efforts. Certainly, large archives of primary data have embraced standards-based data access and sharing, a success that will have long-lasting impacts. It is these rich interfaces that enable the creation of the kinds of data aggregation tools envisioned by Szalay \& Gray. Some tools, such as the recently released \href{http://vao.stsci.edu/portal/Mashup/Clients/Portal/DataDiscovery.html}{US VAO Data Discovery tool}, could not exist without VO tools like the "Registry" and the data access protocols that have been adopted by the archives.

In 2008, Microsoft Research released a free software package named "WorldWide Telescope" (WWT), in honor of Szalay and Gray's 2001 vision. Today, WWT, which uses a large amount of infrastructure established under the NVO and VAO grants and connects to many services developed outside the US (under the "International Virtual Observatory Alliance" standards), is probably the best US-origin implementation of the virtual observatory vision of connected datasets. The combination of tools offered by the Centre de Donn\'ees astronomiques de Strasbourg (\href{http://cds.u-strasbg.fr}{CDS}) also offers excellent access to VO services. Many data sets from NASA and other large survey providers are available within WWT and CDS tools, and astronomers can offer their own data in these frameworks as well, but uptake is still slower than one might imagine.
One example of a medium-size survey (\href{http://www.cfa.harvard.edu/COMPLETE/data_html_pages/data.html}{COMPLETE}) being served at a research group's web site using an HTML5 WWT client is \href{http://www.worldwidetelescope.org/complete/wwtcoveragetool5.htm}{here}. A summary of the usage and functionality of WWT in research and education is offered in \citet{2012ASPC..461..267G}.

\subsection{The Dataverse Network}

The authors of this article are involved in a project, in conjunction with this study, that both educates astronomers on data management practices (e.g., \cite{citationprinciples, tenrules}) and provides a technical solution to these problems. The approach on which the project is built is rather different from that of the virtual observatory. Rather than attempting to build a framework and a related set of standards and protocols, we focused on the implementation of an easy-to-use tool that can solve an immediate problem: the storage, citation, and discovery of secondary data in astronomy. In other words, we have found with this study that many astronomers today have derived data that "does not fit" in a scholarly paper. Where can they store these data upon publication with a certainty that they can be retrieved, cited, and discovered? The technical solution we developed involves the use of \href{http://thedata.org/}{the Dataverse Network}, a web application for sharing, citing, and archiving social science data \cite{King2007, Crosas2011, Crosas2013}. The Dataverse Network is an open source software application, developed by the Institute for Quantitative Social Science (IQSS) at Harvard University \cite{King2014}.
The Dataverse software is a multi-tier Java Enterprise Edition (EE) application with an underlying open-source relational database (PostgreSQL) for application data (such as users, roles, and permissions) and metadata, and a file storage component for the actual data files. The application tiers include a user-interface layer, which employs the JavaServer Faces component suite PrimeFaces; a business-logic layer implemented with Java Beans, which is represented by the object model; and a persistence layer. The application runs in a GlassFish application server. This common Java EE architecture is robust and makes the application easy to scale and extend.

The Dataverse enables discoverability by searching across all descriptive metadata or cataloging fields, in addition to information extracted from data files. The metadata is indexed using Lucene (specifically Solr, in Dataverse version 4). The metadata is also mapped to various standard metadata schemas, such as Dublin Core and \href{http://www.ddialliance.org/}{DDI}, and exported to XML format for preservation purposes. All data files and any complementary files, such as code or documentation, are stored in a Network File System.

A Dataverse Network consists of dataverses, and each dataverse can be branded or customized for an individual researcher, group, project, or journal. A dataverse owner has control over the branding, the metadata, and the sharing and release of the data, and can thus fully manage their own virtual data archive, while all data are stored in a centralized, public research data repository that guarantees proper archival and long-term access.
The Dataverse Network follows good practices for scientific data publication: it 1) supports metadata standards and enables the inclusion of accompanying code and other materials for each dataset, 2) provides versioning of a dataset, with easy access to previous versions of the data and metadata, and 3) assigns a persistent identifier (DOI) and generates a full data citation, with attribution to data authors and distributors (\cite{AltmanKing2007}). The generated data citation follows the recently proposed principles for data citation, an international initiative which recognizes that "data should be considered legitimate, citable products of research" \cite{citationprinciples}. Once a dataset is released for publication, it cannot be unreleased. This guarantees that the data citation and its persistent URL can always be resolved to a data page that includes sufficient information about the dataset and access to the data files. In some uncommon cases, a dataset might be deaccessioned due to a retraction or legal issue, but even in these cases, the persistent identifier in the data citation will still resolve to a page with information about the missing dataset.

\item provide a data citation for every dataset uploaded. The citation includes a persistent identifier which links to the data, and can be added to the references section of any publication.
\end{enumerate}

For the everyday astronomer, TheAstroData flips the equation of data sharing in a virtual observatory context on its head. It trades the interoperability that comes with homogenized data sets for ease of data sharing by astronomers. Search functions focus on descriptive metadata instead of quantified slicing of datasets by physical quantities such as location on the sky. This trade-off is not permanent, and we assert that the kinds of data access envisioned by Szalay \& Gray for small published datasets can be achieved ex post facto. Our plans are to re-index (or expose the file-level metadata related to) shared data files, extracting additional numerical metadata fields to enable finer-grained search. Further, the audience for TheAstroData is completely transparent and focused on individual scientists or projects that have derived (and often heterogeneous) datasets to share or to publish alongside a refereed paper.

TheAstroData datasets are already linked to literature publication records in two ways. First, we provide primary publication-to-dataset links to the SAO/NASA Astrophysics Data System (\href{http://adsabs.harvard.edu/}{ADS}), which is the universal literature resource for all of astronomy; an astronomer's TheAstroData datasets appear as "Data Archive" links in the primary publication's ADS record. Second, our records are listed in the \href{http://wokinfo.com/products_tools/multidisciplinary/dci}{Thomson-Reuters Data Citation Index}, which makes use of the Dataverse Network's OAI-PMH harvesting interface.
Our future plans include translating the rich DDI metadata standards adopted by the Dataverse Network, and enhanced with our astronomy-specific extensions, into VO standards and exporting this version to indexing tools such as the VO Registry (or a similar data publishing registry).

We anticipate that our adoption of the Dataverse Network for TheAstroData will have two additional benefits for everyday astronomers:
\begin{enumerate}
\item In future iterations of TheAstroData, we plan to reuse the data analysis capabilities of the Dataverse Network software to allow integration of astronomers' FITS data with new visualization or analysis tools, for example, GlueViz \url{glueviz.org};