How do astronomers share data? Reliability and persistence of datasets linked in AAS publications and a qualitative study of data practices among US astronomers.
We analyze data sharing practices of astronomers over the past fifteen years. An analysis of URL links embedded in papers published by the American Astronomical Society reveals that the total number of links included in the literature rose dramatically from 1997 until 2005, when it leveled off at around 1500 per year. This rise indicates an increased interest in data-sharing over the same period in which the web saw its most dramatic growth in usage in the developed world. The analysis also shows that the availability of linked material decays with time: by 2011, 44% of the links published a decade earlier, in 2001, were broken. A rough analysis of link types reveals that links to data hosted on astronomers’ personal websites become unreachable much faster than links to datasets on curated institutional sites. To gauge astronomers’ current data sharing practices and preferences further, we performed in-depth interviews with 12 scientists and an online survey of 173 scientists, all at a large astrophysical research institute in the United States: the Harvard-Smithsonian Center for Astrophysics, in Cambridge, MA. Both the in-depth interviews and the online survey indicate that, in principle, there is no philosophical objection to data-sharing among astronomers at this institution, and nearly all astronomers would share as much of their data as others wanted if it were practicable. Key reasons that more data are not presently shared more efficiently in astronomy include: the difficulty of sharing large data sets; overreliance on non-robust, non-reproducible mechanisms for sharing data (e.g., emailing it); unfamiliarity with options that make data-sharing easier (faster) and/or more robust; and, lastly, a sense that other researchers would not want the data to be shared. We conclude with a short discussion of a new effort to implement an easy-to-use, robust system for data sharing in astronomy, at theastrodata.org, and we analyze the uptake of that system to date.
No, I don’t have a website where I store these data. Most of it is in various stages of mess. —An Astronomer
Astronomical observations can generate very large volumes of data, and observations taken at a particular time are by definition irreplaceable and unrepeatable. As such, making astronomical data publicly available in a structured, intelligible format is essential for scientific transparency and for long-term data curation and preservation, which in turn facilitate data re-use (King 1995).
To date, some of the most systematically planned data sharing in astronomical research has focused on the preservation and dissemination of observations created in so-called “sky surveys.” The purpose of these surveys is to collect and measure data from extended regions of the sky, in a systematic and controlled fashion. Modern optical sky surveys, such as the Sloan Digital Sky Survey (SDSS), the 2-Micron All-Sky Survey (2MASS), and the future Large Synoptic Survey Telescope (LSST), generate massive databases, ranging in size from hundreds of terabytes to hundreds of petabytes (Borne 2010). Surveys that rely on spectrally-resolved observations, often made with radio-wavelength interferometers, generate “3D Data Cubes” rather than “2D images,” and they are already so large that it is not possible to keep all the raw data after analysis is complete.
Despite their sheer volume, the data collected in the context of large surveys represent only a portion of all the data generated in Astronomy. Most discoveries rely upon smaller studies, and/or are based on heavily-processed subsets of many surveys. In any field of scientific endeavor, many different levels of data exist (Borgman 2012): from “raw” data to “processed” data, from “calibration” data to “published” data. If we imagine all data in Astronomy to be a pyramid, primary data from large sky surveys occupy the bottom half of the pyramid. But, as we just mentioned, these primary data are used by astronomers all over the world to produce more specific studies, in which astronomers analyze and process primary data in many ways, producing derived data.
The physical and astronomical sciences have a well established reputation for being disciplines with a strong culture of data sharing. Astronomy has pioneered Open Access to both publications and data. In fact, the data generated by large sky surveys, such as those indicated above, are often collected under government-sponsored grants, archived by government-sponsored institutions (e.g., NASA), and made publicly available to anyone (e.g., at http://archive.stsci.edu/). The fact that astronomical data from large surveys are publicly available is remarkable, but by no means surprising. Astronomers collect data about the Universe, and thus, they may feel a moral obligation to share collected data openly. Moreover, most US granting agencies relevant to Astronomy (e.g., NASA, NSF) now require data to be made openly available.
Astronomers often have access to efficient and robust mechanisms that serve to archive, curate, and make primary data available (e.g. http://archive.stsci.edu/, http://ned.ipac.caltech.edu/, http://skyview.gsfc.nasa.gov/, http://simbad.u-strasbg.fr/simbad/). But very few parallel systems exist for derived data. Because most, if not all, scientific articles in Astronomy are based on derived data, making such data visible, intelligible and available to the public is of fundamental importance.
In this article, we analyze how the sharing, archiving, and citing of derived astronomical data are presently accomplished. Our research is based upon a quantitative link structure analysis and a qualitative study, composed of interviews and a survey. The results of this article are divided into two sections, accordingly.
In the first part of the results, we report on a link analysis performed on all articles in the Astronomy journals published by the American Astronomical Society (AAS) between 1997 and 2008. To carry out this analysis, we collaborated with the leaders of the “
In the second part of the results section, we report findings from a personal interview study conducted with a dozen astronomers at the Harvard-Smithsonian Center for Astrophysics and a follow-up survey conducted at the same institution (173 respondents). The Center for Astrophysics is a large astrophysics institution in the United States with roughly 1000 employees, 300 of whom are PhD researchers from around the globe. The purpose of this dual qualitative study was to document the data sharing practices of an astronomical community in a semi-structured format. We found that 1) astronomers produce derived data in standard astronomical formats, 2) they are overwhelmingly willing to share their data with their peers and the public, 3) they are normally unaware of mechanisms for archiving and citing derived data, and 4) they rely upon non-automated, non-standard methods to acquire and provide derived data (e.g., they put derived data on their website and link to it, they contact paper authors to obtain data).
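The quantity at the center of the link analysis, the fraction of published links that are broken for each publication year, can be illustrated with a short sketch. This is not the procedure used in our study, whose details are given in the results; it is a minimal illustration under stated assumptions. The function names `is_reachable` and `broken_fraction_by_year`, the use of an HTTP HEAD request, and the rule that any error or status of 400 or above counts as "broken" are all hypothetical choices made for this example.

```python
from collections import defaultdict
from http.client import HTTPException
from typing import Callable, Dict, Iterable, Tuple
from urllib.error import URLError
from urllib.request import Request, urlopen


def is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Hypothetical reachability test: True if a HEAD request succeeds.

    Assumption: a link counts as "broken" if the request raises an error
    or returns an HTTP status of 400 or above.
    """
    try:
        request = Request(url, method="HEAD")
        with urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (URLError, HTTPException, ValueError, OSError):
        return False


def broken_fraction_by_year(
    links: Iterable[Tuple[int, str]],
    check: Callable[[str], bool] = is_reachable,
) -> Dict[int, float]:
    """Return {publication year: fraction of links that fail `check`}."""
    totals: Dict[int, int] = defaultdict(int)
    broken: Dict[int, int] = defaultdict(int)
    for year, url in links:
        totals[year] += 1
        if not check(url):
            broken[year] += 1
    return {year: broken[year] / totals[year] for year in sorted(totals)}
```

As a usage example that needs no network access, the checker can be replaced by a stand-in that treats a known set of (made-up) URLs as dead:

```python
links = [
    (2001, "http://a.example/x"),
    (2001, "http://b.example/y"),
    (2005, "http://c.example/z"),
]
dead = {"http://a.example/x"}
print(broken_fraction_by_year(links, check=lambda u: u not in dead))
# prints {2001: 0.5, 2005: 0.0}
```

Passing the checker as a parameter keeps the aggregation separate from the network test, so a different definition of "broken" can be substituted without changing the per-year computation.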