% Template for PLoS
% Version 1.0 January 2009
%
% To compile to pdf, run:
% latex plos.template
% bibtex plos.template
% latex plos.template
% latex plos.template
% dvipdf plos.template

\documentclass[10pt]{article}

% amsmath package, useful for mathematical formulas
\usepackage{amsmath}
% amssymb package, useful for mathematical symbols
\usepackage{amssymb}

% graphicx package, useful for including eps and pdf graphics
% include graphics with the command \includegraphics
\usepackage{graphicx}

% cite package, to clean up citations in the main text. Do not remove.
\usepackage{cite}

\usepackage{color}
\usepackage{url}

% Use doublespacing - comment out for single spacing
%\usepackage{setspace}
%\doublespacing

% Text layout
\topmargin 0.0cm
\oddsidemargin 0.5cm
\evensidemargin 0.5cm
\textwidth 16cm
\textheight 21cm

% Bold the 'Figure #' in the caption and separate it with a period
% Captions will be left justified
\usepackage[labelfont=bf,labelsep=period,justification=raggedright]{caption}

% Use the PLoS provided bibtex style
\bibliographystyle{plos2009}

% Remove brackets from numbering in List of References
\makeatletter
\renewcommand{\@biblabel}[1]{\quad#1.}
\makeatother

% Leave date blank
\date{}

\pagestyle{myheadings}

\begin{document}

\title{Handling, archiving, and citing data in astronomy}
\author{Alberto Pepe, Alyssa Goodman, August Muench, Merce Crosas, Christopher Erdmann}
\maketitle

\section{Abstract}

We report the results of interviews with astronomers at the Harvard-Smithsonian Center for Astrophysics.
\cite{http://adsabs.harvard.edu/abs/2011ApJ...743..201P}
\cite{http://adsabs.harvard.edu/abs/2009Natur.457...63G}

\begin{quote}
No, I don't have a website where I store these data. Most of it is in various stages of mess.
-- An Astronomer
\end{quote}

\section{Introduction}

Astronomers produce and peruse vast amounts of scientific data.
Making these data publicly available is important to enable both reproducible research and long-term data curation and preservation (King, 1995, ``Replication, Replication'', Political Science and Politics, 28: 443--449). Because of their sheer size, however, astronomical data are often left out entirely from scientific publications and are thus hard to find and obtain. In recent years, more and more astronomers have chosen to store their data and make them available on institutional repositories, personal websites, and digital data libraries. Just to show how citations work, here is a cite to Batista's work \cite{batista} and Leo Egghe's \cite{gindex}. And here is a citation which is not even in the bibliography file, but which is on ADS, so it can be cited by URL \cite{http://adsabs.harvard.edu/abs/2007prpl.conf..133G}.

We describe the use of personal data repositories as a means to enable the publication of data by individual astronomy researchers. By repository we mean
In astronomy, this accumulation might include the collection of bits of raw images taken at the telescope or subsets of processed data from a space observatory archive. From this collection or pile of data, the data stack is distilled into new research objects. For example, raw spectra are calibrated and combined into a higher signal-to-noise (S/N) data product. These distilled products are further refined, even chopped up into smaller bits, in which the relevant scientific information is much more highly concentrated; we consider such a packet of knowledge ``publishable''. Consider this flow of information, and then consider just how linear it appears to be. The typical end of this evolution of accumulation and distillation of the research data is the publication. There are a few problems with data objects appearing in papers. If they appear at all, they capture the most refined research objects. They fork only with the paper.
They avoid curation by domain-specific experts: the journals have neither a peer-review process nor an editorial process for ``data''. Worse, they are not trackable in the papers. Even if they do have identifiers and even if those identifiers, these data products require a different framework for reuse. By data materials, we mean any data product available on the web which was either instrumental for the pursuit of research, e.g., raw data from astronomical archives, or generated in the context of research, e.g., reduced and processed data presented in a paper.

\section{Results}

\subsection{Exploratory analysis of data citation practices}

To begin, we mine a corpus of astronomy articles for external web links. By ``external web link'' we mean any outgoing link embedded in the final published version of an article (e.g., its PDF or HTML format) which points to an online resource in the \url{http} (or \url{https}) URI scheme. The purpose of this exploratory analysis is to assess whether astronomers use links within articles to point to datasets and related supplemental data resources. We analyze a corpus of all articles published in the four main astronomy journals (The Astrophysical Journal, The Astrophysical Journal Letters, The Astrophysical Journal Supplement, The Astronomical Journal) between 1997 and 2008. We find a total of $13447$ potential links to datasets in a total of $7641$ publications. The detailed procedure by which potential data links are selected and filtered is described in the Materials and Methods section. In the barplot of Figure \ref{fig:barplot} we show how linking practices have changed over time. Links to potential data resources in astronomy first appear in 1997, with only a couple of dozen links published in that year, and their number increases quickly every year to reach around $1500$ links per year in 2005. After 2005, the volume of total published links stays roughly the same every year.
The graph shows that, with widespread use and adoption of the WWW, linking to online resources within published articles has become more and more popular. The bars in the barplot of Figure \ref{fig:barplot} also depict whether published links are still available as of December 2011: the green portion of each bar represents the volume of valid links (HTTP status code 200: OK), while the grey portion represents broken links (HTTP status codes 3xx, 4xx, and 5xx). This link categorization shows that half or more of all links published prior to 2001 are now broken. The percentage of broken links decreases with time to reach roughly 10\% in 2008: one in ten links included in astronomy papers in 2008 is unreachable three years later.

This analysis can be pushed further by exploring two distinct subsets of the astronomy link corpus. In Figure \ref{fig:lines} we show how the percentages of broken links differ over time for a set of $1801$ links to personal websites (links which contain the tilde symbol, \textasciitilde, which is usually reserved for personal web pages on institutional servers) and a set of $3731$ links to institutional, curated archives (links to a manually selected list of domains that are obvious astronomy archives, such as \url{archive.stsci.edu}). Distinguishing between these two categories of links is of crucial importance. The former set of links, the ``tilde links'', are potential pointers to datasets found on personal websites. These may consist of data tables and images which are the product of data analysis and reduction procedures described in the accompanying paper. As such, they do not belong to larger curated archives, which normally host raw data only. Ideally, these datasets would be included in the full text of the article, but oftentimes they are too large to fit within the format of a published paper and are instead hosted on a personal server and linked from within the paper.
The latter set of links, the ``curated archives'' links, is instead a collection of pointers to established archives and repositories, managed and curated by institutions, surveys, and telescope sites. Authors may want to link to these resources to cite and acknowledge the raw data sources that they employed in their research. Figure \ref{fig:lines} shows that the availability of these two categories of links follows very different, yet expected, patterns. The vast majority of ``tilde links'' published between 1997 and 2003 are no longer available (personal links are depicted as a black solid line and circles). Astronomers change locations, jobs, and institutions and, as such, their personal web servers change or expire over time. However, the percentage of broken links to personal websites falls rapidly: nearly all ``tilde links'' published in 2008 are still accessible today. A different scenario emerges when one looks at the temporal pattern for links to curated archives (depicted in the graph as a red line and crosses): the percentage of broken links stays roughly the same over time (between 15\% and 20\%), indicating that curated, institutional websites are much less vulnerable to temporal effects than personal websites.

This exploratory analysis reveals three key findings. First, since the inception of the web in the early 1990s, astronomers have increasingly used links in articles to cite datasets and other resources which do not fit in the traditional referencing schemes for bibliographic materials. Second, as for nearly every resource on the web, availability of linked material decays with time: old links to astronomical materials are more likely to be broken than more recent ones. Third, links to ``personal datasets'', i.e., links to potential data hosted on astronomers' personal websites, become unreachable much faster than links to curated ``institutional datasets''.
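
As an illustration only, the link categories used in this analysis can be written down in a few lines of Python. This is a minimal sketch under stated assumptions, not the code used for the study: the function names and the one-entry archive list are hypothetical, and only the rules themselves come from the text (status code 200 versus 3xx, 4xx, and 5xx; the tilde test; the slash count described in the Materials and Methods section).

\begin{verbatim}
from urllib.parse import urlparse

# Assumption: a one-entry stand-in for the manually selected archive list.
CURATED_ARCHIVES = {"archive.stsci.edu"}


def is_broken(status_code):
    """Valid links return 200; 3xx, 4xx, and 5xx count as broken."""
    return 300 <= status_code <= 599


def link_category(url):
    """Label a link as 'personal' (tilde), 'curated', or 'other'."""
    if "~" in url:
        return "personal"
    if urlparse(url).netloc.lower() in CURATED_ARCHIVES:
        return "curated"
    return "other"


def path_slashes(url):
    """Count forward slashes other than the two in the leading 'http://'."""
    return url.split("://", 1)[-1].count("/")


def is_potential_data_link(url):
    """Keep links that sit at least two slashes below the site root."""
    return path_slashes(url) >= 2
\end{verbatim}

With these definitions, a status code of 404 marks a link as broken, any link containing a tilde is counted as personal, and the link to \url{http://www.sdss.org} has zero path slashes and is therefore not treated as a potential data link.
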
These findings point to a preliminary realization: astronomers do have a need to reference and include data materials in their published work. Since they lack a standardized mechanism to reference these resources (data citations do not normally fit in the format, structure, and scope of published journal articles), they attempt to cite datasets using simple linking from within articles. Results from this preliminary analysis prompted a qualitative interview study, described below.

\subsection{Interview results}

We conducted interviews with a dozen astronomers of the Harvard-Smithsonian Center for Astrophysics working in disparate fields of astronomy and at different stages of their careers: postdoctoral researchers, staff scientists, tenure-track and tenured faculty. All interviews were conducted in person between March and July 2011. The purpose of the interviews was to gather a first-hand account of the needs and challenges of data referencing and archiving in astronomy. Our interview rubric was loosely based on the Data Curation Profiles Toolkit developed by the Distributed Data Curation Center at the Purdue University Libraries and the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign (\url{http://datacurationprofiles.org/}). Before every interview we created a record of the interviewee which contained key information such as name, academic role, affiliation, department, area of specialization, and website, as well as an annotated list of recent and/or prominent astronomy projects pursued and published datasets, and pointers to one or two recently published articles, possibly containing links to datasets. The template for our semi-structured interview consists of questions revolving around these topics:

\begin{description}
\item[A story] We begin with a very open-ended question, asking astronomers to tell us a story about their data.
In the case of very prolific authors, we ask them to focus their story around a specific paper or project. We allow the researchers to discuss their research, their data practices, their data output, their scientific workflow, and their community of practice. With this first question, we gauge potential projects and papers and we steer the conversation towards a specific one, which becomes the subject of the following questions.
\item[Generated output] What were the important stages of data production, analysis and interpretation? Did you collect new data? Archival data? How dependent are your results on the software tools used in each stage of the data analysis? Did you create new software?
\item[Availability] Are any/all of these data currently available for download/perusal? If yes, where? What platform are you using? What stages, versions or types of the data are available? If not, why not? Would you be happy to make those data available?
\item[Data citation] How can your data be cited/referenced? Can you pinpoint some publications that were clearly based on these data? Are these publications on ADS?
\item[Format and size] Are the data available as separate files? What formats are they in? How large are they?
\item[Ownership] What sort of licensing do you envision for your data? Do you have contractual obligations and/or restrictions to preserve or share your data?
\item[Desired features] If your data were to be made available on a platform that allows their storage, discovery, and citation, would you want to offer visualizations of your data? Would you want to allow users to run simple statistical analyses on your data? Would you allow users to download the entire datasets or portions thereof?
\end{description}

\subsubsection{Data stories}

During the interviews, we listened to a very diverse collection of data stories.
In most cases, the stories were very much rooted not only in the specific project that we were being told about, but in the data practices of a given subdiscipline of astronomy. For example, an interviewee working with quasars monitors and regularly publishes flux density data which are used for calibration purposes. These data are relatively limited in size and are hosted on an institutional webserver: \begin{quote} There is a website which is essentially a flat ASCII file that has information for a particular day for a given number of quasars. I convert the raw data into a standard format with columns: source, date, time frequency, flux and error. \end{quote} Another example is an interviewee working with galaxy clusters who told us that the amount of data handled and processed in their research is so large that it involves the joint work of many staff scientists and graduate students. Hosting and providing access to the various levels of data involved in the production of the final reduced data is beyond the capabilities of a single research group. In their own words: \begin{quote} We could certainly put a data table in the publication with very heavily digested quantities like velocity dispersion and number of galaxies, but those things are derived from upstream raw data. You would argue that it would be more value to the community if we were to make the image archive available. I am probably not going to send all the Magellan and HSST images to the ApJ though. But I could well imagine twenty years in the future that that image archive has more endured value than our attempt to extract information out of those images. \end{quote} These two examples are telling of the differing scales at which data practices operate: from small continually-updated datasets which are currently hosted on personal webservers to large, collaboration-enabled surveys whose data do not have an obvious home. 
Overall, we found that the mechanisms by which data are used and handled differ widely from project to project and between different subdomains and wavelengths.

\subsubsection{Generated output}

As with the previous question, the data products generated in the context of different research endeavors, and their production mechanisms, varied greatly between projects. One interviewee, for example, indicated that the source of their research is entirely archival data and that the bulk of their research is writing the software and running analyses with it:
\begin{quote}
We just used and combined catalog data from many different large area surveys containing photometric description of different extragalactic sources (galaxies and quasars): their magnitude, fluxes, and morphological parameters. Then we subjected these large tables to some Machine Learning methods to estimate the redshift of the sources. The result was an augmented table which included additional information about estimates of photometric redshifts.
\end{quote}
In some other cases, astronomers were interested exclusively in the scientific findings of their research; the mechanisms by which the data were reduced and analyzed might not have been documented properly:
\begin{quote}
We didn't write software from scratch, but we used it in ways that might not be so easily reproduced. That's what you read in the data section of a paper when it says something like: \textit{we smoothed the data to such and such a resolution and then we did this and then we did that}. Whether the person [running the analysis] gets the order of the steps right may actually affect the final outcome. I am not sure whether these software workflows got perfectly documented.
\end{quote}
Despite the many types of data products generated, a visible thread of similarity between responses can be found in the prominence of social and human factors involved in the production of these data products.
Interviewees often reported that the various levels of data generated are entirely in the hands of the people involved in the projects. One interviewee summarized the prevalence of this practice as follows:
\begin{quote}
If we were rich and organized we would be like Sloan and we would have: Data release 1.0, Data release 2.0, etc. But we have more like: Graduate student 1, Graduate student 2, Graduate student 3 (laughs)
\end{quote}

\subsubsection{Availability}

All the astronomers interviewed in this study state that they are willing to share with the public all the reduced data generated in the context of the discussed projects. Only two-thirds of them, however, have gone through the effort of storing the data and making them available online. The vast majority of those who currently make their reduced data available online choose to use a dedicated personal webserver, generally accessible from the Principal Investigator's personal website or group laboratory page. The flavors and levels of data offered on these personal webservers differ greatly among projects, however. Some astronomers limit themselves to posting the minimum amount of data necessary to supplement a published article, or to accommodate the requests of the referees to see the data. In some other cases, astronomers post various levels of data, from raw to reduced data. Yet whether the amount and description of data supplied are sufficient to entirely replicate a study is unclear and varies from case to case. One astronomer admits that access to raw data is a barrier to reproducibility of results:
\begin{quote}
Could we get the raw data from that survey? We did not archive the totally raw unreduced data but there is a tape library somewhere with all the data, but it would be difficult to find. And so I'd give you maybe sixty percent odds that we could get that data now. Those raw data were taken in 2001, 2003, 2004, and maybe some in 2005. I don't even remember.
\end{quote}
Another astronomer working with raw data from a larger survey (the Sloan Digital Sky Survey, SDSS) indicated that the raw data used in their study are indeed available (on the SDSS archives), but has doubts about whether linking raw to reduced data has real utility:
\begin{quote}
How many people re-reduce SDSS images? I make a guess: there are probably ten people on the face of earth that ever re-reduced Sloan images.
\end{quote}
Only a couple of the interviewed astronomers employed other techniques to make the data available, which do not involve posting data to a private webserver. For example, the catalogs of photometric redshifts discussed earlier were made available via dedicated services in the Virtual Observatory (VO) framework. They can be accessed through the VO registry and through a number of popular astronomy applications.

\subsubsection{Data citation}

Interviewees are also unsure about the best way for other researchers to cite their data. If they have published a ``data paper'', i.e., a refereed article describing the data, the data collection, and analysis in detail, they prefer to receive a citation to the paper. In all other cases, they are happy to just receive a mention of the data via a URL link pointing to the data or an acknowledgement in the publication.
\begin{quote}
Journals don't seem to be concerned with standardizing that [how data are cited]. If you use the data from someone else's project then we just say we downloaded it from the archive. Sometimes people cite the program number and other times people go through the trouble of seeing if a paper has been published on it.
\end{quote}

\subsubsection{Format and size}

The astronomers unanimously indicated FITS (Flexible Image Transport System) to be the data format of choice for all their data needs. As one astronomer aptly summarized:
\begin{quote}
The FITS format does everything I need. It's hard to change. It is a ubiquitous self-defining data structure.
You can download one from 20 years ago and it still works.
\end{quote}
As for size, the spectrum was much more diverse, with some small datasets, e.g., in the range of a few megabytes for quasar flux density data; some medium-sized datasets, e.g., up to a dozen gigabytes in total for the thermal emission data from the survey of star-forming regions; and some much larger archives on the order of many terabytes, e.g., for galaxy cluster image data.

\subsubsection{Ownership}

Astronomy is a discipline which studies a subject matter (celestial objects and astronomical phenomena) that is by definition in the public domain. This is probably why the inclination to share data seems to be ingrained in the mindset of virtually all astronomers. None of the interviewed researchers indicated that the data were ``theirs'' or that they were under contractual agreements or working under restrictions that would prevent them from sharing their reduced data. All astronomers indicated that their data, no matter how reduced and ingested from their original raw format, were public data. This remark was stressed even more by two interviewed ``computational astronomers'' whose research is based on the aggregation and analysis of data in existing astronomical catalogs:
\begin{quote}
We truly believe that sharing data is the right thing to do, simply because the original data we used for this study was not ours. Our study was only possible because other astronomers made their data publicly available in the first place!
\end{quote}

\subsubsection{Desired features}

We asked astronomers whether they could think of any specific features that an online hosting platform for their reduced data should have in order to allow easy access, visualization, and analysis by users. All respondents indicated that such a platform should, at the most basic level, allow citation and download of the data.
Another very basic feature suggested by nearly every interviewee is the ability to select and download only a subset of the data available for a specific project, rather than the entire dataset. Thus, for example, a user should be able to select a region of the sky delimited by coordinates (Right Ascension, Declination and an angular radius) and download matching observations for that region. For time-varying phenomena, the ability to subset by temporal parameters was indicated. Only a small portion of the people interviewed indicated the need for a more sophisticated filtering and subsetting mechanism, supported by a strong query language and/or full interoperability with existing frameworks, such as the VO registry. Interestingly, none of the interviewed astronomers suggested that the data hosting platform feature advanced analysis and visualization techniques.

\section{Discussion}

We find that astronomers are increasingly willing to reference and share the secondary or processed data sets used to derive the results in their publications. However, a common infrastructure to share this type of data set, and guidelines for good practices on how to cite them, are still lacking. This results in data references that become invalid over time and in incomplete publications which cannot be validated or built upon. This group is involved in a project that has provided a solution to these problems in social science (refs), and is now in the process of being adapted to astronomy (theastrodata.org, seamless astronomy refs). The project, which uses the Dataverse Network software as the underlying infrastructure (refs), intends to achieve two main goals, both critical in data sharing: 1) a central repository where (small) astronomy data sets can be deposited and archived for long-term access, and 2) a data citation that includes a persistent identifier which links to the data, and should be added to the references sections of any publication.
The central repository is not merely a file system in which to drop and access data files; it also provides the tools to understand the nature of the data sets and how they can be reused. It accomplishes this by allowing users to add descriptive metadata about the data set and complementary files such as documentation and code, and by extracting metadata automatically from the data file. It also provides the infrastructure to replicate the data files to multiple locations and export the metadata to make the data sets more easily discoverable by other systems. A formal data citation is the other key piece of data sharing. It provides a persistent link between the publication and the data set, so that if the location of the data set changes in the future, the persistent link can still be resolved to the same data set (ref. to Handles). It also provides attribution to the various contributors (authors and data producers or providers), properly giving credit to the authors who collected and processed the data. Finally, a formal, standardized data citation is needed to facilitate the adoption of data citation by publishers: it is critical that citations of this type become part of the references sections in publications, and are easily traceable so that their impact can be derived.

\section{Materials and Methods}

We analyze a corpus of all articles published between 1997 and 2008 in the four main astronomy journals (The Astrophysical Journal, The Astrophysical Journal Letters, The Astrophysical Journal Supplement, The Astronomical Journal) which contain external URL links in their full text. We initially find $33847$ external links in $13390$ articles. \url{http://hdl.handle.net/10904/10214} \cite{astrocite} In order to isolate potential links to datasets from this list, we perform the following filtering workflow. First, we remove links to domains that are scholarly repositories and which obviously do not host data (or which did not host data prior to 2008).
These include domains such as \url{dx.doi.org}, \url{arxiv.org}, \url{xxx.lanl.gov}, and \url{adsabs.harvard.edu}. Removing links to these domains, which are obviously pointers to articles, narrows down the corpus to $26663$ links. Second, we remove all links which are found in the reference list of an article. While it is entirely possible that authors cite datasets in the same way as they cite bibliographic references, an exploratory analysis revealed that links in the reference section of a paper were, by and large, pointers to articles, preprints, star catalogs, circulars, manuals, and user guides. Therefore, we remove these ``reference links'', bringing the corpus down to $20767$ links. Third, based on the observation that links to datasets are generally not found at the root of a website hierarchy, we remove links that contain fewer than two forward slashes (other than the two slashes found in the leading ``http://''). For example, the link to \url{http://www.sdss.org} was dropped from the corpus (0 slashes), while the link to \url{http://www.cfa.harvard.edu/COMPLETE/data_html_pages/data.html} was retained (3 slashes). This final filtering procedure reduces the corpus to $13447$ links, which we consider potential links to datasets. \cite{astrocite} Some descriptive statistics about this corpus of links are presented in Table \ref{tab1}.

\subsection{Acknowledgments}

We thank Michael Blake and Tomoko Kurahashi who helped with interviews, transcription, and coding, and with data curation, respectively. We also thank Alberto Accomazzi, Jay Luker, and the Astrophysics Data System team at the Harvard-Smithsonian Center for Astrophysics for providing access to the bibliographic data used for the exploratory data citation analysis.

\subsection{Figures}

\begin{figure}[tb]
\includegraphics[width=\columnwidth]{figures/figure1/figure1.jpg}
\caption{\textbf{Figure 1.
Volume of potential data links in astronomy publications.} Total volume of external links in all articles published between 1997 and 2008 in the four main astronomy journals, color coded by HTTP status code. Green bars represent accessible links (200), grey bars represent broken links.}
\label{fig:barplot}
\end{figure}

\begin{figure}[tb]
\includegraphics[width=\columnwidth]{figures/figure2/figure2.jpg}
\caption{\textbf{Figure 2. Percentage of broken links in astronomy publications according to type of website.} Percentages of broken external links in all articles published between 1997 and 2008 in the four main astronomy journals. Black circles represent links to personal websites (link values contain the tilde symbol, \textasciitilde), while red crosses represent links to curated archives such as governmental and institutional repositories.}
\label{fig:lines}
\end{figure}

\subsection{Tables}

\begin{table}
\caption{\textbf{Table 1. Some descriptive statistics about top domains linked in astronomy publications}. This table lists the total number of links and broken links (HTTP status codes 3xx, 4xx, and 5xx) to top domains (domains with over 100 links) found within articles published in the four main astronomy journals between 1997 and 2008.
The table also shows, for each domain, the portion of links to common filename extensions, as well as links that contain the tilde character.}
\begin{tabular}{l|cccccccc}
\hline
\textbf{Domain}&\textbf{links (broken)}&\textbf{.html}&\textbf{.txt}&\textbf{.dat}&\textbf{.gz}&\textbf{.tar}&\textbf{.fits}&\textbf{tilde}\\
\hline\hline
cxc.harvard.edu&802 (110)&336 (70)&0&0&4 (2)&5 (4)&1&0\\
heasarc.gsfc.nasa.gov&640 (33)&423 (27)&1&0&0&0&0&0\\
www.stsci.edu&498 (61)&205 (29)&3&0&0&0&0&15 (10)\\
asc.harvard.edu&471 (152)&212 (99)&0&0&0&0&0&1 (1)\\
ssc.spitzer.caltech.edu&427 (194)&125 (76)&3 (3)&0&0&0&0&0\\
cfa-www.harvard.edu&352 (68)&277 (52)&1&0&0&0&0&54 (17)\\
archive.stsci.edu&308 (58)&57 (9)&2&1 (0)&0&0&0&0\\
www.ipac.caltech.edu&285 (14)&209 (12)&0&0&0&0&0&0\\
www.atnf.csiro.au&211 (21)&12 (6)&0&0&0&0&0&7 (5)\\
space.mit.edu&193 (10)&58 (5)&1&0&0&0&0&2 (1)\\
www.astro.psu.edu&186 (4)&103 (1)&1&10&1&1&0&2\\
www.eso.org&186 (58)&54 (22)&1 (1)&0&0&0&0&4 (1)\\
irsa.ipac.caltech.edu&163 (5)&38&0&0&1&0&0&0\\
www.sdss.org&156 (2)&106 (1)&0&0&0&0&0&0\\
hea-www.harvard.edu&125 (37)&42 (17)&1&0&0&1&0&26 (16)\\
physics.nist.gov&125 (3)&63 (2)&0&0&0&0&0&0\\
www.noao.edu&120 (3)&50 (2)&0&0&0&0&0&0\\
xmm.vilspa.esa.es&118 (35)&23 (19)&0&0&8 (1)&0&0&1 (1)\\
www.astro.princeton.edu&115 (31)&43 (14)&0&0&0&0&0&53 (12)\\
ad.usno.navy.mil&110 (27)&98 (22)&3 (3)&0&0&0&0&1 (1)\\
\end{tabular}
\label{tab1}
\end{table}

\bibliography{bibliography/biblio}

\end{document}