August Muench edited Results_exploratory_analysis.tex  about 11 years ago

Commit id: c491f02de8ec9ec8684577f4fce95c32556bdb07


\subsection{Results}
\subsubsection{Exploratory analysis of data citation practices}

To begin, we mine a corpus of astronomy articles for external web links. By ``external web link'' we mean any outgoing link embedded in the final published version of an article (e.g., its PDF or HTML format) that points to an online resource in the \texttt{http} or \texttt{https} URI scheme. The purpose of this exploratory analysis is to assess whether astronomers use links within articles to point to datasets and related supplemental data resources. We analyze a corpus of all articles published in the main astronomy journals (The Astrophysical Journal, The Astrophysical Journal Letters, Astronomy \& Astrophysics, The Astrophysical Journal Supplement, The Astronomical Journal) between 1997 and 2008. We find a total of $13447$ potential links to datasets in a total of $7641$ publications. The detailed procedure by which potential data links are selected and filtered is described in the Materials and Methods section.

In the barplot of Figure \ref{fig:barplot} we show how linking practices have changed over time. Links to potential data resources in astronomy first appear in 1997, with only a couple of dozen links published in that year, and increase quickly every year to reach around $1500$ links per year in 2005. After 2005, the total volume of published links stays roughly constant from year to year. The graph shows that, with the widespread adoption of the WWW, linking to online resources within published articles has become increasingly popular.

The bars in Figure \ref{fig:barplot} also depict whether published links are still available as of December 2011: the green portion of each bar represents the volume of valid links (HTTP status code 200: OK), while the grey portion represents broken links (HTTP status codes 3xx, 4xx, and 5xx). This link categorization shows that half or more of all links published prior to 2001 are now broken.
The percentage of broken links decreases with time to reach roughly 10\% in 2008: one in ten links included in astronomy papers in 2008 is unreachable three years later.

This analysis can be pushed further by exploring two distinct subsets of the astronomy link corpus. In Figure \ref{fig:lines} we show how the percentage of broken links differs over time for a set of $1801$ links to personal websites (links that contain the tilde symbol, \textasciitilde{}, which is usually reserved for personal web pages on institutional servers) and a set of $3731$ links to institutional, curated archives (a manually selected list of domains that are obvious astronomy archives, such as \url{archive.stsci.edu}). The distinction between these two categories of links is of crucial importance.

The former set of links, the ``tilde links'', are potential pointers to datasets found on personal websites. These may consist of data tables and images that are the product of the data analysis and reduction procedures described in the accompanying paper. As such, they do not belong to larger curated archives, which normally host raw data only. Ideally, these datasets would be included in the full text of the article, but they are often too large to fit within the format of a published paper, so they are placed on a personal server and linked from within the paper. The latter set of links, the ``curated archives'' links, is instead a collection of pointers to established archives and repositories, managed and curated by institutions, surveys, and telescope sites. Authors may want to link to these resources to cite and acknowledge the raw data sources that they employed in their research.

Figure \ref{fig:lines} shows that the availability of these two categories of links follows very different, yet expected, patterns. The vast majority of ``tilde links'' published between 1997 and 2003 are no longer available (personal links are depicted as a black solid line with circles).
Astronomers change locations, jobs, and institutions and, as a result, their personal web servers change or expire over time. However, the percentage of broken links to personal websites falls rapidly for more recent publications: nearly all ``tilde links'' published in 2008 are still accessible today. A different scenario emerges when one looks at the temporal pattern for links to curated archives (depicted in the graph as a red line with crosses): the percentage of broken links stays roughly the same over time (between 15\% and 20\%), indicating that curated, institutional websites are much less vulnerable to temporal effects than personal websites.

This exploratory analysis reveals three key findings. First, since the inception of the web in the early 1990s, astronomers have increasingly used links in articles to cite datasets and other resources that do not fit the traditional referencing schemes for bibliographic materials. Second, as for nearly every resource on the web, the availability of linked material decays with time: old links to astronomical materials are more likely to be broken than more recent ones. Third, links to ``personal datasets'', i.e., links to potential data hosted on astronomers' personal websites, become unreachable much faster than links to curated ``institutional datasets''.

These findings point to a preliminary realization: astronomers do have a need to reference and include data materials in their published work. Since they lack a standardized mechanism to reference these resources (data citations do not normally fit the format, structure, and scope of published journal articles), they attempt to cite datasets using simple links from within articles. Results from this preliminary analysis prompted a qualitative interview study, described below.
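
The link categorization used in this section can be illustrated with a minimal sketch. It is not the pipeline used for the analysis, which is described in the Materials and Methods section. The function names, the one-entry domain list, the sample records, and the rule that a tilde takes precedence over a curated domain are all assumptions made for illustration. The sketch treats HTTP status code 200 as valid and any other status as broken, labels a link as personal if its URL contains a tilde, and computes the percentage of broken links per category and publication year.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical stand-in for the manually selected list of curated archive
# domains; only the example named in the text is included here.
CURATED_DOMAINS = {"archive.stsci.edu"}


def is_valid(status_code):
    """A link is valid only for HTTP 200; 3xx, 4xx and 5xx count as broken."""
    return status_code == 200


def link_category(url):
    """Label a link as 'personal', 'curated', or 'other'.

    Assumption: a tilde anywhere in the URL takes precedence over the
    curated-domain check.
    """
    if "~" in url:
        return "personal"
    host = urlparse(url).hostname or ""
    if host in CURATED_DOMAINS:
        return "curated"
    return "other"


def broken_percentage_by_year(records):
    """Return {(category, year): percentage of broken links}.

    `records` is an iterable of (year, url, status_code) tuples.
    """
    totals = defaultdict(int)
    broken = defaultdict(int)
    for year, url, status_code in records:
        key = (link_category(url), year)
        totals[key] += 1
        if not is_valid(status_code):
            broken[key] += 1
    return {key: 100.0 * broken[key] / totals[key] for key in totals}


# Made-up records for illustration only; these are not data from the corpus.
sample_records = [
    (1999, "http://www.example.edu/~user/data.html", 404),
    (1999, "http://www.example.edu/~other/table.txt", 200),
    (2008, "http://archive.stsci.edu/hst/", 200),
    (2008, "http://archive.stsci.edu/iue/", 301),
]
percentages = broken_percentage_by_year(sample_records)
```

Keeping the status check separate from the category check mirrors the two steps described above: availability is determined by the HTTP response alone, while the personal versus curated distinction depends only on the form of the URL.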