diff --git a/Results_exploratory_analysis.tex b/Results_exploratory_analysis.tex
index 28c6485..54a233f 100644
--- a/Results_exploratory_analysis.tex
+++ b/Results_exploratory_analysis.tex
...
\subsection{Results}
\subsubsection{Exploratory analysis of data citation practices}
To begin, we mine a corpus of astronomy articles for external web
links. By ``external web link'' we mean: any outgoing link embedded in the
final published version of an article (e.g., its PDF or HTML format)
which points to an online resource in the \url{http} (or \url{https}) URI
scheme. The purpose of this exploratory analysis is to assess whether
astronomers use links within articles to point to datasets and related
supplemental data resources.
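
As an illustration of the definition above, the following minimal sketch extracts candidate external web links from the plain text of an article. It is not the procedure used in this study (which is described in the Materials and Methods section); the regular expression, the function name \texttt{extract\_external\_links}, and the trailing-punctuation rule are assumptions made for the example.

```python
import re
from urllib.parse import urlparse

# Assumed pattern: a URI scheme, "://", then characters up to whitespace or a bracket.
URL_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*://[^\s<>\"'{}]+")


def extract_external_links(text):
    """Return the http(s) links found in text, in order of appearance."""
    links = []
    for match in URL_PATTERN.finditer(text):
        # Assumption: trailing punctuation belongs to the sentence, not the link.
        candidate = match.group(0).rstrip(".,;:)")
        # Keep only links in the http or https URI scheme.
        if urlparse(candidate).scheme.lower() in ("http", "https"):
            links.append(candidate)
    return links
```

Links in other schemes (for example \texttt{ftp}) are dropped by the scheme check, in keeping with the definition of an external web link given above.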
We analyze a corpus of all articles published in the five main
astronomy journals (The Astrophysical Journal, The Astrophysical Journal
Letters, Astronomy \& Astrophysics, The Astrophysical Journal Supplement,
and The Astronomical Journal) between 1997 and 2008. We find a total of
$13447$ potential links to datasets in $7641$ publications. The detailed
procedure by which potential data links are selected and filtered is
described in the Materials and Methods section.
In the barplot of Figure \ref{fig:barplot} we show how linking
practices have changed over
time. Links to potential data resources in astronomy first appear in
1997, with only a couple of dozen links published in that year; their
number then increases quickly every year to reach around $1500$ yearly links in
2005. After 2005, the total volume of published links stays roughly
the same every year. The graph shows that with widespread use and
adoption of the WWW, linking to online resources within published
articles has become more and more popular. The bars in the barplot of Figure
\ref{fig:barplot} also depict whether published links are still
available as of December 2011: the green portion of each bar represents
the volume of valid links (HTTP status code 200: OK), while the grey
portion of the bars represents broken links (HTTP status codes 3xx,
4xx, and 5xx). This link categorization shows that half or more of all
links published prior to 2001 are now broken. The percentage of broken
links decreases for more recent publication years, reaching roughly 10\%
in 2008: one in ten links included in astronomy papers in 2008 is
unreachable three years later.
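
The categorization above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: it assumes that each link has already been checked and is available as a pair of publication year and HTTP status code. The function names and the ``unclassified'' category (for responses outside the codes named above, or for missing responses) are assumptions.

```python
from collections import defaultdict


def classify_status(status_code):
    """Map an HTTP status code to the categories used in the barplot."""
    if status_code == 200:
        return "valid"
    if status_code is not None and 300 <= status_code <= 599:
        return "broken"
    # Assumption: anything else (e.g. no response recorded) is set aside.
    return "unclassified"


def broken_percentage_by_year(records):
    """Return {year: percentage of broken links} from (year, status) pairs."""
    counts = defaultdict(lambda: {"valid": 0, "broken": 0, "unclassified": 0})
    for year, status_code in records:
        counts[year][classify_status(status_code)] += 1
    result = {}
    for year, c in sorted(counts.items()):
        classified = c["valid"] + c["broken"]
        if classified:  # skip years with no classifiable links
            result[year] = 100.0 * c["broken"] / classified
    return result
```

Note that, following the text, redirects (3xx) count as broken in this sketch; a different study might choose to follow redirects before classifying a link.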
This analysis can be pushed further by exploring two distinct subsets
of the astronomy link corpus. In Figure \ref{fig:lines} we show how
the percentages of broken links differ over time for a set of $1801$ links to personal
websites (links which contain the tilde symbol, \textasciitilde{}, which
is usually reserved for personal web pages on institutional servers)
and a set of $3731$ links to institutional, curated archives (a manually
selected list of domains that are obvious astronomy archives, such as
\url{archive.stsci.edu}). Attempting to make a distinction between
these two categories of links is of crucial importance. The former set
of links, the ``tilde links'', are potential pointers to datasets
found on personal websites. These may consist of data tables and
images which are the product of data analysis and reduction procedures
described in the accompanying paper.
As such, they do not belong to larger curated archives, which
normally host raw data only. Ideally, these datasets would be included
in the full text of the article, but oftentimes they are too large to
fit within the format of a published paper and are included on a
personal server and linked from within the paper. The latter set of
links, the ``curated archives'' links, is instead a collection of
pointers to established archives and repositories, managed and curated
by institutions, surveys, and telescope sites. Authors may want to link to
these resources to cite and acknowledge the raw data sources that they employed in
their research. Figure \ref{fig:lines} shows that the availability of these
two categories of links follows very different, yet expected,
patterns. The vast majority of ``tilde links'' published between 1997
and 2003 are no longer available (personal links are depicted as
a black solid line and circles). Astronomers change locations, jobs, and
institutions and, as such, their personal web servers change or expire
over time. However, the percentage of broken links to personal
websites falls rapidly for more recent publication years: nearly all
``tilde links'' published in 2008
are still accessible today. A different scenario emerges when one
looks at the temporal pattern for links to curated archives
(depicted in the graph as a red line and crosses): the percentage of
broken links stays roughly the same over time (between 15\% and
20\%), indicating that curated, institutional websites are much less
vulnerable to temporal effects than personal websites.
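
The split into the two subsets discussed above can be sketched as follows. This is only an illustration under stated assumptions: the curated list here contains just the one domain named in the text (\texttt{archive.stsci.edu}), whereas the list used in the study was selected manually and is not reproduced here. Treating a percent-encoded tilde (\texttt{\%7E}) as a tilde, and checking for the tilde before the domain, are also assumptions.

```python
from urllib.parse import urlparse, unquote

# Assumption: placeholder list holding only the domain named in the text.
CURATED_ARCHIVE_DOMAINS = {"archive.stsci.edu"}


def categorize_link(url):
    """Return 'personal', 'curated', or 'other' for a link."""
    # Decode percent-escapes so that %7E is recognized as a tilde (assumption).
    if "~" in unquote(url):
        return "personal"
    host = (urlparse(url).hostname or "").lower()
    if host in CURATED_ARCHIVE_DOMAINS:
        return "curated"
    return "other"
```

Links in the ``other'' category belong to neither subset and would simply be left out of a comparison such as the one in Figure \ref{fig:lines}.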
This exploratory analysis reveals three key findings. First, since the
inception of the web in the early 1990s, astronomers have
increasingly used links in articles to cite datasets and other
resources which do not fit in the traditional referencing schemes for
bibliographic materials. Second, as for nearly every resource on the web,
availability of linked material decays with time: old links to
astronomical materials are more likely to be broken than more recent
ones. Third, links to ``personal datasets'', i.e., links to potential
data hosted on astronomers' personal websites, become unreachable much
faster than links to curated ``institutional datasets''.
These findings point to a preliminary realization: that astronomers do
have a need to reference and include data materials in their
published work. Since they lack a standardized mechanism to reference these resources ---
data citations do not normally fit in the
format, structure, and scope of published journal articles --- they
attempt to cite datasets using simple linking from within
articles. Results from this preliminary analysis prompted a
qualitative interview study, described below.