Alyssa Goodman edited Results_exploratory_analysis.tex  over 10 years ago

Commit id: 13d26d48e776c4030eb27d7b84043ccf90cf090e


astronomers use links within articles to point to datasets and related supplemental data resources. We analyze a corpus of all articles published in the four main American astronomy journals (The Astrophysical Journal, The Astrophysical Journal Letters, The Astrophysical Journal Supplement, The Astronomical Journal) between 1997 and 2008. We find a total of $13{,}447$ potential links to datasets in a total of $7{,}641$ publications. The detailed procedure by which potential data links are selected and filtered is described in the Materials and Methods section. In the barplot of Figure \ref{fig:fig1} we show how linking practices have changed over time. Links to potential data resources in astronomy first appear in 1997, with only a couple of dozen links published in that year, and the number quickly increases each year to reach around $1{,}500$ links in 2005. After 2005, the volume of total published links stays roughly the same every year. The graph shows that, with widespread use and adoption of the Web, linking to online resources within published articles has become more and more popular. The bars in the barplot of Figure \ref{fig:fig1} also depict whether published links were still available as of December 2011: the green portion of each bar represents the volume of valid links (HTTP status code 200: OK), while the grey portion represents broken links (HTTP status codes 3xx, 4xx, and 5xx). This link categorization shows that half or more of all links published prior to 2001 were broken by 2011. The percentage of broken links decreases with time, reaching roughly 10\% in 2008: one in ten links included in astronomy papers in 2008 was unreachable three years later. This analysis can be pushed further by exploring two distinct subsets of the astronomy link corpus.
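The valid/broken categorization above can be illustrated with a minimal sketch. It assumes each link has already been reduced to a (publication year, HTTP status code) pair. The function names and the sample records are hypothetical and are not the procedure or the data used in this study, and treating a link with no response at all as broken is an added assumption.

```python
from collections import defaultdict


def is_valid(status_code):
    """A link counts as valid only if its HTTP status code is 200 (OK).

    Any other value (3xx, 4xx, 5xx, or None for no response, which is
    an assumption of this sketch) counts as broken.
    """
    return status_code == 200


def broken_fraction_by_year(records):
    """Return {year: fraction of broken links} for (year, status) pairs."""
    totals = defaultdict(int)
    broken = defaultdict(int)
    for year, status in records:
        totals[year] += 1
        if not is_valid(status):
            broken[year] += 1
    return {year: broken[year] / totals[year] for year in sorted(totals)}


# Made-up records for illustration only; not data from the corpus.
sample = [
    (1998, 404), (1998, 200),
    (2008, 200), (2008, 200), (2008, 301), (2008, 200),
]
fractions = broken_fraction_by_year(sample)
```

Under this rule a redirect (3xx) counts as broken even when the target still exists elsewhere, which follows the categorization stated in the text.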
In Figure \ref{fig:fig2} we show how the percentages of broken links differ over time for a set of $1{,}801$ links to personal websites (approximated as links that contain the tilde symbol \textasciitilde{}, which is usually reserved for personal web pages on institutional servers) and a set of $3{,}731$ links to institutional, curated archives (a manually selected list of domains that are obvious astronomy archives, such as \url{archive.stsci.edu}). Distinguishing between these two categories of links is of crucial importance. The former set of links, the ``tilde links'', are potential pointers to datasets found on personal websites. These may consist of data tables and images which are the product of data analysis and reduction procedures described in the accompanying paper. As such, they do not belong to larger curated archives, which typically host raw data only. Ideally, these datasets would be included in the full text of the article, but oftentimes they are too large to fit within the format of a published paper, so they are placed on a personal server and linked from within the paper. The latter set of 

data hosted on astronomers' personal websites, become unreachable much faster than links to curated ``institutional datasets''. These findings point to a preliminary realization: astronomers appreciate, but cannot reliably meet, the need to reference and include data materials in their published work in order to preserve its value. Since they lack a standardized mechanism to reference these resources (data citations do not normally fit in the format, structure, and scope of published journal articles), they attempt to cite datasets using simple linking from within articles. Results from this preliminary analysis prompted a