Alberto Pepe added materials_amd_methods.tex  over 11 years ago

Commit id: 5691afb9e6279b8df1f9dc4f4ba2b0b9fc43e521


\subsection{Materials and Methods}

We analyze a corpus of all articles published between 1997 and 2008 in the four main
astronomy journals (The Astrophysical Journal, The Astrophysical Journal
Letters, Astronomy \& Astrophysics, The Astronomical Journal) that
contain external links (URLs) in their full text. We initially find $33847$ external links
in $13390$ articles.

To isolate potential links to datasets from this list, we
apply the following filtering workflow. First, we remove links to
domains that are scholarly repositories and do not
host data (or did not host data prior to 2008). These include
domains such as \url{dx.doi.org}, \url{arxiv.org}, \url{xxx.lanl.gov},
and \url{adsabs.harvard.edu}. Removing links to these domains, which
point to articles rather than to data, narrows the corpus to
$26663$ links.

Second, we remove all links found in the reference list of
an article. While it is entirely possible that authors cite datasets in the
same way as they cite bibliographic references, an exploratory analysis revealed that links
in the reference section of a paper were, by and large, pointers to articles, preprints,
star catalogs, circulars, manuals, and user guides. Therefore, we
remove these ``reference links'', bringing the corpus down to $20767$
links.

Third, based on the observation that links to datasets are generally
not found at the root of a website hierarchy, we remove links that
contain fewer than two forward slashes (not counting the two slashes in
the leading ``http://''). For example, the link to
\url{http://www.sdss.org} is dropped from the corpus (0 slashes),
while the link to
\url{http://www.cfa.harvard.edu/COMPLETE/data_html_pages/data.html}
is retained (3 slashes). This final filtering step reduces the
corpus to $13447$ links, which we consider potential links to datasets.
Some descriptive statistics about this corpus of links are presented in Table~\ref{tab1}.
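
The link counts reported above imply the following number of links removed at each filtering step. These differences are obtained here by simple subtraction of the reported totals:
\begin{eqnarray*}
33847 - 26663 &=& 7184 \quad \mbox{(links to scholarly repository domains)},\\
26663 - 20767 &=& 5896 \quad \mbox{(reference links)},\\
20767 - 13447 &=& 7320 \quad \mbox{(links with fewer than two forward slashes)}.
\end{eqnarray*}
Taken together, the three steps remove $20400$ links, so the retained set amounts to $13447/33847 \approx 39.7\%$ of the links initially found.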