The Golden Road to Open Science

If you are a scholar and haven't spent the last ten years in a vacuum, you have heard of Open Access: the emerging practice of providing unrestricted access to peer-reviewed scholarly works, such as journal articles, conference papers, and book chapters. Open Access comes in two flavors, green and gold, depending on how an article is made available to the public. The "green road" to Open Access involves an author making her article publicly available after publication, e.g., by depositing the article's post-print in an open institutional repository. According to many, however, the preferred avenue to Open Access is the "golden road," in which an author publishes an article directly in an Open Access (OA) journal. That Open Access, regardless of its flavor, has innumerable benefits for researchers and the public at large is beyond discussion; even the most traditional scholarly publishers would have to agree. Importantly, the vision of universal Open Access to scholarly knowledge, i.e., the idea that the entire body of published scholarship should be made available to everyone free of charge, is not too far-fetched. In practice, through a combination of green and golden OA practices, this vision is already a reality in some scientific fields, such as physics and astronomy.

So: Open Access is both fundamentally necessary and bound to happen. But whether Open Access alone can guarantee the reproducibility and transparency of research results is a different and compelling question. Do research articles contain enough information to exactly (or even approximately) replicate a scientific study? Unfortunately, the answer is very often no. As science, and scholarship in general, inevitably becomes more computational in nature, the experiments, calculations, and analyses performed by researchers are too many and too complex to be described in detail in a research article. As a result, the minutiae of research activity are often hidden from view, making science unintelligible and irreproducible, not only for the public at large, but also for scientists and experts. Paradoxically, this includes the very scientists who conducted the research in the first place, who may not have documented their exact workflows elsewhere. A movement parallel to Open Access, Open Science, is gaining momentum in scholarly circles. Its mission is to provide open, universal access to the full sources of scientific research.

The problem at hand is that the type of science we conduct today does not fit in the format and scope of the scholarly article. The code to assemble and statistically analyze a dataset, the workflows employed to visualize that dataset as a plot, and the dataset itself are three examples of research materials that cannot realistically fit in an article as we know it, both for their size and for their scope. To overcome this crucial problem, libraries, governments, and funding bodies are starting to require that data and other ancillary materials be distributed alongside papers, so that the entire lifecycle of research can be reconstructed. By analogy with the flavors of Open Access, this strategy of providing access to scientific sources after the publication of a scholarly article can be thought of as the "green road" to Open Science.

The green road to Open Science, however advantageous, is not without its shortcomings. There are at least two reasons that make green-flavored Open Science tortuous and impractical. The first has to do with curation. Depositing research materials alongside publications often means that full text and data will live in different repositories, hosted by different bodies, under different regulations and practices. Making sense of and curating the conceptual links that exist among papers and research materials is already difficult today. Was this plot generated using dataset one or dataset two? Was the data analyzed using the first or the second version of the code? Answers to these questions may be impossible to obtain today, let alone in a few decades. The second problem has to do with incentives. Those familiar with the recent NSF Data Management Plans, which mandate publication of data sources alongside published papers resulting from NSF-funded research, know very well that complying with the mandate has been a big headache. Once a paper is published, authors have very little incentive to publish the data and the full sources of their research. Asking authors to deposit data after they have authored its host article, the green road to Open Science, is a partial and unsustainable solution.