The Golden Road to Open Science

    If you are a scholar and haven't spent the last ten years in a vacuum, you have heard of Open Access: the emerging practice of providing unrestricted access to peer-reviewed scholarly works, such as journal articles, conference papers, and book chapters. Open Access comes in two flavors, green and gold, depending on how an article is made available to the public. The "green road" to Open Access involves an author making her article publicly available after publication, e.g., by depositing the article's post-print in an open institutional repository. According to many, however, the preferred avenue to Open Access is the "golden road", in which an author publishes an article directly in an Open Access (OA) journal. That Open Access, regardless of its flavor, has innumerable benefits for researchers and the public at large is beyond discussion --- even the most traditional scholarly publishers would have to agree. Importantly, the vision of universal Open Access to scholarly knowledge, i.e., the idea that the entire body of published scholarship should be made available to everyone free of charge, is not too far-fetched. In practice, through a combination of green and golden OA practices, this vision is already a reality in some scientific fields, such as physics and astronomy.

    So: Open Access is both fundamentally necessary and bound to happen. But whether Open Access alone can guarantee the reproducibility and transparency of research results is a different and compelling question. Do research articles contain enough information to exactly (or even approximately) replicate a scientific study? Unfortunately, very often the answer is no. As science, and scholarship in general, inevitably become more computational in nature, the experiments, calculations, and analyses performed by researchers are too many and too complex to be described in detail in a research article. As such, the minutiae of research activity are often hidden from view, making science unintelligible and irreproducible, not only for the public at large, but also for other scientists and experts and, paradoxically, even for the very scientists who conducted the research in the first place, who may not have documented their exact workflows elsewhere. A parallel movement to Open Access --- Open Science --- is building momentum in scholarly circles. Its mission is to provide open, universal access to the full sources of scientific research.

    The problem at hand is that the type of science we conduct today does not fit the format and scope of the scholarly article. The code to assemble and statistically analyze a dataset, the workflows employed to visualize that dataset as a plot, and the dataset itself are three examples of research materials that cannot realistically fit in an article as we know it, both for their size and for their scope. To overcome this crucial problem, libraries, governments, and funding bodies are starting to require data and other ancillary materials to be distributed alongside papers so that the entire lifecycle of research can be reconstructed. By analogy with the flavors of Open Access, this strategy --- of providing access to scientific sources after the publication of a scholarly article --- can be thought of as the "green road" to Open Science.

    The green road to Open Science, however advantageous, is not without its shortcomings. At least two problems make green-flavored Open Science tortuous and impractical. The first has to do with curation. Depositing research materials alongside publications often means that full text and data will live in different repositories, hosted by different bodies, under different regulations and practices. Making sense of and curating the conceptual links that exist among papers and research materials is already difficult today. Was this plot generated using dataset one or dataset two? Was the data analyzed using the first or the second version of the code? Answers to these questions may be impossible to obtain today, let alone in a few decades. The second problem has to do with incentives. Those familiar with the recent NSF Data Management Plans --- which mandate publication of data sources alongside published papers resulting from NSF-funded research --- know very well that complying with the mandate was a big headache. Once a paper is published, authors have very little incentive to publish the data and full sources of their research. Asking authors to deposit data after having authored its host article --- the green road to Open Science --- is a partial and unsustainable solution.

    We argue that the only route to universal, intelligible access to the minutiae of scholarship is the golden road: research materials need to be published as they are generated, while an article is being written. In other words, the sources of scientific research have to become integral components of research articles. Datasets, code, and workflows should not be thought of, or deposited, as entities separate from articles. Rather, they should be considered for what they really are: not ancillary materials, but the very foundations upon which a research article is built.

    The true problem at hand, then, lies in the format of the scholarly article. The articles we exchange today, even when in digital format (e.g., PDFs), are essentially image renderings (yes, photographs) of physical print papers. So, while most research materials are born digital (e.g., the code to produce a plot is dynamic and editable, and it is "executed" to generate a figure), the scholarly papers in which they are published are essentially static and analog. Traditional scholarly articles fail to communicate modern science, for they only describe the surface of research. Strikingly, scientists produce 21st-century research, written up with 20th-century tools, packaged in a 17th-century format.

    One way to improve the function and format of articles is to allow them to be rich, dynamic, multimedia objects: articles that can be "executed", in the same way code is executed to produce a plot. A reader perusing such an article would be able to interact with it and build upon it. For example, in addition to being able to download a statistical plot in the paper as an image, she would also be able to download its underlying data and the workflows associated with the image, such as the R commands run to analyze the data and the Python matplotlib scripts used to generate the plot. In other words, she would be able to "execute" these materials, such as data and code, and generate the plot, thus reproducing the original research work of the authors. In addition to reproducing research, she should also be able to extend it, e.g., by using the data and code behind a plot in her own way, giving birth to her own strand of research. In version control systems for source code, this function of borrowing someone's work (while saving its provenance, thus giving credit to previous creators) is called "forking". Wouldn't science benefit from a similar forking feature for images, tables, equations, and code alike? If we think so, if we truly believe in the vision of Open Science, then it is about time that we, scientists and scholars alike, undertake the golden road and begin writing articles in a new way.
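
    To make the idea of an "executable", forkable figure a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a description of any existing platform: the `ExecutableFigure` class, its fields, and the `fingerprint` helper are all hypothetical names invented for this example, and the "analysis" is a toy mean rather than a real plotting script. The point is simply that when a figure travels with its data and code, a reader can re-run it, and a fork can record exactly which data and which version of the code it came from.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


def fingerprint(payload) -> str:
    """Return a short, stable SHA-256 digest of a JSON-serializable payload."""
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


@dataclass
class ExecutableFigure:
    """Hypothetical bundle: a figure that carries its own data and code."""
    title: str
    data: list                      # raw observations behind the figure
    code: str                       # Python source that defines analyze(data)
    parent: Optional[str] = None    # fingerprint of the figure this was forked from

    def identity(self) -> str:
        # Pins down which data and which code version produced this figure.
        return fingerprint({"data": self.data, "code": self.code, "parent": self.parent})

    def execute(self):
        # Re-run the stored analysis on the stored data to regenerate the result.
        # NOTE: exec runs arbitrary code; a real system would need sandboxing.
        namespace = {}
        exec(self.code, namespace)
        return namespace["analyze"](self.data)

    def fork(self, title, data=None, code=None):
        # Derive a new figure; the parent fingerprint preserves provenance.
        return ExecutableFigure(
            title=title,
            data=list(self.data) if data is None else data,
            code=self.code if code is None else code,
            parent=self.identity(),
        )


# The original authors publish a figure together with its sources.
original = ExecutableFigure(
    title="Mean of measurements",
    data=[2.0, 4.0, 6.0],
    code="def analyze(data):\n    return sum(data) / len(data)\n",
)

# A reader forks it, reusing the data but changing the analysis.
derived = original.fork(
    title="Maximum of measurements",
    code="def analyze(data):\n    return max(data)\n",
)

print(original.execute())                      # 4.0
print(derived.execute())                       # 6.0
print(derived.parent == original.identity())   # True
```

    In this sketch, questions such as "which dataset and which version of the code produced this plot?" are answered by the fingerprint itself, and a fork gives credit to its source by construction. A real implementation would of course have to deal with large datasets, software environments, and safe execution, none of which are addressed here.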