Alberto Pepe


INTRODUCTION

In the early 1600s, Galileo Galilei turned a telescope toward Jupiter. In his log book each night, he drew to-scale schematic diagrams of Jupiter and some oddly-moving points of light near it. Galileo labeled each drawing with the date. Eventually he used his observations to conclude that the Earth orbits the Sun, just as the four Galilean moons orbit Jupiter. History shows Galileo to be much more than an astronomical hero, though. His clear and careful record keeping and publication style not only let Galileo understand the Solar System; they continue to let _anyone_ understand _how_ Galileo did it. Galileo's notes directly integrated his DATA (drawings of Jupiter and its moons), key METADATA (timing of each observation, weather, telescope properties), and TEXT (descriptions of methods, analysis, and conclusions). Critically, when Galileo included the information from those notes in _Sidereus Nuncius_, this integration of text, data, and metadata was preserved, as shown in Figure 1. Galileo's work advanced the "Scientific Revolution," and his approach to observation and analysis contributed significantly to the shaping of today's "Scientific Method."

Today most research projects are considered complete when a journal article based on the analysis has been written and published. The trouble is that, unlike Galileo's report in _Sidereus Nuncius_, the amount of real data and data description in modern publications is almost never sufficient to repeat or even statistically verify a study being presented. Worse, researchers wishing to build upon and extend work presented in the literature often have trouble recovering data associated with an article after it has been published. More often than scientists would like to admit, they cannot even recover the data associated with their own published works. Complicating the modern situation, the words "data" and "analysis" have a wider variety of definitions today than at the time of Galileo.
Theoretical investigations can create large "data" sets through simulations (e.g. The Millennium Simulation Project). Large scale data collection often takes place as a community-wide effort (e.g. The Human Genome Project), which leads to gigantic online "databases" (organized collections of data). Computers are so essential in simulations, and in the processing of experimental and observational data, that it is often hard to draw a dividing line between "data" and "analysis" (or "code") when discussing the care and feeding of "data." Sometimes, a copy of the code used to create or process data is so essential to the use of those data that the code should almost be thought of as part of the "metadata" description of the data. Other times, the code used in a scientific study is more separable from the data, but even then, many preservation and sharing principles apply to code just as well as they do to data.

So how do we go about caring for and feeding data? Extra work, no doubt, is associated with nurturing your data, but care up front will save time and increase insight later. Even though a growing number of researchers, especially in large collaborations, know that conducting research with sharing and reuse in mind is essential, it still requires a paradigm shift. Most people are still motivated by piling up publications and by getting to the next one as soon as possible. But the more we scientists find ourselves wishing we had access to extant but now unfindable data, the more we will realize why bad data management is bad for science. How can we improve?

THIS ARTICLE OFFERS A SHORT GUIDE TO THE STEPS SCIENTISTS CAN TAKE TO ENSURE THAT THEIR DATA AND ASSOCIATED ANALYSES CONTINUE TO BE OF VALUE AND TO BE RECOGNIZED.
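Keeping data and metadata together, as Galileo did, can be put into practice with a simple convention: every data file travels with a machine-readable "sidecar" description of who collected it, when, and how. The sketch below illustrates the idea in Python; the file names, field names, and example observations are purely illustrative, not part of any standard.

```python
import csv
import json
from datetime import datetime, timezone

# Hypothetical observations echoing Galileo's log: date, target, free-text note.
observations = [
    ("1610-01-07", "Jupiter", "three fixed stars near the planet"),
    ("1610-01-08", "Jupiter", "all three now west of Jupiter"),
]

# Write the data itself.
with open("observations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "target", "note"])
    writer.writerows(observations)

# Write a metadata "sidecar" alongside it, describing provenance and columns.
metadata = {
    "creator": "G. Galilei",
    "instrument": "20x refracting telescope",
    "created": datetime.now(timezone.utc).isoformat(),
    "columns": {
        "date": "local date of observation",
        "target": "object observed",
        "note": "free-text description of the field",
    },
}
with open("observations.csv.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

Because the sidecar is plain JSON next to the data file, anyone (or any machine) who finds `observations.csv` later can discover how it was made without consulting the original paper.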
In just the past few years, hundreds of scholarly papers and reports have been written on questions of data sharing, data provenance, research reproducibility, licensing, attribution, privacy, and more -- but our goal here is _not_ to review that literature. Instead, we present a short guide intended for researchers who want to know why it is important to "care for and feed" data, with some practical advice on how to do that. The Appendices at the close of this work offer links to the types of services referred to throughout the text. BOLDFACE LETTERING below highlights actions one can take to follow the suggested rules.

Josh Nicholson


Research is really f**king important. This statement is almost self-evident from the fact that you're reading this online. From research has come the web, life-saving vaccines, pasteurization, and countless other advancements. In other words, you can look at cat gifs all day because of research, you're alive because of research, and you can safely add milk to your coffee or tea without contracting some disease, because of research. But how research is done today is being stymied by how it is communicated. Most research is locked behind expensive paywalls \cite{Bj_rk_2010}, is not communicated to the public or scientific community until months or years after the experiments are done \cite{trickydoi}, is biased in how it is reported -- only "positive" results are typically published \cite{Ahmed_2012} -- does not supply the data underlying major studies \cite{Alsheikh_Ali_2011}, and has been found to be irreproducible at alarming rates \cite{Begley_2012}.

Why is science communication so broken? Many would blame old, profit-hungry publishers like Elsevier, and in many respects that blame is deserved. However, here's a different hypothesis: what is holding us back from a real shift in the research communication industry is not Elsevier; it's Microsoft Word. Yes, Word, the same application that introduced us to Clippy, is the real impediment to effective communication in research.

Today, researchers are judged by their publications, both in terms of quantity and prestige. Accordingly, researchers write up their documents and send them to the most prestigious journals they think they can publish in. The journals, owned by large multinational corporations, charge researchers to publish their work and then charge institutions again to subscribe to the content.
Such subscriptions can run into the many millions of dollars per year per institution \cite{Lawson_2015}, with individual access costing $30-60 per article. The system and process for publishing and disseminating research are inimical to scientific advancement, and accordingly the Open Access and Open Science movements have made big strides toward improving how research is disseminated. Recently, Germany, Peru, and Taiwan have boycotted subscriptions to Elsevier \cite{Schiermeier_2016}, and an ongoing pledge not to publish in or review for certain publishers has accumulated the signatures of 16,493 researchers and counting. New developments such as Sci-hub have helped to make research accessible, albeit illegally. While regarded as a victory by many, the Sci-hub approach is not the solution that researchers are hoping for, as it is built on an illegal system of exchanging copyrighted content and bypassing publisher paywalls \cite{Priego}. An interesting technologist's view of the matter is that the real culprit for keeping science closed isn't actually the oligopoly of publishers \cite{Larivi_re_2015} -- after all, they're for-profit companies trying to run businesses, and they're entitled to do any legal thing that helps them deliver value to shareholders. We suggest that a concrete solution for true open access is already out there, and it's 100% legal.

What is the best solution to truly and legally open access to research? The solution is publishing preprints -- the last version of a paper that belongs to the author before it is submitted to a journal for peer review. Unlike in other industries (e.g. literature, music, film), in research the copyright of the preprint version is legally held by the author, even after publication of the work in a journal.

Preprints are rapidly gaining adoption in the scientific community, with a couple of preprint servers (e.g.
arXiv, which is run by Cornell University and is primarily for physics papers, and bioRxiv, which is similarly for biology papers) receiving thousands of preprints per month.

Some of the multinational publishers are responding by threatening authors who publish (or post) preprints. However, they are being met with fierce opposition from the scientific community, and the tide seems to be turning. Multinationals are now under immense pressure not just from authors in the scientific community, but increasingly from the sources of public and private funding for the actual research. Some organizations are even mandating preprints as a condition of funding. But what is holding back preprints and, more generally, a better way for authors to have more control of their research? We think the inability of scientists to independently produce and disseminate their work is a major impediment, and at the heart of that problem is how scientists write.

How can Microsoft Word harm scientific communication? Whereas other industries, like the music industry, have been radically transformed and accelerated by powerful tools for creators, like YouTube, there is no parallel in research. Researchers are reliant upon publishers to get their ideas out, and because of this they are forced into an antiquated system that has remained largely stagnant since its inception over 350 years ago. Whereas a minority of researchers in math-heavy disciplines write using typesetting formats like LaTeX, the large majority of researchers (~82%) write their documents in Microsoft Word \cite{brischoux2009don}. Word is easy to use for basic editing but is essentially incompatible with online publishing. Word was created for the personal computer: offline, single-author use. Also, it was not built with scientific research in mind; as such, it lacks support for complex objects like tables, math, data, and code.
All in all, Word is extraordinarily feature-poor compared to what we can accomplish today with an online collaborative platform. Because publishers have traditionally accepted manuscripts formatted in Word, and because they consistently fail to truly innovate from a technological standpoint, millions of researchers find themselves using Word. In turn, the research they publish is non-discoverable on the web, data-less, non-actionable, not reusable and, most likely, behind a paywall.

What does the scientific communication ecosystem of the future look like? What is needed is a web-first solution. Research articles should be available on distinct web pages, Wikipedia style. Real data should live underneath the tables and figures. Research needs to finally be machine readable (instead of just tagged with keywords) so that it may be found and processed by search engines and machines. Modern research also deserves rich media enhancement -- visualizations, videos, and other forms of rich data in the document itself.

All told, researchers need to be able to disseminate their ideas in a web-first world, while playing the "journal game" as long as it exists. Our particular dream (www.authorea.com) is to construct a democratic platform for scientific research -- a vast organizational space for scientists to read and contribute cutting-edge science. There is a new class of startups doing similar things with the research cycle, and we feel there is a real and urgent demand for such solutions in research right now.
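One existing way to make an article machine readable is to embed structured metadata that search engines already parse, for example schema.org's ScholarlyArticle vocabulary serialized as JSON-LD. The Python sketch below builds such a record; the title, author, and date values are invented purely for illustration.

```python
import json

# Hypothetical article metadata expressed with schema.org terms.
# Embedded in a page's <head> inside a <script type="application/ld+json">
# tag, this lets crawlers identify the work, its authors, and its topic
# without scraping the prose.
article = {
    "@context": "https://schema.org",
    "@type": "ScholarlyArticle",
    "headline": "A Web-First Research Article",
    "author": [{"@type": "Person", "name": "J. Nicholson"}],
    "datePublished": "2016-01-01",
    "keywords": ["open access", "preprints", "scientific publishing"],
}

json_ld = json.dumps(article, indent=2)
print(json_ld)
```

A Word file carries none of this structure; a web-native article can carry it for every figure, table, and dataset as well as for the article itself.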