The strengths of the arXiv
Since day one, the arXiv has provided researchers with one of the easiest and most powerful ways to disseminate their research. It is a free way for authors to rapidly share findings directly with the research community, and a free way for the public to access them. The arXiv is home to some of the world's most important work, such as the proof of the Poincaré conjecture \citep{2002math.....11159P,2003math......3109P,2003math......7245P} and the discovery of the Higgs boson \cite{1207.7235,1207.7214}. For nearly two decades, this free exchange of information had no equal in most other fields; only very recently have numerous arXiv clones launched in new disciplines (see Figure \ref{104668}). The ease of use and the utility of the arXiv are a function both of the community it serves (technically advanced researchers with a long-standing tradition of sharing and collaboration) and of the simplicity of the site. Below we highlight key pieces of technology and cultural influences that contributed to the success of the arXiv. In the next section, we highlight how these same pieces may limit the adoption of new and better practices.
Typesetting with LaTeX
The vast majority of papers on the arXiv are authored in LaTeX. LaTeX allows researchers to easily typeset and share their documents. Although it was available to all researchers at the outset, it was adopted only by the community it served best, namely physicists and mathematicians, who needed to write equation-intensive documents. LaTeX was thus crucial to the early success of preprints and peer-to-peer sharing. Today it continues to be used by physicists, mathematicians, computer scientists, and others, as it offers the best solution for rendering complex mathematical notation.
A tech-savvy community
The serendipitous arrival of new technology in a community (physics) that both knew how to benefit from it and was willing to take advantage of it helped the arXiv to flourish from the very first day. Other fields, like chemistry and biomedicine, while increasingly collaborative in nature \cite{Fanelli_2016}, may have lacked the early knowledge and interest to write in LaTeX and to set up and run email and web servers, two necessary ingredients in the founding of the arXiv.
The weaknesses of the arXiv
The immediate and sustained success of the arXiv since its inception is due to its willingness to utilize new technology (LaTeX, email, web servers) in a community that is naturally tech-savvy, collaborative, and open to sharing practices. However, the arXiv has failed to improve and rethink itself over time to match the ever-changing landscape of technology and community practices in science. What is the single most important factor that has prevented the arXiv from innovating quickly? We believe it is LaTeX. The same technological advancement that allowed the arXiv to flourish is also, incredibly, its most important shortcoming. Indeed, the reliance of the arXiv on LaTeX is the source of all the weaknesses listed below.
Limitation to a single community
Most researchers outside of physics, and consequently outside of the arXiv world, write their manuscripts in Microsoft Word or other WYSIWYG editors. Based on LaTeX penetration rates in the fields where it is most popular (mathematics, statistics, physics, astronomy, computer science), the total percentage of scholarly articles written in LaTeX can be estimated at around 18\% \cite{Pepea}. Not only does LaTeX have a steep learning curve; its interface, language, and modus operandi are also foreign to anyone who does not program or who has only ever used WYSIWYG word processors. The arXiv's decision to allow upload of Microsoft Word files is only a peripheral and suboptimal solution, since the site is so intrinsically built on and around LaTeX.
A printer-centric "PDF dump"
Whether you upload a LaTeX or a Word file, the arXiv converts your content to PDF format. This is a standard procedure: in academia, manuscripts have been exchanged and read in PostScript or PDF format for decades. PDFs are an efficient, portable format for printing manuscripts. But the PDF is not a format fit for sharing, discussing, and reading on the web. PDFs are (mostly) static, two-dimensional, and non-actionable objects. It is not a stretch to say that a PDF is merely a digital photograph of a piece of paper.
Low discoverability
The research products hosted by the arXiv are PDFs. The authors provide a title, abstract, and author list as metadata upon submission; this metadata is posted alongside the PDF and rendered in HTML to aid article discoverability. While search engines are getting better at text mining PDFs, the chances that any current or future search engine will meaningfully extract and interpret text from a dense 2-column paper are low. More importantly, extracting text in this way is a futile exercise in reverse engineering. Why are we locking content in a format that is not machine-readable?
Data
Data sharing has become a fundamental practice across all scholarly disciplines. Simply put, if a published research paper is built on data, the authors have to provide access to the minimal set of resources (data and code) upon which their research is based. But sharing data in the arXiv's "LaTeX to PDF" paradigm is not possible. A pilot to support data deposit alongside papers, which was run at the arXiv from 2010 to 2013 \cite{Mayernik_2012}, failed to gain traction. While the project had to face an unexpected cut in government support, we believe that part of its failure can be attributed to the fact that the papers and the data were deposited as separate entities. How do people share data today? They use kludgy strategies. A growing trend in astronomy and physics, for example, is to link to the dataset from the published or preprinted paper. This practice allows authors to make their data more visible and get credit for it, since the data is linked inside the paper, but recent work shows that links rot quickly with time \cite{Pepe_2014}.
What the arXiv would look like if we built it today
A useful exercise when attempting to imagine the arXiv of the future is to envision what it would look like if we could rebuild it today. We would like to consider the weaknesses listed above as opportunities rather than challenges, and in doing so, we offer here some ideas for a better arXiv.