Average Paper Size

Finally, we come to the practical motivation for this set of measurements. The goal was to understand the overall distribution of paper sizes, in order to design a processing framework that won’t run into silly buffer overflows.

We have already seen that paper sizes tend to grow super-linearly over time, so we start with this caveat in mind.

Collecting the disk size of each paper directory, we see a rough peak frequency at \(0\) MB. The average size is \(\approx 0.2\) MB, with sizes ranging from a minimum of several KB to a maximum of \(998\) MB, as of the end of May 2015. Here are examples of the smallest and the biggest arXiv papers.
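For readers who want to repeat this kind of measurement, here is a minimal sketch of how such numbers could be collected. It is not the script used for this post. It assumes a layout with one subdirectory per paper under a common root, and the names `directory_size`, `paper_sizes`, `summarize`, and the 0.1 MB histogram bin width are all illustrative choices of mine.

```python
import os
import statistics
from collections import Counter

MB = 1024 * 1024  # bytes per megabyte (binary convention, an assumption)


def directory_size(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks so linked files are not counted twice.
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def paper_sizes(root):
    """Map each immediate subdirectory of root (one per paper) to its size in bytes."""
    sizes = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes[entry.name] = directory_size(entry.path)
    return sizes


def summarize(sizes_bytes, bin_mb=0.1):
    """Return min, max, mean and the most populated histogram bin, in MB."""
    values = list(sizes_bytes)
    bin_bytes = int(bin_mb * MB)
    # Count papers per fixed-width bin; the fullest bin is the rough peak.
    bins = Counter(v // bin_bytes for v in values)
    peak_bin, _count = bins.most_common(1)[0]
    return {
        "count": len(values),
        "min_mb": min(values) / MB,
        "max_mb": max(values) / MB,
        "mean_mb": statistics.mean(values) / MB,
        "peak_bin_start_mb": peak_bin * bin_mb,
    }
```

Calling `summarize(paper_sizes(root).values())` on a root directory of your choice gives the summary for that collection. Summing file sizes measures apparent size rather than allocated disk blocks, so the result may differ slightly from what a tool such as `du` reports.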

Feel free to play with the detailed “paper size” dataset in the interactive figure below. Note that, in contrast to the previous two figures, the data is not in historical order, but rather presents a global aggregate over all arXiv papers. We could also plot the average size by month, where I suspect we would observe a resemblance to Moore’s law.