Understanding a dataset is key to effectively operating on it. While pure NLP applications would exclude all non-textual modalities and see much more manageable sizes, applications such as LaTeXML, which also operate on all supplementary data, require the full directory of an arXiv paper. We now understand that round-tripping the full inputs will be easy in the vast majority of cases, as our average job is 200KB in size. Nevertheless, there will be a low number of high-volume jobs, with a maximum payload of just under 1GB. Understanding the LaTeXML process, we can also anticipate a multiplier of 4-5 on the return size of the textual content. However, it is safe to assume that the average paper is dominated in size by its supplementary non-textual content, or $\sim 1$MB.

Ideas that already come to mind are:
\begin{itemize}
\item detecting and streaming large payloads (a rough sketch follows at the end of this post),
\item resizing large images to an acceptable quality, both on input and output,
\item preparing for large swings in the distributed load, due to the large variance in paper size,
\item expecting two distinct types of extreme conditions: very large throughput when processing a majority of small-size jobs, as well as the occasional multi-GB buffer widths.
\end{itemize}

Exciting work ahead! The script that generated the data behind the figures in this post can be found in the freshly started \href{https://github.com/dginev/rust-cortex}{Rust port of the CorTeX framework}, which can now benefit from a better understanding of the TeX papers in the arXiv.org dataset.
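As a closing thought experiment, here is a minimal Rust sketch of how a size-based dispatch between buffered and streamed payloads might look, using the numbers discussed above (a 200KB average job, payloads of up to roughly 1GB, and a 4-5x growth of the textual content). The 64MB cutoff and all of the names (\texttt{Transport}, \texttt{pick\_transport}, \texttt{OUTPUT\_MULTIPLIER}) are illustrative assumptions for this post, not part of the CorTeX codebase.

\begin{verbatim}
/// How a job's payload should be handled by a worker.
#[derive(Debug, PartialEq)]
enum Transport {
    /// Small enough to buffer fully in memory.
    InMemory,
    /// Large enough that it should be streamed from/to disk.
    Streamed,
}

/// Assumed upper bound on textual growth during conversion (4-5x observed).
const OUTPUT_MULTIPLIER: u64 = 5;
/// Hypothetical cutoff: buffer any job whose estimated round-trip
/// stays under 64MB; stream everything else.
const STREAMING_THRESHOLD_BYTES: u64 = 64 * 1024 * 1024;

/// Estimate the worst-case round-trip size of a job: the original payload
/// plus the converted textual content, assumed to grow by OUTPUT_MULTIPLIER.
fn estimated_round_trip(input_bytes: u64, textual_bytes: u64) -> u64 {
    input_bytes + textual_bytes.saturating_mul(OUTPUT_MULTIPLIER)
}

/// Decide whether a job can be buffered in memory or should be streamed.
fn pick_transport(input_bytes: u64, textual_bytes: u64) -> Transport {
    if estimated_round_trip(input_bytes, textual_bytes) < STREAMING_THRESHOLD_BYTES {
        Transport::InMemory
    } else {
        Transport::Streamed
    }
}

fn main() {
    // The average job from this post: ~200KB total, assuming half is TeX source.
    let avg = pick_transport(200 * 1024, 100 * 1024);
    // An extreme job: close to 1GB of supplementary data with a few MB of TeX.
    let big = pick_transport(900 * 1024 * 1024, 3 * 1024 * 1024);
    println!("average job: {:?}, extreme job: {:?}", avg, big);
    assert_eq!(avg, Transport::InMemory);
    assert_eq!(big, Transport::Streamed);
}
\end{verbatim}

Using the upper end of the 4-5x multiplier keeps the estimate conservative, so a job is only buffered when even its expanded result comfortably fits in memory; everything else falls back to streaming.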