A Novel Machine Learning Based Approach for Retrieving Information from Receipt Images

loading page

Roland Szabo

Abstract

In this paper we are approaching, from a machine learning perspective, the problem of performing optical character recognition on receipt images and then extracting structured information from the obtained text. Tools that have not been trained specifically for this kind of images do not handle them well usually, because receipts have custom fonts and, due to size constraints, many letters are close to each other. In this paper we adapt existing methods for doing OCR, in order to achieve better performance than off-the-shelf commercial OCR engines and to be able to extract the most accurate information from receipts. Document layout analysis is performed on the receipts, then lines are segmented into characters using Random Forests and finally they are classified using Linear Support Vector Machines. We provide an experimental evaluation of the proposed approach, as well as an analysis of the obtained results.