A Novel Machine Learning Based Approach for Retrieving Information from Receipt Images

Abstract

In this paper we approach, from a machine learning perspective, the problem of performing optical character recognition on receipt images and then extracting structured information from the obtained text. Tools that have not been trained specifically for this kind of image usually handle it poorly, because receipts use custom fonts and, due to size constraints, many letters are printed close to each other. We adapt existing OCR methods in order to achieve better performance than off-the-shelf commercial OCR engines and to extract information from receipts as accurately as possible. Document layout analysis is performed on the receipts, then lines are segmented into characters using Random Forests, and finally the characters are classified using linear Support Vector Machines. We provide an experimental evaluation of the proposed approach, as well as an analysis of the obtained results.

Introduction

Optical character recognition (OCR) is the ability of a computer to extract textual information from image data, such as pixels (Schantz 1982). This is useful because, once the data available on paper has been entered into a computer, it can be processed, indexed and searched much faster.

OCR is a difficult problem even on scanned pages that are flat and free of creases: letters must first be located on the page and distinguished from tables, figures and other objects that might be present, and many kinds of fonts have to be recognized. OCR on photographed documents is even more difficult, because the illumination can vary, the document might be curved, the perspective might be skewed, and so on.

Information extraction is the process of taking raw, unstructured text and outputting information that is parsed and structured. This makes it easier to search and store the necessary data, because parts of the text that contain nothing useful can be discarded, while the rest is processed, brought to a standard form and stored in a database.
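As an illustration of this idea only, the following minimal Python sketch turns raw receipt text into structured records. It is not the extraction method used in this paper: the line format (an item name followed by a price with two decimals), the `TOTAL` keyword, and the names `Item`, `Receipt` and `parse_receipt` are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional

# Assumed (hypothetical) line format: free text, whitespace, then a price
# with exactly two decimals at the end of the line, e.g. "Milk 1l   4,50".
PRICE_LINE = re.compile(r"^(?P<name>.+?)\s+(?P<price>\d+[.,]\d{2})$")


@dataclass
class Item:
    name: str
    price: Decimal


@dataclass
class Receipt:
    items: List[Item]
    total: Optional[Decimal]


def parse_receipt(raw_text: str) -> Receipt:
    """Keep only the lines that carry a price and normalize them."""
    items: List[Item] = []
    total: Optional[Decimal] = None
    for line in raw_text.splitlines():
        match = PRICE_LINE.match(line.strip())
        if match is None:
            continue  # discard lines that contain nothing useful
        # Standard form: collapsed whitespace, upper case, dot as decimal mark.
        name = " ".join(match.group("name").split()).upper()
        price = Decimal(match.group("price").replace(",", "."))
        if name.startswith("TOTAL"):
            total = price
        else:
            items.append(Item(name, price))
    return Receipt(items, total)
```

Calling `parse_receipt` on the text of a receipt returns a `Receipt` whose `items` could then be stored as database rows, while header and footer lines without a price are dropped. Real receipts vary far more than this assumed format allows, so the regular expression should be read as a placeholder.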

This paper investigates the use of Random Forests (Breiman 2001) and Support Vector Machines (SVM) (Cortes 1995) in developing an OCR engine that is tailored for receipts. By taking into consideration the constraints imposed by receipts on document layout, text font and spacing, we can improve performance compared to more general OCR engines. The introduced approach is novel since, to our knowledge, other papers on extracting information from receipts focus either on improving the image preprocessing step or on parsing text that was extracted using an off-the-shelf OCR engine.

The rest of the paper is structured as follows.

Section \ref{sec:statement} describes the problem statement and the motivation behind developing such an OCR engine. Section \ref{sec:lit_rev} presents some of the related work in the field of character recognition using machine learning approaches. In Section \ref{sec:method} we present the approach used in developing our OCR engine. The experiments performed to evaluate our approach are described in Section \ref{sec:exp}. Section \ref{sec:disc} contains an analysis of our approach and a comparison to other OCR engines. Finally, we provide conclusions and pointers towards future work.