A Novel Machine Learning Based Approach for Retrieving Information from Receipt Images


This paper approaches, from a machine learning perspective, the problem of performing optical character recognition (OCR) on receipt images and then extracting structured information from the resulting text. Tools that have not been trained specifically for this kind of image usually handle it poorly, because receipts use custom fonts and, due to size constraints, many letters are very close to each other. We adapt existing OCR methods in order to achieve better performance than off-the-shelf commercial OCR engines and to extract more accurate information from receipts. Document layout analysis is performed on the receipts, then lines are segmented into characters using Random Forests, and finally the characters are classified using Linear Support Vector Machines. We provide an experimental evaluation of the proposed approach, as well as an analysis of the obtained results.


Optical character recognition (OCR) is the ability of a computer to extract textual information from image data, that is, from pixels (Schantz 1982). This is useful because, once the data available on paper has been digitized, a computer can process, index and search it much faster.

OCR is a difficult problem even on scanned pages that are flat and free of creases: letters must first be identified on the page and distinguished from tables, figures and other objects that might be there, and many kinds of fonts have to be recognized. OCR on photographed documents is even more difficult, because the illumination can vary, the document might be curved, the perspective might be skewed, and so on.

Information extraction is the process of taking raw, unstructured text and outputting information that is parsed and structured. This makes it easier to search and store the necessary data, because parts of the text that contain nothing useful can be discarded, while the rest is processed, brought to a standard form and stored in a database.
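As a rough illustration of what such an extraction step produces, the sketch below turns OCR output into a small structured record. It is not the parser used in this paper: the field names, the regular expressions, the sample receipt text and the day-first date order are all assumptions made for the example.

```python
import re
from decimal import Decimal

# Hypothetical patterns: a "TOTAL" line ending in an amount, and a DD.MM.YYYY date.
TOTAL_RE = re.compile(r"^\s*TOTAL\b\D*(\d+[.,]\d{2})\s*$", re.IGNORECASE | re.MULTILINE)
DATE_RE = re.compile(r"\b(\d{2})[./-](\d{2})[./-](\d{4})\b")

def extract_fields(text):
    """Return a dict with the fields that could be found in the raw receipt text."""
    fields = {}
    match = TOTAL_RE.search(text)
    if match:
        # Bring the amount to a standard form: decimal point, exact arithmetic.
        fields["total"] = Decimal(match.group(1).replace(",", "."))
    match = DATE_RE.search(text)
    if match:
        day, month, year = match.groups()  # assumes day-first dates
        fields["date"] = f"{year}-{month}-{day}"
    return fields

# Made-up OCR output; lines without a known field are simply ignored.
sample = (
    "SHOP SRL\n"
    "BREAD 1 x 3,50\n"
    "MILK 2 x 4,25\n"
    "TOTAL LEI 12,00\n"
    "DATE 05.03.2015\n"
    "THANK YOU"
)
fields = extract_fields(sample)
```

Lines that match neither pattern are discarded, while the matched values are normalized before they would be stored.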

This paper investigates the use of Random Forests (Breiman 2001) and Support Vector Machines (SVM) (Cortes 1995) in developing an OCR engine that is tailored for receipts. By taking into consideration the constraints imposed by receipts, on both document layout and text font and spacing, we can improve performance compared to more general OCR engines. The introduced approach is novel since, to our knowledge, other papers on extracting information from receipts focus either on improving the image preprocessing step or on parsing text that was extracted using an off-the-shelf OCR engine.

The rest of the paper is structured as follows.

Section \ref{sec:statement} describes the problem statement and the motivation behind developing such an OCR engine. Section \ref{sec:lit_rev} presents some of the related work in the field of character recognition using machine learning approaches. In Section \ref{sec:method} we present the approach used in developing our OCR engine. The experiments we performed to evaluate our approach are described in Section \ref{sec:exp}. Section \ref{sec:disc} contains an analysis of our approach and a comparison to other OCR engines. Finally, we provide conclusions and pointers towards future work.

Problem statement and relevance

\label{sec:statement} In this section we present the problem of OCR and we provide the motivation behind developing an OCR engine.

Optical Character Recognition. Background

The first OCR engines used a set of handwritten rules to identify characters (Shepard 1971). These rules were hard to write and performed quite poorly on text written in new fonts. Such systems work well on documents that are highly standardized and that always have the same font in the same place, such as passports, bank checks or credit cards.

More modern OCR engines use a machine learning algorithm to learn the rules by which to classify the characters (Smith 2007). They are more accurate and can easily learn to identify multiple fonts. While the rules are not handwritten, obtaining the labeled data for the machine learning algorithm is still manual and tedious work.

The general problem of object recognition is still difficult for computers, even though humans solve it effortlessly. Recently, however, several breakthroughs in computer vision have yielded very good performance, in some cases even better than human performance, on simpler problems such as character recognition.

OCR for Receipts. Motivation

Usually, an OCR engine has two main components: a document layout analysis part and a character recognition part.

In the case of receipts, the document layout is quite simple: all the text is on horizontal lines, there are no tables, and few figures, usually consisting of the logo of the shop. Identifying the lines can be done quite efficiently by looking at the color histogram of the receipt.
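A minimal sketch of this idea follows, under the assumption that the "histogram" is read as a horizontal projection profile, that is, the amount of ink in each pixel row of a binarized image. The function name `find_text_lines` and the threshold `min_ink` are hypothetical and are not taken from the engine described in this paper.

```python
import numpy as np

def find_text_lines(binary, min_ink=1):
    """Return half-open (top, bottom) row ranges of the text lines.

    `binary` is a 2-D array with ink pixels equal to 1 and background 0.
    `min_ink` (hypothetical) is the minimum ink count for a row to be text.
    """
    # Horizontal projection profile: number of ink pixels in every row.
    profile = binary.sum(axis=1)
    is_text = profile >= min_ink

    lines = []
    start = None
    for row, flag in enumerate(is_text):
        if flag and start is None:
            start = row                  # a text line begins here
        elif not flag and start is not None:
            lines.append((start, row))   # first empty row closes the line
            start = None
    if start is not None:                # line runs to the bottom edge
        lines.append((start, len(is_text)))
    return lines

# Synthetic 10x8 "receipt" containing two horizontal bands of ink.
page = np.zeros((10, 8), dtype=int)
page[1:3, 2:6] = 1
page[6:9, 1:7] = 1
lines = find_text_lines(page)
```

Because receipt text sits on horizontal lines, a single pass over the row profile is enough here. On real photographs the image would first need binarization and deskewing, which this sketch leaves out.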

The character recognition is more complicated. While in a book the letters are almost always well separated and the lines have similar length, receipts are smaller, so they have less space available and everything is compressed as much as possible. As a result, many letters end up touching, as shown in Figure \ref{fig:line}, and lines are of uneven length and have different justifications (left, right and center justifications alternate many times in a receipt). Consequently, before character recognition can be done, lines must be segmented into their component letters.
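To make the segmentation problem concrete, the sketch below proposes candidate cut columns from the vertical projection profile of a single text line. It is a simple heuristic chosen for illustration, not the Random Forest segmentation developed in this paper; in a learned approach, a classifier could decide which candidates to keep. The name `candidate_cuts` and the threshold `max_ink` are hypothetical.

```python
import numpy as np

def candidate_cuts(line_img, max_ink=1):
    """Return column indices where a text-line image could be split.

    `line_img` is a 2-D array with ink pixels equal to 1 and background 0.
    `max_ink` (hypothetical) is the most ink a candidate column may contain.
    """
    # Vertical projection profile: number of ink pixels in every column.
    profile = line_img.sum(axis=0)
    cuts = []
    for col in range(1, len(profile) - 1):
        left, here, right = profile[col - 1], profile[col], profile[col + 1]
        # Local minimum: no more ink than both neighbours, less than one of them.
        is_minimum = here <= left and here <= right and (here < left or here < right)
        if here <= max_ink and is_minimum:
            cuts.append(col)
    return cuts

# Two solid 4-pixel-wide glyphs joined by a one-pixel bridge in column 4.
line = np.ones((5, 9), dtype=int)
line[:, 4] = 0
line[2, 4] = 1
cuts = candidate_cuts(line)
```

A column that is completely empty separates well-spaced letters trivially. The harder case is the thin bridge between touching letters, which is why the threshold tolerates a small amount of ink.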

In this paper we investigate how best to perform character segmentation and then recognition, using only the data available from the image.

\label{fig:line} Example of a line in which many characters are touching.