Problem statement and relevance

\label{sec:statement} In this section we present the problem of optical character recognition (OCR) and motivate the development of an OCR engine for receipts.

Optical Character Recognition. Background

The first OCR engines used a set of hand-crafted rules to identify characters~\cite{shepard1971reading}. These rules were hard to write and performed quite poorly on text printed in new fonts. Systems of this kind work well only on highly standardized documents that always have the same font in the same place, such as passports, bank checks, or credit cards.

More modern OCR engines use a machine learning algorithm to learn the rules by which characters are classified~\cite{smith2007overview}. They are more accurate and can easily learn to identify multiple fonts. Although the rules are no longer written by hand, obtaining labeled data for the machine learning algorithm is still manual and tedious work.

The general problem of object recognition is still difficult for computers, even though humans solve it effortlessly. Recently, however, several breakthroughs in computer vision have given very good performance on simpler problems such as character recognition, in some cases even better than human performance.

OCR for Receipts. Motivation

An OCR engine usually has two main components: document layout analysis and character recognition.

In the case of receipts, the document layout is quite simple: all the text lies on horizontal lines, there are no tables, and the few figures usually consist of the shop's logo. The lines can be identified quite efficiently by looking at the color histogram of the receipt.
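As an illustration of the line identification step, the sketch below counts dark pixels in each row of a grayscale image and groups consecutive rows that contain ink into text lines. This is only one possible reading of the histogram idea (a row-wise projection profile), not necessarily the method used in this paper. The function name \texttt{find\_text\_lines}, the parameters \texttt{dark\_threshold} and \texttt{min\_dark\_pixels}, and the synthetic test image are assumptions made for the example.

```python
import numpy as np


def find_text_lines(gray, dark_threshold=128, min_dark_pixels=1):
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink.

    gray is a 2-D array of grayscale values (0 = black, 255 = white).
    The threshold values are illustrative assumptions, not tuned settings.
    """
    # Row-wise histogram of dark pixels (a horizontal projection profile).
    dark_per_row = (gray < dark_threshold).sum(axis=1)
    is_text = dark_per_row >= min_dark_pixels

    lines = []
    start = None
    for row, has_ink in enumerate(is_text):
        if has_ink and start is None:
            start = row                    # first row of a new text line
        elif not has_ink and start is not None:
            lines.append((start, row))     # blank row closes the current line
            start = None
    if start is not None:
        # The image ended while still inside a text line.
        lines.append((start, len(is_text)))
    return lines


# Synthetic 12x20 "receipt": white background with two dark bands of text.
receipt = np.full((12, 20), 255, dtype=np.uint8)
receipt[2:5, 3:15] = 0
receipt[7:10, 1:18] = 0
lines = find_text_lines(receipt)
```

On a real receipt the image would first need to be deskewed and binarized, and \texttt{min\_dark\_pixels} would have to be raised to ignore noise; both steps are outside the scope of this sketch.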

Character recognition is more complicated. In a book the letters are almost always well separated and the lines have similar lengths. Receipts are smaller, so less space is available and everything is compressed as much as possible. As a result, many letters end up touching, as shown in Figure~\ref{fig:line}, and lines are of uneven length and have different justifications (left, right, and center justification alternate many times within a receipt). Consequently, lines must be segmented into their constituent letters before character recognition can be done.

In this paper we investigate the best ways to perform character segmentation and then recognition, using only the data available in the image.