Roland Szabo edited problem_statement.tex almost 10 years ago

Commit id: 1567bb7f9db2289345c73578d02f80164831aa96

\section{Problem Statement and Relevance}
In this section we present the problem of OCR and the motivation for developing an OCR engine.

\subsection{Optical Character Recognition. Background}
The first OCR engines used a set of handwritten rules to identify characters. These rules were hard to write and performed poorly on text set in new fonts. Such systems work well on highly standardized documents that always have the same font in the same place, such as passports or bank checks.

More modern OCR engines use a machine learning algorithm to learn the rules by which to classify characters. They are more accurate and can easily learn to identify multiple fonts. Although the rules are no longer handwritten, obtaining labeled data for the machine learning algorithm is still manual and tedious work.

The general problem of object recognition remains difficult for computers, even though humans do it effortlessly. Recently, however, several breakthroughs in computer vision have given very good performance, in some cases even better than human performance, on simpler problems such as character recognition.

\subsection{OCR for receipts. Motivation}
Usually, an OCR engine has two main components: a document layout analysis part and a character recognition part. In the case of receipts, the document layout is quite simple: all the text is on horizontal lines, there are no tables, and there are few figures, usually just the logo of the shop. Identifying the lines can be done quite efficiently by looking at the color histogram of the receipt.

The character recognition is more complicated. While in a book the letters are almost always well separated and the lines have similar length, receipts are smaller, so they have less space available and everything is compressed as much as possible.
As a result, many letters end up touching, and lines are of uneven length and have different justifications (left, right, and center justification alternate many times within a receipt). Consequently, before character recognition can be done, each line must be segmented into its constituent letters. In this paper we investigate the best ways to perform character segmentation and then recognition, using only the data available from the image.
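
To make the histogram-based line detection mentioned above more concrete, the following is a minimal sketch rather than the method used in this work. It assumes that the histogram is a per-row count of dark pixels, that the receipt is a grayscale image stored as nested lists with values from 0 (ink) to 255 (paper), and that a threshold of 128 separates ink from paper. The function names \texttt{row\_ink\_profile} and \texttt{find\_text\_lines} are hypothetical and introduced only for illustration.

```python
def row_ink_profile(image, ink_threshold=128):
    """Count dark ("ink") pixels in each row of a grayscale image.

    `image` is a list of rows; each row is a list of intensities 0..255.
    The threshold of 128 is an assumption, not a value from the paper.
    """
    return [sum(1 for px in row if px < ink_threshold) for row in image]


def find_text_lines(image, ink_threshold=128, min_ink=1):
    """Return (top, bottom) row ranges of text lines, bottom exclusive.

    A text line is a maximal run of consecutive rows whose ink count
    is at least `min_ink`; rows below that count act as separators.
    """
    profile = row_ink_profile(image, ink_threshold)
    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                  # first inked row of a new line
        elif ink < min_ink and start is not None:
            lines.append((start, y))   # blank row closes the current line
            start = None
    if start is not None:              # image ends while inside a line
        lines.append((start, len(profile)))
    return lines


# Tiny synthetic "receipt": B = ink, W = paper.
W, B = 255, 0
receipt = [
    [W, W, W, W, W],
    [W, B, B, W, W],
    [W, B, W, B, W],
    [W, W, W, W, W],
    [W, W, B, B, B],
    [W, W, W, W, W],
]
text_lines = find_text_lines(receipt)
```

Applying the same profile column by column inside a detected line would separate only letters that have a blank column between them. Touching letters produce no such gap, which is why character segmentation on receipts needs more than this simple projection.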