Roland Szabo edited introduction.tex almost 10 years ago

Commit id: 4e6592280b2da85e60a4193b238623b84d0a0c4a

OCR is a difficult problem even on scanned pages that are flat and free of creases: the letters must first be located on the page and distinguished from tables, figures and other objects that might be there, and many kinds of fonts have to be recognized. OCR on photographed documents is even more difficult, because the illumination can vary, the document might be curved, the perspective might be skewed, and so on.

Information extraction is the process of taking raw, unstructured text and outputting information that is parsed and structured. This makes it easier to search and store the necessary data, because parts of the text that contain nothing useful can be discarded, while the rest is processed, brought to a standard form and stored in a database.

This paper investigates the use of Random Forests and Support Vector Machines (SVM) in developing an OCR engine that is tailored for receipts.

The rest of the paper is structured as follows. Section II describes the problem statement and the motivation behind developing such an OCR engine. Section III presents some of the related work in the field of character recognition using machine learning approaches. In Section IV we present the approach used in developing our OCR engine. In Section V we describe the experiments done to find the best approach. Section VI contains an analysis of our approach and a comparison to other OCR engines. Finally, we provide conclusions and pointers towards future work.