A Novel Machine Learning Based Approach for Retrieving Information from Receipt Images

Abstract

In this paper we approach, from a machine learning perspective, the problem of performing optical character recognition (OCR) on receipt images and then extracting structured information from the resulting text. Tools that have not been trained specifically for this kind of image usually handle it poorly, because receipts use custom fonts and, due to size constraints, many letters are printed close to each other. We adapt existing OCR methods in order to achieve better performance than off-the-shelf commercial OCR engines and to extract information from receipts as accurately as possible. Document layout analysis is first performed on the receipts, then lines are segmented into characters using Random Forests, and finally the characters are classified using linear Support Vector Machines. We provide an experimental evaluation of the proposed approach, as well as an analysis of the obtained results.

Introduction

Optical character recognition (OCR) is the ability of a computer to extract textual information from image data, such as pixels (Schantz 1982). This is useful because, once the data available on paper has been entered into a computer, it can be processed, indexed and searched much faster.

OCR is a difficult problem even on flat, uncreased, scanned pages, because we must first identify the letters on the page and distinguish them from tables, figures and other objects tha