AMI-diagram: Mining Facts from Images
There are at least 10 million diagrams published in the scientific literature each year and many of them represent factual information. AMI-Diagram is a flexible tool which can mine facts from diagrams and convert the graphics primitives into XML. The targets include X-Y plots, barcharts, chemical structure diagrams and phylogenetic trees. AMI can ingest born-digital diagrams either as latent vectors (converted from Postscript), pixel diagrams (PNGs and JPEgs) or scanned documents. For high-quality/resolution diagrams the process is automatic; commandline parameters can be used for noisy or complex diagrams. AMI is part of the ContentMine framework (contentmine.org) for automatically extracting science from the published literature.
Over 1 million scientific articles are published yearly and a similar amount of theses and grey literature. Many contain diagrams, such as graphs or domain-specific objects, representing factual information and often this is the primary way of communicating the information contained (e.g. molecular structure diagrams). Almost all diagrams are now born digital (i.e. the output is written directly from a program to file). The originating programs include generic plotting packages (GNUPlot, R, Excel), specialist editors such as JChempaint or Chemdraw for molecules, or are generated directly from instruments (e.g. spectra). The plots are usually high resolution, either scalable vectors + text (such as SVG or Postscript derivatives) or large pixel maps, often between 1 million and 20 million pixels.
Since most scientific data is never published (estimates are often > 80% loss), extraction of data from images can be a vital source of semantic data. Traditional, labour intensive approaches include pencil and ruler, or cutting out peaks and weighing the paper and these are still, unfortunately, used today. Authors are reluctant to save data publicly; the Treebase database (http://treebase.org/treebase-web/home.html) of phylogenetic trees only contains 4% of published trees.
Converting a semantic object to vector or pixel graphics loses most of the information. However in some domains it is possible to combine computer vision technqiues with machine-learning or rules/heuristics to recover the likely generating object. Moreover, ambiguity can often be resolved by lookup against public semantic data (e.g. dbpedia.org) or recomputing the object. We have therefore developed image and vector processing technology which can reconstruct semantic data from a wide range of diagrams. Users may start with PDF documents, PNG or JPEG diagrams, or other sources of vectors (Word or Powerpoint EWF, PostScript, etc.). AMI is a work-in-progress being deployed to alpha-testers especially in chemistry and phylogenetics.
The overall process is:
There is often an advantage in knowing the style of a journal or generating program. Collaboration is very useful here and the AMI framework is developed so that users can add in plugins (AMI uses the Visitor pattern). A Visitor can be tailored to a specific journal or domain of science.
We have tried many methods including Hough line transforms, erosion (e.g BoofCV), and histogram equalisation. The following are the problems and approaches that we have found most appropriate for modern scientific articles. We warn that articles before ca. 2000 may have poor typography with less systematic presention, and this makes it harder to create simple heuristics.