PREPRINT authorea.com/8535
    AMI-diagram: Mining Facts from Images

    Introduction

    There are at least 10 million diagrams published in the scientific literature each year and many of them represent factual information. AMI-Diagram is a flexible tool which can mine facts from diagrams and convert the graphics primitives into XML. The targets include X-Y plots, bar charts, chemical structure diagrams and phylogenetic trees. AMI can ingest born-digital diagrams either as latent vectors (converted from PostScript), pixel diagrams (PNGs and JPEGs) or scanned documents. For high-quality, high-resolution diagrams the process is automatic; command-line parameters can be used for noisy or complex diagrams. AMI is part of the ContentMine framework (contentmine.org) for automatically extracting science from the published literature.

    Background

    Over 1 million scientific articles are published yearly, along with a similar number of theses and grey-literature documents. Many contain diagrams, such as graphs or domain-specific objects, that represent factual information, and a diagram is often the primary way of communicating that information (e.g. molecular structure diagrams). Almost all diagrams are now born digital (i.e. the output is written directly from a program to file). The originating programs include generic plotting packages (gnuplot, R, Excel) and specialist editors such as JChemPaint or ChemDraw for molecules; other diagrams are generated directly by instruments (e.g. spectra). The plots are usually high resolution, either scalable vectors + text (such as SVG or PostScript derivatives) or large pixel maps, often between 1 million and 20 million pixels.

    Since most scientific data is never published (estimates of the loss often exceed 80%), extraction of data from images can be a vital source of semantic data. Traditional, labour-intensive approaches include pencil and ruler, or cutting out peaks and weighing the paper; unfortunately, these are still used today. Authors are reluctant to save data publicly; the Treebase database (http://treebase.org/treebase-web/home.html) of phylogenetic trees contains only 4% of published trees.

    Overview

    Converting a semantic object to vector or pixel graphics loses most of the information. However, in some domains it is possible to combine computer vision techniques with machine learning or rules/heuristics to recover the likely generating object. Moreover, ambiguity can often be resolved by lookup against public semantic data (e.g. dbpedia.org) or by recomputing the object. We have therefore developed image and vector processing technology which can reconstruct semantic data from a wide range of diagrams. Users may start with PDF documents, PNG or JPEG diagrams, or other sources of vectors (Word or Powerpoint EWF, PostScript, etc.). AMI is a work in progress being deployed to alpha-testers, especially in chemistry and phylogenetics.

    The overall process is:

    1. dissect and restructure PDFs and extract images.
    2. transform raw images into SVG.
    3. associate SVG with extracted captions to add semantics and classification.
    4. from the SVG primitives build domain-independent mid-level graphics objects (boxes, circles, grids, annotations, symbols, etc.)
    5. use domain-specific heuristics from the classification to create high-level semantic objects (x-y plots, molecular structures, phylogenetic trees, maps, etc.)

    There is often an advantage in knowing the style of a journal or generating program. Collaboration is very useful here and the AMI framework is developed so that users can add in plugins (AMI uses the Visitor pattern). A Visitor can be tailored to a specific journal or domain of science.
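
    As a rough illustration of the plugin idea, the following minimal Python sketch shows how a Visitor can carry journal- or domain-specific logic while the diagram object stays generic. All class and field names here (Diagram, LineCountVisitor, TreeCaptionVisitor, the tuple layout of primitives) are hypothetical and chosen for illustration only; they are not AMI's actual classes.

```python
from dataclasses import dataclass, field


@dataclass
class Diagram:
    """Hypothetical container for one extracted diagram (not AMI's real class)."""
    caption: str
    primitives: list = field(default_factory=list)  # e.g. ("line", x1, y1, x2, y2)

    def accept(self, visitor):
        # Double dispatch: the diagram hands itself to whichever visitor is plugged in.
        return visitor.visit(self)


class Visitor:
    """Base plugin interface; subclasses hold journal- or domain-specific rules."""

    def visit(self, diagram):
        raise NotImplementedError


class LineCountVisitor(Visitor):
    """Domain-independent example: count the line primitives."""

    def visit(self, diagram):
        return sum(1 for p in diagram.primitives if p[0] == "line")


class TreeCaptionVisitor(Visitor):
    """Domain-specific example: flag diagrams whose caption suggests a tree."""

    def visit(self, diagram):
        return "phylogen" in diagram.caption.lower()


diagram = Diagram(
    caption="Figure 2. Phylogenetic tree of the sampled taxa.",
    primitives=[("line", 0, 0, 10, 0), ("line", 10, 0, 10, 5), ("circle", 3, 3, 1)],
)
line_count = diagram.accept(LineCountVisitor())
looks_like_tree = diagram.accept(TreeCaptionVisitor())
```

    The point of the design is that a new journal style or scientific domain only requires a new Visitor subclass; the Diagram class itself does not change.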

    Interpreting pixel maps

    We have tried many methods including Hough line transforms, erosion (e.g. BoofCV), and histogram equalisation. The following are the problems and approaches that we have found most appropriate for modern scientific articles. We warn that articles before ca. 2000 may have poor typography with less systematic presentation, and this makes it harder to create simple heuristics.

    1. Colours. Binary images (black and white only) are simplest; gradients and dotted regions can cause problems. AMI separates colours into complementary pixel maps and can process each separately. Recombination is at the domain level (e.g. differently coloured subtrees).
    2. Noise (common in scanned documents), grayscales and antialiasing (very common) mean that background / threshold levels are sometimes critical. AMI can adjust these either under human control or by a simple adaptive optimisation.
    3. Bleeding and cavitation. Graphics primitives which are close often "bleed" into a single object; faint primitives may have holes. Where glyphs interbleed we separate them heuristically (by comparison with target glyphs).
    4. Thinning. AMI reduces lines and strokes to single pixel width using the Zhang-Suen approach and then tidies some redundant pixels.
    5. Character recognition (OCR). Traditional OCR methods (machine learning, correlations, moments and Mahalanobis) don't work well with scientific characters which are often rotated, isolated, have variable fonts, italic and/or bold and cover a wide range of Unicode (maths, Greek, symbols). We have developed a topological approach which is robust to distortion and scaling and can be combined with classical methods (bitwise correlation).
    6. Separation of objects. We identify objects by floodfill or by expanding borders. Overlap of different colours is often tractable, especially where these are primitives (lines, circles); we can sometimes resolve overlapping objects by creating a dictionary of primitives (e.g. symbols).
    7. Segmentation. PDFs and pixels do not support higher-level primitives, so AMI uses Douglas-Peucker segmentation to approximate curved strokes, where possible trying to fit them to circles.
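
    The thinning step (item 4) can be illustrated with a minimal pure-Python sketch of the textbook Zhang-Suen algorithm. It assumes a binary image stored as a list of lists with 1 for foreground and 0 for background, and a border of background pixels around the shape. It is not AMI's own code and omits the subsequent tidying of redundant pixels.

```python
def neighbours(img, r, c):
    """Return P2..P9: the 8 neighbours clockwise, starting from north."""
    return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1], img[r + 1][c + 1],
            img[r + 1][c], img[r + 1][c - 1], img[r][c - 1], img[r - 1][c - 1]]


def transitions(n):
    """Count 0 -> 1 steps around the circular sequence P2, P3, ..., P9, P2."""
    return sum(a == 0 and b == 1 for a, b in zip(n, n[1:] + n[:1]))


def zhang_suen(image):
    """Thin a binary image (1 = foreground) by repeated two-step boundary peeling."""
    img = [row[:] for row in image]  # work on a copy
    rows, cols = len(img), len(img[0])
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            # Border pixels are skipped, so the shape needs a background margin.
            for r in range(1, rows - 1):
                for c in range(1, cols - 1):
                    if img[r][c] != 1:
                        continue
                    p2, p3, p4, p5, p6, p7, p8, p9 = n = neighbours(img, r, c)
                    if not (2 <= sum(n) <= 6 and transitions(n) == 1):
                        continue
                    if step == 0:
                        ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if ok:
                        to_clear.append((r, c))
            # Pixels are cleared only after the whole sub-iteration has been scanned.
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    return img


# A 3-pixel-thick horizontal bar (rows 2-4, columns 2-9) in a 7 x 12 image.
bar = [[1 if 2 <= r <= 4 and 2 <= c <= 9 else 0 for c in range(12)] for r in range(7)]
thinned = zhang_suen(bar)
skeleton = {(r, c) for r in range(7) for c in range(12) if thinned[r][c]}
```

    For this bar the skeleton collapses onto the middle row, slightly shortened at both ends, which is typical of the method.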
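
    The segmentation step (item 7) can be sketched in the same spirit. The following is a minimal implementation of the standard recursive Douglas-Peucker simplification, assuming a stroke is an ordered list of (x, y) points and that the tolerance is in pixel units. The function names and the tolerance value are illustrative assumptions, and the fitting of segments to circles is not shown.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, tolerance):
    """Return a simplified polyline that stays within `tolerance` of the input."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, worst = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > worst:
            index, worst = i, d
    if worst <= tolerance:
        return [first, last]  # every interior point is close enough to the chord
    # Otherwise split at the farthest point and simplify each half.
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right  # drop the duplicated split point


# An L-shaped stroke with one-pixel jitter, as might come out of thinning.
stroke = [(0, 0), (2, 0), (4, 1), (6, 0), (8, 0), (8, 2), (8, 4), (9, 6), (8, 8)]
simplified = douglas_peucker(stroke, tolerance=1.5)
```

    With this tolerance the jitter is absorbed and only the two end points and the corner survive; a smaller tolerance keeps more of the intermediate points.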