Automatic Nutrition Extraction From Text Recipes


The science of nutrition deals with all the effects on people of any component found in food. This starts with the physiological and biochemical processes involved in nourishment — how substances in food provide energy or are converted into body tissues, and the diseases that result from insufficiency or excess of essential nutrients (malnutrition). The role of food components in the development of chronic degenerative diseases like coronary heart disease, cancers, dental caries, etc., is a major target of current research activity. There is growing interaction between nutritional science and molecular biology (esp. nutrigenomics), which may help to explain the action of food components at the cellular level and the diversity of human biochemical responses. However, in our daily lives we cook recipes made of ingredients, rather than dealing with raw food components. Beyond dietitians’ advice and guidelines, it’s difficult to continuously measure our daily nutritional intake without manually entering the weight and amount of each constituent ingredient. Apart from this manual process, effective nutritional intake also depends on the cooking process and the retention factors of the individual ingredients. To alleviate these difficulties we propose an algorithm and an accompanying web-based tool to automatically extract nutritional information from any text-based recipe.


Recipes show a tremendous amount of diversity in cooking styles and ingredients, some of which are highly community-, culture- or even country-specific. This diversity makes it challenging to design a system that can infer nutritional information with substantial accuracy and without much manual intervention. Although it’s possible to manually enter each ingredient from an enormous database, doing so is often time-consuming and impractical in our day-to-day lives. To automatically deduce nutritional information from textual recipes, we’ve segmented the core procedure into the following steps:

  • Information Extraction (IE) from text recipes, using Rule-based or NLP (Natural Language Processing) parser

  • Conversion to structured data - amount, unit, ingredient name and any modifiers (ex. “lightly beaten”)

  • Mapping of each ingredient to an existing food ontology (the USDA Food Database is used for demonstration purposes; it can be extended to other food databases like NUTTAB)

  • Deduction of weights from various lexical clues and ingredient densities

  • Deduction of the final nutritional information

Core Procedure \label{fig:core_procedure}
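As a rough illustration, the steps above can be strung together as a pipeline. Everything below — the class name, the density table, the per-gram calorie values — is an illustrative toy, not the actual implementation or real USDA data:

```ruby
# Toy sketch of the overall pipeline. All constants are illustrative
# placeholders, not USDA figures.
class NutritionPipeline
  GRAMS_PER_CUP = { 'flour' => 120.0, 'sugar' => 200.0 }  # assumed densities
  KCAL_PER_GRAM = { 'flour' => 3.6,   'sugar' => 3.9 }    # assumed values

  # Steps 1-2: extract amount, unit and ingredient name from one line.
  def extract(line)
    m = line.match(/(?<qty>\d+(?:\.\d+)?)\s+(?<unit>cups?)\s+(?<name>\w+)/)
    m && { qty: m[:qty].to_f, unit: m[:unit], name: m[:name] }
  end

  # Step 4: deduce weight in grams from the unit and ingredient density.
  def weight(ing)
    ing[:qty] * GRAMS_PER_CUP.fetch(ing[:name], 100.0)
  end

  # Step 5: accumulate nutritional information (here just calories).
  def run(recipe_text)
    recipe_text.lines
               .map { |l| extract(l) }
               .compact
               .sum { |ing| weight(ing) * KCAL_PER_GRAM.fetch(ing[:name], 0.0) }
  end
end
```

The real system replaces each stage with the components described in the following sections (PEG parsing, ontology mapping, lexical weight clues).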


Information Extraction

Instead of adopting a full-ML approach, we’ve tried to capture linguistic rules by representing them using regular expressions. The general idea behind this technique is specifying regular expressions that capture certain types of information. For example, the expression (watched|seen) (NP), where (NP) denotes a noun phrase, might capture the names of movies (represented by the noun phrase) in a set of documents. By specifying a set of rules like this, it is possible to extract a significant amount of information. The set of regular expressions is often implemented using finite-state transducers, which consist of a series of finite-state automata. We’ve used Citrus to declaratively define the parsing expression grammar (PEG) using the Ruby language. For example:

PEG recipe IE \label{fig:peg_recipe_ie}

The top-level rule is defined by ingredient_line, which is, in turn, a composition of a series of other rules like quantity, unit and base_ingredient, capturing the quantity, unit and ingredient name from each line of a typical recipe ingredient list. For example, unit is composed of the following:

PEG Composition \label{fig:peg_composition}

We’ve combined the grammar with NLP tools such as Part-Of-Speech (POS) taggers, noun phrase (NP) chunkers and stemmers/lemmatizers to make the extraction procedure more robust. Although we could have used a Machine Learning (ML) based parser, given the limited vocabulary of the recipe domain it seemed overkill.
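The grammar itself appears in the figures; as a simplified, regex-only stand-in for the same rule layering, one could write (rule names follow the text, but the patterns here are illustrative, not the actual grammar):

```ruby
# Simplified stand-in for the Citrus PEG described in the text.
# Rule names (quantity, unit, base_ingredient) follow the paper;
# the patterns themselves are illustrative.
QUANTITY = %r{\d+(?:\s*/\s*\d+)?(?:\.\d+)?}   # "2", "1/2", "1.5"
UNIT     = /(?:cups?|tablespoons?|tbsp|teaspoons?|tsp|ounces?|oz|pinch(?:es)?)/i

INGREDIENT_LINE =
  /\A(?<quantity>#{QUANTITY})?\s*(?<unit>#{UNIT})?\s*(?<base_ingredient>[^,]+)(?:,\s*(?<modifier>.+))?\z/

# Returns the structured form: amount, unit, ingredient name, modifiers.
def parse_ingredient(line)
  m = line.strip.match(INGREDIENT_LINE) or return nil
  { quantity: m[:quantity], unit: m[:unit],
    name: m[:base_ingredient].strip, modifier: m[:modifier] }
end

parse_ingredient('2 cups chickpeas, rinsed and drained')
# -> quantity "2", unit "cups", name "chickpeas", modifier "rinsed and drained"
```

The Citrus version composes the same rules declaratively and attaches semantic actions to each, which is harder to replicate with a single flat regular expression.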

Ontology Mapping

An ontology represents knowledge as a hierarchy of concepts within a domain, using a shared vocabulary to denote the types, properties and interrelationships of those concepts. In an approximate sense we can treat an existing food database as a Food Ontology (ex. food can have classifications, relationships etc.). Using the United States Department of Agriculture (USDA)’s National Nutrient Database for Standard Reference as a standard food ontology, the primary aim of this part of the algorithm is to map an ingredient input like 1 teaspoon vanilla to a specific node, Vanilla extract, in the database.

Instead of a traditional approach, we’ve used open source search engine ElasticSearch’s text analysis capabilities for the ontology mapping.

ElasticSearch Text Analysis. Token Filter will be referred as filter subsequently \label{fig:text_analysis}

The entire food database from USDA is indexed using ElasticSearch, and each food item is processed using the following analyzer:

Recipe Ontology Mapping Analyzer \label{fig:recipe_ontology_mapping_analyzer}
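The analyzer definition itself is shown in the figure; in ElasticSearch index settings, a custom analyzer along these lines would be a reasonable sketch (the exact filter chain is an assumption based on the filters mentioned in the text — food_synonym plus a snowball stemmer — with a standard tokenizer and lowercasing):

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "recipe_ontology_mapping": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "food_synonym", "snowball"]
        }
      }
    }
  }
}
```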

Out of the various filters configured, food_synonym is the most important for the mapping process. This filter uses a partially auto-generated file (many food items in the USDA database have common names) filled with frequently occurring words in the recipe domain and their equivalent or synonymous words in the indexed (USDA) data. For example (using the Solr synonym format):

Synonym File \label{fig:synonym_file}
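The figure shows the actual file; a few entries in Solr synonym format might look like the following (these particular mappings are illustrative, not taken from the real food_synonym.txt):

```text
# Hypothetical entries in Solr synonym format; the real file is
# partially auto-generated from USDA common names.
vanilla => vanilla extract
garbanzo, garbanzo beans => chickpeas
scallion, green onion => onions, spring or scallions
```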

Using the above file (food_synonym.txt), the food_synonym filter is built as:

Synonym Filter \label{fig:synonym_filter}
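The filter definition appears in the figure; in ElasticSearch settings a synonym token filter of this kind is typically declared along these lines (the file path is taken from the text, the remaining parameters are assumed):

```json
{
  "filter": {
    "food_synonym": {
      "type": "synonym",
      "synonyms_path": "food_synonym.txt",
      "format": "solr"
    }
  }
}
```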

We use a Fuzzy Like This Query (FLTQ) with the raw ingredient obtained in the extraction process and a specific configuration of the max_query_terms and fuzziness parameters. This query fuzzifies all terms provided as strings and then picks the best differentiating terms. In effect this mixes the behaviour of FuzzyQuery and MoreLikeThis, but with special consideration of fuzzy scoring factors. Instead of using a single “analyzed” index field, we also store an additional index field (description.simple) which doesn’t process the tokenized stream using a snowball filter (stemmer). This increases the precision of our query and overall mapping process.
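For an ingredient such as vanilla extract, the query body might look like this (fuzzy_like_this was available in the ElasticSearch versions of that era; the field names follow the text, while the parameter values shown are illustrative):

```json
{
  "query": {
    "fuzzy_like_this": {
      "fields": ["description", "description.simple"],
      "like_text": "vanilla extract",
      "max_query_terms": 12,
      "fuzziness": 1
    }
  }
}
```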

Multi-field Indexing for Accuracy \label{fig:multi_field_indexing_accuracy}

To optimize the overall mapping process we use ElasticSearch's Multi-Search API to map all ingredients of a given recipe to their respective food item nodes in the ontology.
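The Multi-Search API batches one header/body pair per ingredient into a single request, so a whole recipe resolves in one round trip. A sketch of the request (the index name usda is a hypothetical placeholder):

```text
POST /_msearch
{"index": "usda"}
{"query": {"fuzzy_like_this": {"fields": ["description"], "like_text": "chickpeas"}}}
{"index": "usda"}
{"query": {"fuzzy_like_this": {"fields": ["description"], "like_text": "vanilla extract"}}}
```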

Lexical Clues

After the parsing and mapping phases, one critical step is to determine the overall weight of each ingredient. This is perhaps the most complex step in the whole process, and the subsequent nutritional calculation depends heavily upon it. It is complicated by ingredient listings like:

  • Pinch of salt to taste

  • Two 15-ounce cans chickpeas (4 cups), rinsed and drained

pinch is a very common kitchen unit and has to be appropriately handled for weight calculation. Similarly, the second ingredient (chickpeas) has a weight hint given in its description (4 cups). Identifying these lexical clues and incorporating them into the weight deduction is achieved in this step. This critical step is often overlooked in the classical Information Extraction literature and is discussed by (Badra 2011).
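A toy sketch of how such clues can feed the weight deduction — the gram value for a pinch and the cup density used here are illustrative assumptions, not values from the actual system:

```ruby
# Toy weight deduction driven by lexical clues. The constants are
# illustrative assumptions, not USDA figures.
PINCH_GRAMS   = 0.3                       # assumed weight of one "pinch"
GRAMS_PER_CUP = { 'chickpeas' => 164.0 }  # assumed density table

def deduce_grams(qty, unit, name, description = '')
  case unit
  when /pinch/i then qty * PINCH_GRAMS
  when /ounce/i then qty * 28.35
  when /cup/i   then qty * GRAMS_PER_CUP.fetch(name, 120.0)
  else
    # Fall back to a parenthetical weight hint like "(4 cups)".
    if (hint = description.match(/\((\d+(?:\.\d+)?)\s*cups?\)/i))
      hint[1].to_f * GRAMS_PER_CUP.fetch(name, 120.0)
    end
  end
end

# "Two 15-ounce cans chickpeas (4 cups)": the unknown unit "cans" is
# resolved via the "(4 cups)" hint in the description.
deduce_grams(2, 'cans', 'chickpeas', 'Two 15-ounce cans chickpeas (4 cups)')
```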

Nutritional Information

The Recommended Dietary Intake (RDI) is consulted in order to display the accumulated values of the various macro- and micro-nutrients of the recipe. There are two complications:

  • RDI values of some nutrients (ex. Cholesterol, Dietary Fiber etc.) depend on the total calorie intake

  • RDI values are complex functions of age (life-stage) and special conditions or diseases (ex. diabetes)

The proposed system gracefully handles all of these and generates a more personalized nutritional annotation for the given recipe. For example, the Creme Brulee Oatmeal recipe has the following nutritional profile:

Nutritional Label for Recipe
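A minimal sketch of the calorie-dependent case: the 14 g of dietary fiber per 1000 kcal guideline and the 300 mg cholesterol cap are commonly cited reference values, but treat them here as illustrative stand-ins for the full RDI tables:

```ruby
# Sketch of calorie-dependent RDI handling. The two reference values
# are commonly cited guidelines, used here purely as illustrations.
def rdi_percent(nutrient, amount, total_kcal: 2000.0)
  rdi = case nutrient
        when :dietary_fiber then 14.0 * total_kcal / 1000.0 # g, scales with calories
        when :cholesterol   then 300.0                      # mg, fixed daily cap
        end
  (100.0 * amount / rdi).round(1)
end

rdi_percent(:dietary_fiber, 7.0)                     # against 28 g at 2000 kcal
rdi_percent(:dietary_fiber, 7.0, total_kcal: 1500.0) # against 21 g at 1500 kcal
```

Age, life-stage and condition-specific adjustments would be additional lookups layered on top of this calorie scaling.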


  1. Fadi Badra, Sylvie Despres, Rim Djedidi. Ontology and lexicon: the missing link. In Workshop Proceedings of the 9th International Conference on Terminology and Artificial Intelligence, 16–18 (2011).
