Annotating phenotypes using ontological concepts: Inter-curator consistency as a baseline for evaluating the performance of a natural language processing system.


Organismal phenotypes are a principal object of study in multiple areas of biology, and so natural language statements about phenotypes are abundant in the biological literature. To render the semantics of such statements amenable to large-scale semantic reasoning by machines, it is necessary to extract the entities and relations from these statements and annotate them using ontology concepts. Annotating phenotype information in this way is normally done by human curators, and is a time-consuming process strongly limited by the availability of subject experts. We have developed a software tool, Semantic CharaParser, with the aim of semi-automating the process of taking phenotype statements from published phylogenetic literature and expressing them in the form of Entity-Quality (EQ) annotations, in which a bearer entity, usually representing an anatomical structure, is described by a quality (such as a particular shape) and may have relations to other entities. To our knowledge, this is the first semi-automatic software prototype designed to generate formal EQ annotations. Here, we evaluate machine curation performance by using the consistency in annotations among human curators as a baseline. We use four ontology-based metrics to compute inter-curator consistency that consider both semantically identical and semantically similar matches. The consistency of Semantic CharaParser with curators is significantly lower than the consistency among human curators (average 35% across the four metrics), and we explore what factors might explain the difference. We hypothesized that human curators took advantage of access to knowledge external to the text and so unavailable to Semantic CharaParser. However, we found that curators’ access to external knowledge did not improve inter-curator consistency, nor did it make human annotations differ more from machine curation. We did find that software performance was significantly enhanced (26% average improvement across the four metrics) after new ontology terms relevant to the input text had been added by human curators. These findings point toward ways to design knowledge extraction software for phenotype curation that can best complement and augment human curators.


Phenotypic descriptions of organisms are used in almost all areas of biological research including biomedical science, evolution, developmental biology, and paleobiology. The vast majority of such descriptions are expressed in the scientific literature using natural language. While allowing for rich semantics, natural language descriptions are opaque to machine reasoning, and thus hinder the integration of phenotypic information across different studies, taxonomic systems and branches of biology (Smith 2007).

To make phenotype descriptions more amenable to computation, model organism databases employ human curators to convert natural language phenotype descriptions into machine-readable phenotype annotations that use standard ontologies (e.g., Howe 2011, Bradford 2011, Bowes 2008, Blake 2009). Human curation takes advantage of human experts’ domain knowledge and their ability to resolve the inherent ambiguities in natural language. However, manual curation is extremely labor-intensive, and few projects have the resources to curate all the relevant literature. This motivates the development of text mining and natural language processing (NLP) systems that can augment the work of human curators.

A standard format for phenotype annotations is the ontology-based Entity-Quality (EQ) representation (Mungall 2007, Mungall 2010), in which an entity represents a biological object such as an anatomical structure, an anatomical space, or a biological process, and a quality represents a trait or property that an entity possesses, e.g., shape, color, or size (Table 1). Other representations of phenotypic descriptions have been discussed by Loebe et al. (2012). To create entities and qualities that adequately represent phenotypic descriptions, curators often create complex logical expressions called post-compositions by combining ontology terms, relations, and spatial properties in different ways. Owing to the complexity of creating post-composed entities and qualities, flexibility in ontological syntax, and different interpretations of semantics in phenotypic descriptions, EQ statements created by multiple curators are expected to have an inherent level of variability.
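
As a rough illustration of the EQ structure described above, the following Python sketch models an annotation as an entity, a quality, and an optional related entity, and builds a simple post-composed entity as a string. The class name `EQAnnotation`, its `render` method, and the helper `part_of` are hypothetical names introduced only for this example; they are not part of Semantic CharaParser or of any curation tool, and ontology terms are written as labels rather than identifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EQAnnotation:
    """Hypothetical container for one Entity-Quality annotation."""
    entity: str                           # e.g. an anatomical structure
    quality: str                          # e.g. a shape, color, or size term
    related_entity: Optional[str] = None  # only used by relational qualities

    def render(self) -> str:
        """Return a compact, human-readable form of the annotation."""
        parts = [f"E: {self.entity}", f"Q: {self.quality}"]
        if self.related_entity is not None:
            parts.append(f"RE: {self.related_entity}")
        return "; ".join(parts)


def part_of(part: str, whole: str) -> str:
    """Post-compose an entity: `part` restricted to things that are part of `whole`."""
    return f"{part} and (part_of some {whole})"


# A simple EQ, and an EQ whose quality relates two entities.
simple = EQAnnotation("UBERON:dorsal fin lepidotrichium", "PATO:unbranched")
relational = EQAnnotation(
    "UBERON:nasal bone", "PATO:in contact with", "UBERON:prefrontal bone"
)
# A post-composed entity assembled from two ontology terms (illustrative only).
composed = EQAnnotation(part_of("UBERON:gland", "UBERON:pelvis"), "PATO:absent")

for annotation in (simple, relational, composed):
    print(annotation.render())
```

Keeping the related entity optional reflects that only some qualities, such as "in contact with", relate two entities. Storing post-compositions as plain strings is a simplification; a real system would more plausibly keep them as structured class expressions so that a reasoner can compare them.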

Curators converting natural language into EQ statements can not only draw on their expert knowledge of the domain but may also take advantage of external publications and other sources of information to resolve questions of interpretation. Phenotype descriptions in the literature are often brief, and there is the potential for varying interpretation of the authors’ original meaning based on the curator’s background and what resources are consulted. In contrast to a human curator, a software curation tool will only have access to the domain knowledge that is provided as input, typically the input ontologies and some form of learned lexicon. While the defined inputs and controlled method of software curation will tend to lead to less variable outcomes, it may be harder for software to achieve the same level of accuracy.

Ontology-based curation, human or machine, relies on the availability of appropriate concepts in the input ontologies used for curation. A human curator may recognize that a certain phenotype description requires an ontology term that is not currently available (Balhoff 2014), a judgement that would be difficult to reliably automate in software.

Given the inherent variability in human curation, and the disadvantages that software systems have relative to humans for this task, it is informative to compare the outputs of machine curation to those of multiple, equally expert, human curators. A performance goal for a machine curation system would be to fall within the envelope of variability among human curators. Furthermore, by measuring inter-curator consistency, it is also possible to quantify the importance of external knowledge and ontology completeness.

Here, we present Semantic CharaParser, a semi-automated software tool for EQ semantic annotation that takes textual character descriptions of phenotypes from phylogenetic matrices and generates EQ statements that represent the phenotypic descriptions. During EQ semantic annotation, Semantic CharaParser recognizes pieces of text describing entities, qualities, and relations; performs concept recognition and ontology matching to identify appropriate term(s) in entity and/or quality ontologies; uses relations to create post-composed logical expressions to describe entities and qualities if necessary; and composes formal EQ annotations. Semantic CharaParser thus goes beyond the task of concept/entity recognition: it creates formal EQ statements by associating entity expressions with the appropriate quality expressions.

For the reasons given above, we evaluate the performance of Semantic CharaParser relative to a set of human curators rather than to a single gold standard. We explore the role of ontology completeness and access to external knowledge in contributing to variability among human and machine curators. We also explore how much variability is reduced by requiring the annotation to include only an entity and not a quality. We measure inter-curator consistency using four ontology-aware metrics: two traditional semantic similarity metrics (Pesquita 2009) and extensions of Precision and Recall that account for partial similarities. Our measures contrast with those based on binary scoring, in which annotations that are not perfectly identical are treated as having no similarity at all.
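
To make the contrast with binary scoring concrete, the sketch below shows one possible way to give partial credit when two annotations use different but related ontology terms. It is an illustration under stated assumptions, not the definition of the metrics used in this study: the toy hierarchy `PARENTS`, the use of Jaccard similarity over ancestor sets, and the best-match averaging in `partial_precision` and `partial_recall` are all choices made for this example.

```python
from typing import Dict, List, Set

# Hypothetical toy is_a hierarchy (child -> parents); not taken from any real ontology.
PARENTS: Dict[str, List[str]] = {
    "dorsal fin ray": ["fin ray"],
    "pectoral fin ray": ["fin ray"],
    "fin ray": ["skeletal element"],
    "nasal bone": ["bone"],
    "bone": ["skeletal element"],
    "skeletal element": [],
}


def ancestors(term: str) -> Set[str]:
    """Return the term together with all of its ancestors in PARENTS."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two terms' ancestor sets (1.0 for identical terms)."""
    set_a, set_b = ancestors(a), ancestors(b)
    return len(set_a & set_b) / len(set_a | set_b)


def partial_precision(candidate: List[str], reference: List[str]) -> float:
    """Mean, over candidate terms, of the best similarity to any reference term."""
    if not candidate or not reference:
        return 0.0
    best = [max(jaccard(c, r) for r in reference) for c in candidate]
    return sum(best) / len(candidate)


def partial_recall(candidate: List[str], reference: List[str]) -> float:
    """Mean, over reference terms, of the best similarity to any candidate term."""
    return partial_precision(reference, candidate)


machine = ["pectoral fin ray", "nasal bone"]  # hypothetical machine output
curator = ["dorsal fin ray"]                  # hypothetical curator annotation
print(partial_precision(machine, curator), partial_recall(machine, curator))
```

In this toy case no machine term is identical to the curator's term, so binary scoring would give zero precision and zero recall. The partial scores come out at about 0.35 and 0.5 because "pectoral fin ray" and "dorsal fin ray" share most of their ancestors.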


Table 1. Examples of Entity-Quality annotations to phenotypes seen in this study. (A) illustrates a simple EQ annotation; (B) an EQ annotation in which the quality term relates two entities to each other; and (C) an entity that does not correspond to a term in an existing ontology but is instead a complex logical expression (“post-composed”) from multiple ontology terms.

| | Character:State | Entity | Quality | Related Entity |
|---|---|---|---|---|
| A | dorsal-fin rays: unbranched | UBERON: dorsal fin lepidotrichium | PATO: unbranched | |
| B | nasal-prefrontal contact: present | UBERON: nasal bone | PATO: in contact with | UBERON: prefrontal bone |
| C | lateral pelvic glands: absent in males | UBERON: gland and (part_of some (BSPO:lateral region and (part_of some UBERON:pelvis and (part_of some UBERON:male organism)))) | PATO: absent | |