What is a Knowledge Graph?


Google introduced its Knowledge Graph project in 2012 and has used it to improve query result relevancy and its overall search experience. Google has leveraged existing knowledge graphs, such as DBpedia and Freebase, and has also opened up the process of contributing to the graph by ingesting RDFa and microdata formats from the Web pages it indexes, based on published vocabularies. The success of the Google Knowledge Graph, and its use of semantic technologies, has led to a resurgence in the use of the term in semantic research to describe similar projects. However, the term “knowledge graph” remains underspecified and, in many cases, simply refers to any directed labeled graph. We surveyed and synthesized current literature on knowledge graphs and the historical use of the term. The pre-Semantic Web conceptualization of knowledge graphs provides us with guidance as to what might currently “count” as a knowledge graph and also describes capabilities that do not yet exist in current knowledge graphs. From this synthesis, we propose an updated definition along with a set of knowledge graph requirements. We include an implicit requirement: that knowledge graphs represent knowledge, as opposed to bare assertions with no justification or provenance. We discuss how knowledge graphs as defined are a crucial component of the future of the Web and have great potential for transformational change in data science and domain sciences.


Knowledge graphs provide an opportunity to expand our understanding of how knowledge can be managed on the Web and how that knowledge can be distinguished from more conventional Web-based data publication schemes such as Linked Data (Bizer 2009). In recent years, knowledge graphs have grown increasingly prominent through commercial and research applications on the Web. Google was one of the first to promote a semantic metadata organizational model described as a “knowledge graph” (Singhal 2012), and many other organizations have since used the term in the literature and in less formal communication. Our purpose with this paper is to provide an explicit description of the evolving notion of a knowledge graph, and further to lay out a potential impact spectrum. We review recent formal definitions of knowledge graphs, knowledge graph analysis and construction algorithms, and popular commercial and research knowledge graphs in the literature. These new knowledge graphs do not strictly adhere to original knowledge graph theory (van de Riet 1992), but instead follow a looser, more flexible definition. We present a more descriptive view of current, practical knowledge graphs, and discuss their potential for evolution and impact.

Knowledge Graphs in Practice

Rospocher et al. present knowledge graphs as collections of facts about entities, typically derived from structured data sources such as Freebase (Rospocher 2016). They cite a dearth of event representations in current knowledge graphs as a shortcoming, one that limits knowledge graphs to encyclopedic items such as birth and death dates, and attribute it primarily to the difficulty of obtaining temporal data about entities in a structured manner. Recent surveys such as those by Hogenboom et al. (Hogenboom 2016) and Deng et al. (Deng 2015) provide overviews of numerous methods for event extraction from a variety of sources, including social media, news, academic publications, and even images and video, indicating that there is great interest in finding ways to interpret and include such temporal data in a more structured format. Another review, by Nickel et al., explores machine learning methods for knowledge graphs but limits its definition to directed labeled graphs, with the option of pre-defining the schema. The authors also review, but do not take a position on, the use of the closed versus open world assumptions.
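As a minimal sketch of the directed labeled graph view described above, the following Python snippet stores facts as (subject, predicate, object) triples, optionally checks each predicate against a pre-defined schema, and contrasts the closed and open world readings of a fact that has not been asserted. The entity names, the ALLOWED_PREDICATES set, and both helper functions are hypothetical illustrations and are not taken from any of the cited systems.

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical optional schema: the only edge labels the graph accepts.
ALLOWED_PREDICATES: Set[str] = {"bornIn", "diedIn", "memberOf"}


def add_triple(graph: Set[Triple], triple: Triple,
               schema: Optional[Set[str]] = None) -> None:
    """Add a directed labeled edge; enforce the schema only if one is given."""
    _, predicate, _ = triple
    if schema is not None and predicate not in schema:
        raise ValueError(f"predicate {predicate!r} is not in the schema")
    graph.add(triple)


def holds(graph: Set[Triple], triple: Triple,
          closed_world: bool) -> Optional[bool]:
    """Return True if the triple is asserted.

    For a missing triple, return False under the closed world assumption
    and None (unknown) under the open world assumption.
    """
    if triple in graph:
        return True
    return False if closed_world else None


graph: Set[Triple] = set()
add_triple(graph, ("alice", "bornIn", "springfield"), ALLOWED_PREDICATES)

missing = ("alice", "diedIn", "springfield")
print(holds(graph, missing, closed_world=True))   # False
print(holds(graph, missing, closed_world=False))  # None
```

Under the closed world reading the absent fact is treated as false, while under the open world reading it is simply unknown, which is why the choice of assumption matters for any method that learns from missing edges.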

van de Riet and Meersman (van de Riet 1992), Stokman and de Vries (Stokman 1988), and Zhang (Zhang 2002) present a formal theory of knowledge graphs as a specialization of semantic networks in which meaning is expressed as structure, statements are unambiguous, and a limited set of relation types is used. These requirements also minimize redundancy within the knowledge graph, which simplifies analytical operations (including reasoning and queries). Popping explores the use of knowledge graphs in network text analysis and the challenges that use faced at the time (Popping 2003). Following Zhang, Popping defines the knowledge graph as a type of semantic network that uses only a few types of relations, but also asserts that additional knowledge may be added to the graph.

More papers to consider: (Dieng 1992) (Vang 2013)

Knowledge Graph Methods

Corby and Zucker present an abstract knowledge graph querying machine they call KGRAM (Corby 2010), but do not define knowledge graphs beyond being labeled directed graphs. Their work appears to be an abstraction of graph query methods, and it discusses how KGRAM is a generalization and extension of the RDF graph query language SPARQL (Harris 2013). Wang et al. (Wang 2014) discuss projecting generalized knowledge graphs into hyperplanes, but also focus only on the labeled directed graph requirement of knowledge graphs. Pujara et al. use probabilistic soft logic (PSL) to manage uncertainty in knowledge graphs that have been extracted from uncertain sources (Pujara 2013). They argue that many current knowledge graphs do not always clearly identify entities, relying instead on labels that can differ due to spelling variations. Their task of “knowledge graph identification” has the goal of identifying a set of true assertions from noisy extractions. They do not claim to manage the provenance of the resulting knowledge graph assertions, however. Lin et al. attempt link prediction for automated knowledge graph construction but rely only on a directed labeled graph model of knowledge graphs (Lin 2015). Hakkani-Tur et al. use statistical language understanding to pose structured questions against the Freebase knowledge graph, focusing on improving relation detection in the queries (Hakkani-Tur 2013). Benedek et al. have presented a collaborative knowledge graph construction tool called “Conceptipedia”, building on their “WikiNizer” project. This project uses visual mind mapping techniques and concept similarity analysis to suggest cross-knowledge graph mappings between collaborators. Weiderman and Kritzinger refer to knowledge graphs as a synonym for concept maps, but do not expand further on the topic, nor do they cite any work in knowledge graphs.
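To make the hyperplane projection discussed by Wang et al. (Wang 2014) more concrete, the sketch below assumes the commonly cited TransH formulation: each relation r has a unit normal vector w_r and a translation vector d_r, entities are projected onto the hyperplane defined by w_r, and a triple is scored by the squared distance between the projected head plus d_r and the projected tail. The vectors are made-up values and the function names are hypothetical; this is an illustration of the scoring idea, not the authors' implementation.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project(v: Vector, w: Vector) -> Vector:
    """Project v onto the hyperplane with unit normal w: v - (w . v) w."""
    scale = dot(w, v)
    return [x - scale * n for x, n in zip(v, w)]


def transh_score(h: Vector, t: Vector, w_r: Vector, d_r: Vector) -> float:
    """Squared distance ||h_perp + d_r - t_perp||^2; lower is more plausible."""
    h_p = project(h, w_r)
    t_p = project(t, w_r)
    return sum((hp + d - tp) ** 2 for hp, d, tp in zip(h_p, d_r, t_p))


# Made-up relation parameters: hyperplane normal and in-plane translation.
w_r = [0.0, 0.0, 1.0]
d_r = [1.0, 0.0, 0.0]

head = [0.0, 0.0, 5.0]
good_tail = [1.0, 0.0, -3.0]  # matches head + d_r once both are projected
bad_tail = [0.0, 2.0, 0.0]

print(transh_score(head, good_tail, w_r, d_r))  # 0.0
print(transh_score(head, bad_tail, w_r, d_r))   # 5.0
```

Because the component along w_r is discarded before scoring, the same entity can take different positions for different relations. Note that the model still operates only on the (head, relation, tail) structure of a directed labeled graph.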

Academic Knowledge Graphs

The Gene Ontology (GO) may be considered more of a knowledge graph than an ontology (Ashburner 2000). It embodies a hierarchy of biological processes, cellular locations, and molecular functions into which a number of genes and proteins have been classified or annotated. These annotations have been curated by domain experts, and the provenance of each is recorded using a GO-specific provenance encoding. YAGO (Yet Another Great Ontology) (Suchanek 2007) and YAGO2 (Hoffart 2013) are considered by some researchers to be knowledge graphs, although each originated as a large, general-purpose ontology. While they aggregate knowledge from many sources, there are no published descriptions of whether or how provenance is tracked in YAGO and YAGO2.

The XLore system claims to be a fully bilingual (Chinese and English) knowledge graph that focuses on extracting subClassOf and instanceOf relations from free text (Wang 2013). SEKI@home is a crowd-sourced knowledge graph that aggregates from multiple sources (Steiner 2012), maintaining entity-level provenance using the PROV Ontology (Moreau 2015). This project also incorporates real-time matching against news articles (Steiner 2012a). The Knowledge Vault handles the knowledge graph uncertainty that results from automated fact extraction from Web pages (Dong 2014). DBpedia is a large-scale transformation of Wikipedia into a knowledge graph (Bizer 2009). It uses a mostly fixed schema and records which Wikipedia pages each entity was derived from. A number of biomedical knowledge graphs have been constructed from public databases, including Bio2RDF (Callahan 2013), Neurocommons (Ruttenberg 2009), and LinkedLifeData (Momtchev 2009). All three knowledge graphs provide dataset-level provenance.
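The difference between entity-level and dataset-level provenance noted above can be sketched in a few lines. The snippet below is an illustrative data model only: the Assertion class, its fields, the DATASET_SOURCES table, and the example.org identifiers are all hypothetical, and the code does not reproduce the PROV Ontology or the storage model of any cited system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Assertion:
    subject: str
    predicate: str
    obj: str
    dataset: str                  # dataset the assertion was loaded from
    source: Optional[str] = None  # fine-grained provenance, if recorded


# Dataset-level provenance: one record shared by every assertion in a dataset.
DATASET_SOURCES: Dict[str, str] = {
    "example-db": "http://example.org/dumps/example-db",
}


def provenance_of(a: Assertion) -> str:
    """Prefer the fine-grained source; fall back to the dataset-level record."""
    if a.source is not None:
        return a.source
    return DATASET_SOURCES[a.dataset]


facts: List[Assertion] = [
    # Carries its own source, so it can be traced to a specific page.
    Assertion("ex:GeneA", "ex:locatedIn", "ex:Nucleus", "example-db",
              source="http://example.org/pages/GeneA"),
    # Carries no source of its own, so only the dataset is known.
    Assertion("ex:GeneB", "ex:locatedIn", "ex:Membrane", "example-db"),
]

for fact in facts:
    print(fact.subject, "<-", provenance_of(fact))
```

With only dataset-level records, every assertion from a dataset resolves to the same origin, so a questionable fact cannot be traced to the specific document or record that justified it.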

Commercial Knowledge Graphs

Freebase is a knowledge graph of over 3B facts and 58M topics that is open to public access and curation (Bollacker 2008) and formed the basis for the Google Knowledge Graph, which augmented Freebase with knowledge gleaned from Google’s regular search engine crawls of the Web (Singhal 2012). Monteiro and Moura (Drumond Monteiro 2014) presen