# What is a Knowledge Graph?

# Abstract

Google introduced its Knowledge Graph project in 2012 and has used it to improve query result relevancy and its overall search experience. Google has leveraged existing knowledge graphs, such as DBpedia and Freebase, and has also opened up the process of contributing to the graph by ingesting RDFa and microdata formats from the Web pages it indexes, based on the vocabularies published by schema.org. The success of the Google Knowledge Graph, and its use of semantic technologies, has led to a resurgence in the use of the term in semantic research to describe similar projects. However, the term “knowledge graph” remains underspecified, and in many cases simply refers to any directed labeled graph. We surveyed and synthesized current literature on knowledge graphs and the historical use of the term. The pre-Semantic Web conceptualization of knowledge graphs provides us with guidance as to what might currently “count” as a knowledge graph and also describes capabilities that do not yet exist in current knowledge graphs. From this synthesis, we propose an updated definition along with a set of knowledge graph requirements. We include an implicit requirement: that knowledge graphs represent knowledge, as opposed to bare assertions with no justification or provenance. We discuss how knowledge graphs as defined are a crucial component of the future of the Web and have great potential for transformational change in data science and domain sciences.

# Introduction

Knowledge graphs provide an opportunity to expand our understanding of how knowledge can be managed on the Web and how that knowledge can be distinguished from more conventional Web-based data publication schemes such as Linked Data (Bizer 2009). In recent years knowledge graphs have grown increasingly prominent through commercial and research applications on the Web. Google was one of the first to promote a semantic metadata organizational model described as a “knowledge graph” (Singhal 2012), and many other organizations have since used the term in the literature and in less formal communication. Our purpose with this paper is to provide an explicit description of the evolving notion of a knowledge graph, and further to lay out a potential impact spectrum. We review recent formal definitions of knowledge graphs, knowledge graph analysis and construction algorithms, and popular commercial and research knowledge graphs in the literature. These new knowledge graphs do not strictly adhere to the original knowledge graph theory (van de Riet 1992), but instead follow a looser, more flexible definition. We present a more descriptive view of current, practical knowledge graphs, and discuss their potential for evolution and impact.

# Knowledge Graphs in Practice

Rospocher et al. present knowledge graphs as collections of facts about entities, typically derived from structured data sources such as Freebase and (Rospocher 2016). They cite a dearth of event representations in current knowledge graphs as a shortcoming that limits knowledge graphs to encyclopedic items such as birth and death dates, and attribute it primarily to the difficulty of obtaining temporal data about entities in a structured manner. Recent surveys such as those by Hogenboom et al. (Hogenboom 2016) and Deng et al. (Deng 2015) provide overviews of numerous methods for event extraction from a variety of sources including social media, news, academic publications, and even images and video, indicating that there is great interest in finding ways to interpret and include such temporal data in a more structured format. Another review, by Nickel et al., explores machine learning methods for knowledge graphs but limits its definition to directed labeled graphs, with the option of pre-defining the schema. The authors also review, but do not take a position on, the use of the closed versus open world assumptions.
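
The directed labeled graph view and the closed versus open world distinction mentioned above can be illustrated with a minimal sketch. It is not taken from any of the cited papers: the triple encoding, the `holds` function, and the sample entities are assumptions made purely for illustration.

```python
from enum import Enum


class Answer(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


# A directed labeled graph as a set of (source node, edge label, target node) triples.
# The entities and labels are hypothetical illustration data.
GRAPH = {
    ("Marie_Curie", "bornIn", "Warsaw"),
    ("Marie_Curie", "birthDate", "1867-11-07"),
}


def holds(graph, triple, closed_world):
    """Answer whether `triple` holds in `graph` under the chosen assumption."""
    if triple in graph:
        return Answer.TRUE
    # A missing edge is false under the closed world assumption,
    # but merely unknown under the open world assumption.
    return Answer.FALSE if closed_world else Answer.UNKNOWN


query = ("Marie_Curie", "diedIn", "Passy")
print(holds(GRAPH, query, closed_world=True))
print(holds(GRAPH, query, closed_world=False))
```

Under this reading, the same graph yields different answers for an absent edge, which is why the choice of assumption matters for any method that treats missing facts as negative examples.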

van de Riet and Meersman (van de Riet 1992), Stokman and de Vries (Stokman 1988), and Zhang (Zhang 2002) present a formal theory of knowledge graphs as a specialization of semantic networks in which meaning is expressed as structure, statements are unambiguous, and a limited set of relation types is used. These requirements also minimize redundancy within the knowledge graph, which simplifies analytical operations (including reasoning and queries). Popping explores the use of knowledge graphs in network text analysis and the challenges they faced at the time (Popping 2003). Following Zhang, Popping defines the knowledge graph as a type of semantic network that uses only a few types of relations, but also asserts that additional knowledge may be added to the graph.

More papers to consider: (Dieng 1992) (Vang 2013)

## Knowledge Graph Methods

Corby and Zucker present an abstract knowledge graph querying machine they call KGRAM (Corby 2010), but do not define knowledge graphs beyond being labeled directed graphs. Their work appears to be an abstraction of graph query methods, and they discuss how KGRAM generalizes and extends the RDF graph query language SPARQL (Harris 2013). Wang et al. (Wang 2014) discuss projecting generalized knowledge graphs into hyperplanes, but also focus only on the labeled directed graph requirement of knowledge graphs. Pujara et al. use probabilistic soft logic (PSL) to manage uncertainty in knowledge graphs that have been extracted from uncertain sources (Pujara 2013). They argue that many current knowledge graphs do not always clearly identify entities, relying instead on labels that can differ due to spelling variations. Their task of “knowledge graph identification” has the goal of identifying a set of true assertions from noisy extractions. They do not claim to manage the provenance of the resulting knowledge graph assertions, however. Lin et al. attempt link prediction for automated knowledge graph construction but rely only on a directed labeled graph model of knowledge graphs (Lin 2015). Hakkani-Tur et al. use statistical language understanding to pose structured questions against the Freebase knowledge graph, focusing on improving relation detection in the queries (Hakkani-Tur 2013). Benedek et al. have presented a collaborative knowledge graph construction tool called “Conceptipedia”, building on their “WikiNizer” project. This project uses visual mind mapping techniques and concept similarity analysis to suggest cross-knowledge graph mappings between collaborators. Weiderman and Kritzinger refer to knowledge graphs as a synonym for concept maps, but do not expand further on the topic, nor do they cite any work in knowledge graphs.
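
The hyperplane projection discussed by Wang et al. (Wang 2014) can be illustrated with a small numeric sketch. It assumes the commonly described form of the model, in which head and tail entity vectors are projected onto a relation-specific hyperplane and then related by a translation vector lying in that hyperplane. The function names and the hand-picked toy vectors are assumptions for illustration; real embeddings would be learned from the graph.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_onto_hyperplane(vec, unit_normal):
    """Remove the component of `vec` that lies along `unit_normal`."""
    scale = dot(vec, unit_normal)
    return [x - scale * n for x, n in zip(vec, unit_normal)]


def hyperplane_translation_score(head, tail, normal, translation):
    """Squared distance between (projected head + translation) and projected tail.

    Lower scores mean the triple (head, relation, tail) is more plausible.
    """
    length = math.sqrt(dot(normal, normal))
    unit_normal = [n / length for n in normal]
    h = project_onto_hyperplane(head, unit_normal)
    t = project_onto_hyperplane(tail, unit_normal)
    return sum((hi + di - ti) ** 2 for hi, di, ti in zip(h, translation, t))


# Toy relation: hyperplane normal along the third axis, translation along the second.
normal = [0.0, 0.0, 2.0]
translation = [0.0, 1.0, 0.0]
head = [1.0, 0.0, 2.0]
good_tail = [1.0, 1.0, 5.0]   # differs from head only by the translation and along the normal
bad_tail = [3.0, 1.0, 5.0]    # also differs within the hyperplane

print(hyperplane_translation_score(head, good_tail, normal, translation))
print(hyperplane_translation_score(head, bad_tail, normal, translation))
```

Note that the score depends only on the labeled edges of the graph, which is consistent with the observation above that this line of work relies solely on the directed labeled graph model.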

The Gene Ontology (GO) may be considered more of a knowledge graph than an ontology (Ashburner 2000). It embodies a hierarchy of biological processes, cellular locations, and molecular functions into which a number of genes and proteins have been classified or annotated. These annotations have been curated by domain experts, and the provenance of each is recorded using a GO-specific provenance encoding. YAGO (Yet Another Great Ontology) (Suchanek 2007) and YAGO2 (Hoffart 2013) are considered by some researchers to be knowledge graphs, although each originated as a large, general-purpose ontology. While they aggregate knowledge from many sources, there are no published descriptions of whether or how provenance is tracked in YAGO and YAGO2.

The XLore system claims to be a fully bilingual (Chinese and English) knowledge graph that focuses on extracting subClassOf and instanceOf relations from free text (Wang 2013). SEKI@home is a crowd-sourced knowledge graph that aggregates from multiple sources (Steiner 2012), maintaining entity-level provenance using the PROV Ontology (Moreau 2015). This project also incorporates real-time matching against news articles (Steiner 2012a). The Knowledge Vault handles knowledge graph uncertainty as a result of automated fact extraction from Web pages (Dong 2014). DBpedia is a large-scale transformation of Wikipedia into a knowledge graph (Bizer 2009a). It uses a mostly fixed schema and records the Wikipedia pages from which each entity was derived as provenance. A number of biomedical knowledge graphs have been constructed from public databases, including Bio2RDF (Callahan 2013), Neurocommons (Ruttenberg 2009), and LinkedLifeData (Momtchev 2009). All three knowledge graphs provide dataset-level provenance.

## Commercial Knowledge Graphs

Freebase is a knowledge graph of over 3B facts and 58M topics¹ that is open to public access and curation (Bollacker 2008). It formed the basis for the Google Knowledge Graph, which augmented Freebase with knowledge gleaned from Google’s regular search engine crawls of the Web (Singhal 2012). Monteiro and Moura (Drumond Monteiro 2014) present a thoughtful analysis of the role of the Google Knowledge Graph as a realization of the Semantic Web vision as Web 4.0, and show how it merges rule-oriented semantic analysis with statistical predictive approaches. Microsoft has also introduced a knowledge graph called “Satori” to enhance Bing search results (Qian 2013).

1. Freebase.com web site, April 2016

# A Definition of “Knowledge Graph”

The knowledge graph platforms reviewed in this paper do not strictly adhere to the definition of knowledge graph set out by Stokman and de Vries (Stokman 1988) and Zhang (Zhang 2002). Since usage has evolved, it is appropriate to develop a definition that follows how the term is currently used. Implicit in the name “knowledge graph” is, of course, that a knowledge graph represents knowledge, and does so using a graph structure. Stokman and de Vries (Stokman 1988) and Zhang (Zhang 2002) posit useful definitions and requirements for knowledge graphs as a starting point:

• Knowledge graph meaning is expressed as structure.

• Knowledge graph statements are unambiguous.

• Knowledge graphs use a limited set of relation types.

In order for knowledge graph statements to be unambiguous, they need to be composed of unambiguous units.

• All identified entities in a knowledge graph, including types and relations, must be identified using global identifiers with unambiguous denotation.

One example of this kind of identifier is the Uniform Resource Identifier (URI) as used in the Resource Description Framework (RDF) (Cyganiak 2014). While the use of “limited set of relation types” addressed a specific set of non-decomposable relations above, in the context of an open world knowledge system this should be taken to mean a core set of relations and classes that subsume or can be used to compose any other key relations and classes. This seems to be the case generally, as the reviewed knowledge graphs all attempt to build from a common vocabulary.
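
The global identifier requirement can be sketched as a simple check over a set of statements. This is an illustrative sketch rather than a prescribed validation method: the `Statement` type, the `http://example.org/` namespace, and the rule that every term must parse as an absolute URI are assumptions, and the sketch ignores literal values, which RDF permits in the object position. A syntactically valid URI also does not by itself guarantee unambiguous denotation; it only makes a single global referent possible.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

EX = "http://example.org/"  # hypothetical namespace used only for illustration


@dataclass(frozen=True)
class Statement:
    subject: str
    relation: str
    target: str


def is_absolute_uri(term):
    """Accept terms that parse as an absolute http(s) or urn identifier."""
    parsed = urlparse(term)
    return parsed.scheme in {"http", "https", "urn"} and bool(parsed.netloc or parsed.path)


def unidentified_terms(statements):
    """Collect every entity, type, or relation that is only a bare label."""
    terms = set()
    for s in statements:
        terms.update((s.subject, s.relation, s.target))
    return {t for t in terms if not is_absolute_uri(t)}


statements = [
    Statement(EX + "EiffelTower", EX + "locatedIn", EX + "Paris"),
    # Bare labels: "Paris" could denote many different places or people.
    Statement("Paris", "located in", "France"),
]

print(sorted(unidentified_terms(statements)))
```

Only the second statement is flagged, since its subject, relation, and target are labels rather than identifiers.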

In practice, the knowledge graph literature and the practical knowledge graphs we reviewed either aggregate knowledge from many secondary sources, using Natural Language Processing (NLP) extraction when the sources are unstructured text, or use a semantic Extraction, Transformation, and Load (ETL) process from structured databases (McCusker 2009). Some knowledge graphs rely on crowdsourcing of their information (including the Google Knowledge Graph), a form of distributed curation. At no point do we see a case where the knowledge does not have a theoretical, citable source or some other recorded justification. Since knowledge graphs nominally represent knowledge, we argue that some criteria for inclusion of content and its provenance should be encoded in the graph. This is especially true for knowledge graphs gathered from other sources, as the sources themselves must have some justification for publishing their assertions.

• Knowledge graphs must include explicit provenance.

In many cases, the justification for inclusion of assertions appeals to authority, through the citation of the resource the knowledge was extracted from. Authority, at least in scientific research, is only a shortcut for validating knowledge, and good knowledge graphs should encode as much justification for their assertions as they can. We consider graphs without provenance concerning attribution or justification to be bare statement graphs. Bare statement graphs are not true knowledge graphs, since they do not provide a way to confirm that assertions are justified or are even believed by their originators; this is a minimal (but not sufficient (Gettier 1963)) criterion for “knowledge” in a knowledge graph.
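
The distinction between a knowledge graph and a bare statement graph can be sketched as a check on each assertion. The field names and the example URLs below are hypothetical, and the sketch adopts one strict reading as an assumption: a graph counts as a bare statement graph as soon as any assertion carries neither attribution nor justification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Assertion:
    subject: str
    relation: str
    target: str
    source: Optional[str] = None         # attribution: where the statement came from
    justification: Optional[str] = None  # why it is believed, e.g. a method or evidence note


def unsupported(assertions):
    """Return the assertions that carry neither attribution nor justification."""
    return [a for a in assertions if not (a.source or a.justification)]


def is_bare_statement_graph(assertions):
    """Strict reading (an assumption): one unsupported assertion is enough."""
    return len(unsupported(assertions)) > 0


graph = [
    Assertion("ex:geneA", "ex:participatesIn", "ex:processB",
              source="http://example.org/curated-database",
              justification="manual curation by a domain expert"),
    Assertion("ex:geneA", "ex:locatedIn", "ex:nucleus"),  # no provenance recorded
]

print(is_bare_statement_graph(graph))
print(unsupported(graph))
```

A looser policy, such as requiring provenance only at the dataset level, would change `unsupported` rather than the overall shape of the check.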

• Knowledge graphs may include uncertainty assessments.

Some knowledge graphs go further in modeling knowledge by providing uncertainty assessments of the knowledge asserted (Dong 2014). This can be useful when dealing with scientific knowledge graphs, where competing hypotheses and theories are known to be true to certain degrees, which may change as new evidence comes to light.
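
One way to picture uncertainty assessments is to attach a confidence value to each assertion and keep competing hypotheses side by side. The sketch below is an assumption for illustration only: the `WeightedAssertion` type, the use of a single number in [0, 1], and the example values are not drawn from the Knowledge Vault or any other cited system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightedAssertion:
    subject: str
    relation: str
    target: str
    confidence: float  # assumed to be a degree of belief in [0, 1]

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


def competing_hypotheses(assertions, subject, relation):
    """Return every asserted target for (subject, relation), most confident first."""
    matches = [a for a in assertions if a.subject == subject and a.relation == relation]
    return sorted(matches, key=lambda a: a.confidence, reverse=True)


# Hypothetical competing hypotheses about the same question.
graph = [
    WeightedAssertion("ex:diseaseX", "ex:causedBy", "ex:geneA", 0.6),
    WeightedAssertion("ex:diseaseX", "ex:causedBy", "ex:geneB", 0.3),
    WeightedAssertion("ex:diseaseX", "ex:treatedBy", "ex:drugC", 0.9),
]

for hypothesis in competing_hypotheses(graph, "ex:diseaseX", "ex:causedBy"):
    print(hypothesis.target, hypothesis.confidence)
```

Because the assertions are immutable, new evidence would be recorded by adding a revised assertion rather than overwriting the old one, which keeps the history of belief available.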

# Future Potential

In the literature, knowledge graphs are not (usually) distinguished from bare statement graphs, in that they do not encode or publish the epistemology¹ of the knowledge asserted in the graph. We see this as troubling because it does not privilege knowledge: in most existing knowledge graphs, supported and unsupported assertions are given equal weight. Moving forward, there is an opportunity to leverage existing vocabularies, including the Provenance Ontology (PROV-O) (Moreau 2015) and the Nanopublications Framework (Groth 2010), to improve the clarity and utility of knowledge graphs. A nanopublication is a set of RDF graphs: an assertion graph (the knowledge), a provenance graph (the justification), and an attribution graph (the believer). While justified true belief is not sufficient for knowledge, most other proposals, including a causal linkage between the justification, assertion, and believer, are well-supported within provenance vocabularies. Added to a knowledge graph, the provenance graph can expand to provide room for whatever epistemic criteria are desired.
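
The three-graph structure of a nanopublication described above can be sketched with plain Python data structures. This is a simplified illustration, not an implementation of the Nanopublications Framework: the `Nanopublication` class, the `#assertion`-style graph names, and the `http://example.org/` URIs are assumptions. The two PROV-O properties used, `prov:wasDerivedFrom` and `prov:wasAttributedTo`, are real terms, but applying them to the assertion graph in this way is one possible modeling choice.

```python
from dataclasses import dataclass, field

PROV = "http://www.w3.org/ns/prov#"
EX = "http://example.org/"  # hypothetical namespace used only for illustration


@dataclass
class Nanopublication:
    uri: str
    assertion: list = field(default_factory=list)    # the knowledge
    provenance: list = field(default_factory=list)   # the justification
    attribution: list = field(default_factory=list)  # the believer

    def graph_uri(self, part):
        return f"{self.uri}#{part}"

    def quads(self):
        """Yield (subject, predicate, object, graph) for all three graphs."""
        for part in ("assertion", "provenance", "attribution"):
            for s, p, o in getattr(self, part):
                yield (s, p, o, self.graph_uri(part))


def make_nanopublication(uri, triple, source, believer):
    """Wrap one triple with a justification and a believer."""
    pub = Nanopublication(uri)
    pub.assertion.append(triple)
    assertion_graph = pub.graph_uri("assertion")
    pub.provenance.append((assertion_graph, PROV + "wasDerivedFrom", source))
    pub.attribution.append((assertion_graph, PROV + "wasAttributedTo", believer))
    return pub


pub = make_nanopublication(
    EX + "pub1",
    (EX + "EiffelTower", EX + "locatedIn", EX + "Paris"),
    source=EX + "some-source-document",
    believer=EX + "some-curator",
)

for quad in pub.quads():
    print(quad)
```

Because the provenance and attribution triples point at the assertion graph as a whole, further epistemic criteria can be added to the provenance graph without touching the assertion itself.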

There is an interesting overlap between what is considered a “knowledge graph” and what is considered an ontology. The most commonly accepted definition of an ontology is “an explicit specification of a conceptualization” (Gruber 1993). To a large degree, knowledge graphs conform to this definition, but ontologies generally tend to talk about generalities (classes, properties, and roles) with less focus on content about specific instances. For example, most ontologies that describe world landmarks would include the landmark class and its related properties but would typically not mention the Eiffel Tower, whereas a knowledge graph that covers the domain of Parisian landmarks would. Conversely, knowledge graph approaches can be used to improve the credibility of ontologies by encoding the epistemology of the statements in the ontology.

Ontology vs Knowledge Graph vs Data Graph?

1. Epistemology defines why something is known

# Conclusions

Knowledge graphs are a critical component of the Semantic Web and serve as information hubs for general use as well as for domain-specific applications. Most knowledge graphs seek to aggregate knowledge from third-party sources, whether from external databases, from data aggregated through crawling the Web, or through the application of entity and relationship extraction methods. Knowledge graphs are not simply aggregations of RDF or linked data, but critically provide time-invariant information about entities of general interest. Their structures tend to be focused on a limited set of relations adhering to a coherent knowledge model, setting them apart from the linked data cloud in general, which has usually relied on the open framework of the Semantic Web to accommodate a completely free-form use of vocabularies and ontologies. Although some knowledge graphs track the provenance of their content, rigorous provenance is by no means a universal characteristic. We argue that knowledge graphs should prioritize the epistemology of the knowledge they contain (how we know what we know) and that Nanopublications are a suitable framework in which to do so. Semantically published graphs that do not provide a level of statement epistemology can be considered “Bare Statement” graphs. Since so many knowledge graphs are curated from third parties, and because of the nature of publishing on the Web (Anyone can say Anything about Any subject), it will become critical to avoid the use of such “Bare Statement” graphs as knowledge graphs increase in popularity.

# References

1. Christian Bizer, Tom Heath, Tim Berners-Lee. Linked data-the story so far. Semantic Services, Interoperability and Web Applications: Emerging Concepts 205–227 (2009).

2. Amit Singhal. Introducing the knowledge graph: things, not strings. Official Google Blog, May (2012).

3. RP van de Riet, RA Meersman. Knowledge Graphs. 97 In Linguistic Instruments in Knowledge Engineering: Proceedings of the 1991 Workshop on Linguistic Instruments in Knowledge Engineering, Tilburg, the Netherlands, 17-18 January 1991. (1992).

4. Marco Rospocher, Marieke van Erp, Piek Vossen, Antske Fokkens, Itziar Aldabe, German Rigau, Aitor Soroa, Thomas Ploeger, Tessel Bogaard. Building event-centric knowledge graphs from news. Web Semantics: Science, Services and Agents on the World Wide Web (2016).

5. Frederik Hogenboom, Flavius Frasincar, Uzay Kaymak, Franciska de Jong, Emiel Caron. A Survey of event extraction methods from text for decision support systems. Decision Support Systems (2016).

6. J. Deng, F. Qiao, H. Li, X. Zhang, H. Wang. An Overview of Event Extraction from Twitter. 251–256 In Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), 2015 International Conference on. (2015).

7. Frans N. Stokman, Pieter H. de Vries. Structuring Knowledge in a Graph. 186–206 In Human-Computer Interaction. Springer Science + Business Media, 1988.

8. Lei Zhang. Knowledge graph theory and structural parsing. Twente University Press, 2002.

9. R. Popping. Knowledge Graphs and Network Text Analysis. Social Science Information 42, 91–106 SAGE Publications, 2003.

10. Rose Dieng, Alain Giboin, Paul-André Tourtier, Olivier Corby. Knowledge acquisition for explainable multi-expert, knowledge-based design systems. 298–317 In Current Developments in Knowledge Acquisition EKAW 92. Springer Science + Business Media, 1992.

11. Katrine Juel Vang. Ethics of Google’s Knowledge Graph: some considerations. J of Inf Com & Eth in Society 11, 245–260 Emerald, 2013.

12. Olivier Corby, Catherine Faron Zucker. The KGRAM Abstract Machine for Knowledge Graph Querying. In 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. Institute of Electrical & Electronics Engineers (IEEE), 2010.

13. Steve Harris, Andy Seaborne, Eric Prud’hommeaux. SPARQL 1.1 query language. W3C Recommendation 21 (2013).

14. Zhen Wang, Jianwen Zhang, Jianlin Feng, Zheng Chen. Knowledge Graph Embedding by Translating on Hyperplanes. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence. (2014).

15. Jay Pujara, Hui Miao, Lise Getoor, William Cohen. Knowledge Graph Identification. 542–557 In Lecture Notes in Computer Science. Springer Science + Business Media, 2013.

16. Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, Xuan Zhu. Learning Entity and Relation Embeddings for Knowledge Graph Completion. 2181–2187 In AAAI. (2015).

17. Dilek Hakkani-Tur, Larry Heck, Gokhan Tur. Using a knowledge graph and query click logs for unsupervised learning of relation detection. In 2013 IEEE International Conference on Acoustics Speech and Signal Processing. Institute of Electrical & Electronics Engineers (IEEE), 2013.

18. Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald, Gerald M. Rubin, Gavin Sherlock. Gene Ontology: tool for the unification of biology. Nature Genetics 25, 25–29 Nature Publishing Group, 2000.

19. Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum. Yago. In Proceedings of the 16th international conference on World Wide Web - WWW 07. Association for Computing Machinery (ACM), 2007.

20. Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, Gerhard Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence 194, 28–61 Elsevier BV, 2013.

21. Zhigang Wang, Juanzi Li, Zhichun Wang, Shuangjie Li, Mingyang Li, Dongsheng Zhang, Yao Shi, Yongbin Liu, Peng Zhang, Jie Tang. Xlore: A large-scale english-chinese bilingual knowledge graph. 121–124 In Proceedings of the 2013th International Conference on Posters & Demonstrations Track-Volume 1035. (2013).

22. Thomas Steiner, Stefan Mirea. SEKI@ home, or Crowdsourcing an Open Knowledge Graph. 7 In Proceedings of the First International Workshop on Knowledge Extraction and Consolidation from Social Media (KECSM2012). (2012).

23. Luc Moreau, Paul Groth, James Cheney, Timothy Lebo, Simon Miles. The rationale of PROV. Web Semantics: Science Services and Agents on the World Wide Web 35, 235–257 Elsevier BV, 2015.

24. Thomas Steiner, Ruben Verborgh, Raphaël Troncy, Joaquim Gabarró Vallés, Rik Van de Walle. Adding Realtime Coverage to the Google Knowledge Graph. In Poster and Demo Proceedings of the 11th International Semantic Web Conference. (2012).

25. Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, Wei Zhang. Knowledge vault. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD 14. Association for Computing Machinery (ACM), 2014.

26. Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, Sebastian Hellmann. DBpedia - A crystallization point for the Web of Data. Web Semantics: Science Services and Agents on the World Wide Web 7, 154–165 Elsevier BV, 2009.

27. Alison Callahan, José Cruz-Toledo, Peter Ansell, Michel Dumontier. Bio2RDF Release 2: Improved Coverage Interoperability and Provenance of Life Science Linked Data. 200–212 In The Semantic Web: Semantics and Big Data. Springer Science + Business Media, 2013.

28. A. Ruttenberg, J. A. Rees, M. Samwald, M. S. Marshall. Life sciences on the Semantic Web: the Neurocommons and beyond. Briefings in Bioinformatics 10, 193–204 Oxford University Press (OUP), 2009.

29. Vassil Momtchev, Deyan Peychev, Todor Primov, Georgi Georgiev. Expanding the pathway and interaction knowledge in linked life data. In Proc. of International Semantic Web Challenge. (2009).

30. Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, Jamie Taylor. Freebase. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data - SIGMOD 08. Association for Computing Machinery (ACM), 2008.

31. Silvana Drumond Monteiro, Maria Aparecida Moura. Knowledge Graph and Semantization in Cyberspace: A Study of Contemporary Indexes. Knowledge Organization 41, 429–439 (2014).

32. R Qian. Understand Your World with Bing, bing search blog. (2013).

33. Richard Cyganiak, David Wood, Markus Lanthaler. RDF 1.1 concepts and abstract syntax. W3C Recommendation. Feb (2014).

34. James P McCusker, Joshua A Phillips, Alejandra Beltrán, Anthony Finkelstein, Michael Krauthammer. Semantic web data warehousing for caGrid. BMC Bioinformatics 10, S2 Springer Science + Business Media, 2009.

35. E. L. Gettier. Is Justified True Belief Knowledge? Analysis 23, 121–123 Oxford University Press (OUP), 1963.

36. Paul Groth, Andrew Gibson, Jan Velterop. The anatomy of a nanopublication. Information Services and Use 30, 51–56 IOS Press, 2010.

37. Thomas R. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition 5, 199–220 Elsevier BV, 1993.