ConceptNet 5.5: An Open Multilingual Knowledge Graph about Natural Language

Abstract

ConceptNet is a knowledge graph that connects words and phrases of natural language with labeled edges. It is designed to improve natural language applications by allowing the application to better understand the meanings behind the words people use. Version 5.5 extends its representation to include word forms in many languages. ConceptNet provides applications with understanding that they would not acquire from distributional semantics (such as word2vec) alone, nor from narrower resources such as WordNet or DBPedia. We demonstrate this with state-of-the-art results on intrinsic evaluations (word relatedness and analogies) that translate into improvements on an extrinsic evaluation in story understanding (the Story Cloze Test).

Introduction

ConceptNet is a knowledge graph that connects words and phrases of natural language (terms) with labeled, weighted edges (assertions). The original release of ConceptNet (Liu 2004) was intended as a parsed representation of Open Mind Common Sense (Singh 2002), a crowd-sourced knowledge project. This paper describes the release of ConceptNet 5.5, which has expanded to include lexical and world knowledge from many different sources in many languages.

In this paper, we write an assertion concisely as a triple of its start node, relation label, and end node. For example, the assertion that "a dog has a tail" is represented as (dog, HasA, tail).
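As an illustration of this notation, an assertion could be modeled as a small record. This is a minimal sketch, not ConceptNet's actual data model: the class name `Assertion`, its field names, the `triple` helper, and the default weight of 1.0 are all assumptions introduced here.

```python
from typing import NamedTuple


class Assertion(NamedTuple):
    """One labeled, weighted edge between two terms (hypothetical model)."""

    start: str      # start node, e.g. "dog"
    relation: str   # relation label, e.g. "HasA"
    end: str        # end node, e.g. "tail"
    weight: float = 1.0  # assumed default; the text only says edges are weighted

    def triple(self) -> str:
        """Render the assertion in the (start, relation, end) notation."""
        return f"({self.start}, {self.relation}, {self.end})"


dog_tail = Assertion("dog", "HasA", "tail")
print(dog_tail.triple())  # (dog, HasA, tail)
```

A named tuple keeps the three-part structure explicit while still unpacking like a plain tuple, which suits code that only needs the start node, relation, and end node.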

Structure of ConceptNet

Knowledge sources

ConceptNet 5.5 is built from the following sources:

  • Facts acquired from Open Mind Common Sense (OMCS) (Singh 2002)
  • Information extracted from parsing Wiktionary, in multiple languages, with a custom parser ("Wikiparsec")
  • "Games with a purpose" designed to collect common knowledge, including Verbosity (Ahn 2006) in English, nadya.jp (Nakahara 2011) in Japanese, and the PTT Pet Game (Kuo 2009) in Chinese
  • Open Multilingual WordNet (Bond 2013), a linked-data representation of WordNet (Miller 1995) and its parallel projects in multiple languages, representing word meanings as "synsets" of synonymous word senses
  • JMDict (Breen 2004), a Japanese-multilingual dictionary
  • OpenCyc (Matuszek 2006), a hierarchy of hypernyms provided by Cyc, a system that represents common sense knowledge in predicate logic
  • A subset of DBPedia (Auer 2007), a network of facts extracted from Wikipedia infoboxes

Combined, these sources give ConceptNet over 21 million edges and over 8 million nodes. Its English vocabulary contains approximately 1,500,000 nodes, and there are 83 languages in which it contains at least 10,000 nodes.

The largest source of input for ConceptNet is Wiktionary, which provides 18.1 million edges and accounts for most of ConceptNet's large multilingual vocabulary. However, much of the character of ConceptNet comes from OMCS and the various games with a purpose, which express many different kinds of relations between terms, such as PartOf ("a wheel is part of a car") and UsedFor ("a car is used for driving").
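To make the role of relation labels concrete, the two example edges above can be grouped by relation. This is only a sketch over hand-written triples: the helper `index_by_relation` and the in-memory list are assumptions for illustration, not part of ConceptNet's build process or API.

```python
from collections import defaultdict

# The two example edges from the paragraph above, as (start, relation, end).
edges = [
    ("wheel", "PartOf", "car"),
    ("car", "UsedFor", "driving"),
]


def index_by_relation(triples):
    """Group (start, end) pairs under their relation label (hypothetical helper)."""
    by_relation = defaultdict(list)
    for start, relation, end in triples:
        by_relation[relation].append((start, end))
    return dict(by_relation)


by_relation = index_by_relation(edges)
print(by_relation["PartOf"])   # [('wheel', 'car')]
print(by_relation["UsedFor"])  # [('car', 'driving')]
```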