DBPRO WikiData MathFormula

Abstract

Since 2001, Wikipedia has been providing people around the world with information written by volunteers, making it the first free-as-in-freedom online encyclopedia, available in more than 200 languages. Over time, Wikipedia evolved to distribute this information more efficiently. One of the improvements is Wikidata, a database that unifies and centralizes language-independent information, since the articles about the same topic in different languages are otherwise independent of each other. To offer advanced options for similar kinds of information, such as distance calculation for geographic coordinates, data types were created for Wikidata. There is no data type for mathematical formulae yet, although on the one hand they fulfill the requirements for one and on the other hand their properties do not fit into the already implemented data types. In our project, part of the “DBPRO: Datenbankprojekt” course at the Technical University of Berlin, we define and implement this new data type for mathematical expressions and formulae.

Introduction

The Wikidata project was launched in October 2012 by the Wikimedia Foundation with the goal of centralizing and unifying language-independent information, such as the population of a city, in a well-structured database. Many kinds of information share similar traits across different properties. Therefore, several data types were implemented. Examples are “monolingual text” for language-independent words like names and “time” for birthdays or other specific dates. These data types enable more use cases for the data than raw information storage alone.

Our task was to

  1. get an overview of Wikidata and its technical implementation

  2. describe why it is necessary to implement a data type for mathematical formulae

  3. specify the subproblems, the technical properties of the new data type, and how to create data with the new data type in Wikidata

  4. describe how the new data type improves Wikidata

  5. plan the implementation

  6. implement the new data type

This report gives a brief overview of Wikidata and of our research related to the task of implementing a new data type for math formulae. In section 3, we describe the Wikidata structure, and in section 4, we analyze what kind of information we consider a “math formula”. Afterwards, we discuss the implementation choices we had to consider in section 5, the test environment we set up to check the output of our code in section 6, the implementation in section 7, and its review process in section 8. Finally, we analyze the community feedback in section 9 and the plans for the future in section 10.

Wikidata structure

Wikidata is a key-value database in which the value consists of information or documents. In addition, it enables

  1. linking, to connect Wikipedia articles in different languages about the same topic

  2. info box data, as a summarized collection of information

  3. lists created via database queries, like “Return every item which has a property with ‘P2534’ as data type.”1

For example, each Wikipedia article about Berlin2 has a link to the Wikidata page Berlin3. The Wikidata page of Berlin is referred to as an item. Items have values assigned to properties, which provide detailed additional information depending on the data type; each such assignment is known as a “claim”. Together, these form the key structures of Wikidata. In the following, each piece of the structure is briefly explained.

Item

Items are mappings of real-world objects into Wikidata. With respect to Wikipedia, this means that each Wikipedia article could be represented as an item in Wikidata. Items are unique and are therefore used to link different Wikipedia articles that refer to the same object.

Each item can have multiple properties with values that hold additional information, such as the item London with a number of inhabitants, a capital city, and geographic coordinates on Earth as properties. Items can also be values of properties of other items. For example, London (value) is the capital (property) of England (item).

Property

A property is equivalent to an attribute of an entity. It is used to specify information belonging to an item. Each property has exactly one data type, which defines the structure of the values it supports. The combination of a property and a value in an item is called a “claim”. Claims can be refined by qualifiers and backed by a source. A claim with a reference to a source is called a “statement”. Items and their statements make up the core structure of data management in Wikidata.

The overall relation between all components is shown in Figure 3.

(Figure 3)
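The relations between items, properties, claims, and statements described above can be sketched as a few plain data classes. This is only an illustration of the concepts, not Wikidata's actual storage model: all class and field names are hypothetical, and the source string is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Property:
    """An attribute such as 'capital'; each property has exactly one data type."""
    label: str
    data_type: str


@dataclass
class Claim:
    """A property combined with a value inside an item."""
    prop: Property
    value: object


@dataclass
class Statement:
    """A claim with a reference to a source, optionally refined by qualifiers."""
    claim: Claim
    source: str
    qualifiers: List[Claim] = field(default_factory=list)


@dataclass
class Item:
    """A real-world object, e.g. the subject of a Wikipedia article."""
    label: str
    statements: List[Statement] = field(default_factory=list)


# London (value) is the capital (property) of England (item).
capital = Property(label="capital", data_type="item")
london = Item(label="London")
england = Item(label="England")
england.statements.append(
    Statement(claim=Claim(capital, london), source="hypothetical reference")
)

for st in england.statements:
    print(f"{england.label}: {st.claim.prop.label} = {st.claim.value.label}")
```

Here the value of the claim is itself an item, which mirrors the London and England example.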

Data types

Wikidata allows data to be modeled with primitive or complex data types. For example, the data type “quantity” is made up of several decimals and a unit:

  • amount: the quantity’s main value

  • lowerBound: the quantity’s lower bound (1 by default)

  • upperBound: the quantity’s upper bound (1 by default)

  • unit: unit of measure item (empty for dimensionless values)

For example, besides the yearly costs of Berlin's new airport themselves, the bounds let us add information about their uncertainty, such as “500,000,000 +/- 50,000,000 euros”.
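As a rough sketch, the four components of a quantity could be modeled as follows. The field names are adapted to Python naming, and the `with_tolerance` helper is a hypothetical convenience for this example, not part of Wikidata.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Quantity:
    amount: Decimal             # the quantity's main value
    lower_bound: Decimal        # "lowerBound" in the list above
    upper_bound: Decimal        # "upperBound" in the list above
    unit: Optional[str] = None  # None stands for a dimensionless value

    @classmethod
    def with_tolerance(cls, amount, tolerance, unit=None):
        """Hypothetical helper: build the bounds from a symmetric tolerance."""
        amount = Decimal(amount)
        tolerance = Decimal(tolerance)
        return cls(amount, amount - tolerance, amount + tolerance, unit)


# The airport example: 500,000,000 +/- 50,000,000 euros.
cost = Quantity.with_tolerance("500000000", "50000000", unit="euro")
print(cost.lower_bound, cost.upper_bound)  # 450000000 550000000
```

Using `Decimal` instead of floating-point numbers keeps the bounds exact, which matters for values such as monetary amounts.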

Besides quantity, eight other data types are already implemented in Wikidata. Together with the new mathematical expression data type, this makes ten in total:

String is meant for language-independent sequences of characters. ISBNs and other kinds of codes belong here.

Monolingual text is used for names that stay the same in every language.

URL values are links to external resources, such as official web pages and e-mail addresses.

Time is used for all kinds of dates and their precision. Dates are stored internally in the proleptic Gregorian calendar and can be converted to and displayed in different formats.

Global coordinate values are pairs of numbers representing the degrees of latitude and longitude, or the position relative to a stellar object. Earth is used as the default reference body.

Item is also a data type. Its values refer to other existing items in Wikidata and thereby connect different Wikipedia objects.

Property, similar to Item, refers to other existing properties in Wikidata.

Commons media values are links to files stored in Wikimedia Commons, such as videos or maps.

These nine data types are divided into two kinds, depending on how they are stored internally.

Value type

The first kind of data types are value types, to which string, time, Commons media, and global coordinate belong. They have their own data structures, which allows a high degree of customization and enables more advanced methods. In return, they carry a large code maintenance overhead, especially for new contributors.

Property type

The rest are classified as property types. While they have their own use cases, they are based on, and stored as, already existing data types, namely value types. URL and monolingual text are property types based on the string value type. As they share some methods with their parent data type, property types are easy to maintain. While it is also possible to implement advanced methods for this kind of data type, the difficulty and overhead of doing so are usually higher than when implementing the data type as a value type.
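The difference can be illustrated with a minimal sketch in which a property type reuses the storage of an existing value type. The class, the mapping, and the serialized layout are assumptions made for this example and do not reproduce Wikidata's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringValue:
    """Value type: owns its data structure and storage format."""
    text: str

    def serialize(self) -> dict:
        return {"type": "string", "value": self.text}


# Data type name -> value type used for storage. Per the text above,
# the URL property type is based on the string value type.
BASE_VALUE_TYPE = {
    "string": StringValue,
    "url": StringValue,
}


def store(data_type: str, raw: str) -> dict:
    """Serialize a raw input using the value type behind the data type."""
    value_cls = BASE_VALUE_TYPE[data_type]
    return value_cls(raw).serialize()


print(store("url", "https://www.wikidata.org"))
```

In this sketch a URL is stored exactly like a string, so the serialization code is shared. Any URL-specific behavior would have to be layered on top of it.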

Datatype components

Every data type consists of three components:

  • Parser: transforms the user’s input into a data value

  • Validator: checks the data value against additional constraints

  • Formatter: transforms the data value into a fitting output

Figure 4 displays the functionality of the components for a given data type:

(Figure 4)
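The interplay of the three components can be sketched for a toy data type that holds a TeX string. The function names, the balanced-braces constraint, and the `<math>` wrapping of the output are assumptions chosen for illustration. They are not the checks or output of the actual implementation.

```python
def parse(user_input: str) -> str:
    """Parser: turn the raw user input into a data value."""
    return user_input.strip()


def validate(value: str) -> bool:
    """Validator: toy constraint, non-empty with balanced curly braces.

    Escaped braces such as \\{ are not handled; this is only a sketch.
    """
    if not value:
        return False
    depth = 0
    for ch in value:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # closing brace without a matching opening one
                return False
    return depth == 0


def format_value(value: str) -> str:
    """Formatter: turn the data value into an output representation."""
    return f"<math>{value}</math>"


def process(user_input: str) -> str:
    """Run the input through parser, validator, and formatter in order."""
    value = parse(user_input)
    if not validate(value):
        raise ValueError(f"invalid value: {value!r}")
    return format_value(value)


print(process(r"  \frac{a}{b} "))  # <math>\frac{a}{b}</math>
```

Keeping the three steps as separate functions mirrors the component split: each can be replaced or tested independently of the others.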

Math formulae or mathematical expression

According to Wikipedia, a formula “refers to the general construct of a relationship between given quantities [while a math formula] is an entity constructed using the symbols and formation rules of a given logical language.”4, while a mathematical expression is “a finite combination of symbols that is well-formed according to rules that depend on the context.”5 Mathematical expressions are currently stored as TeX strings in the source of Wikipedia articles and rendered as pictures for viewers. They are marked and formatted with the <math> tag. This formatting from TeX string t