DBPRO WikiData MathFormula

Abstract

Since 2001, Wikipedia has provided people around the world with information written by volunteers, making it the first free-as-in-freedom online encyclopedia available in more than 200 languages. Over time, Wikipedia has evolved to distribute this information more efficiently. One of the improvements is Wikidata, a database that unifies and centralizes language-independent information, which would otherwise be maintained separately in each language's article on the same topic. To offer advanced options for similar kinds of information, such as distance calculation for geographic coordinates, data types were created for Wikidata. There is no data type for mathematical formulae yet, although they fulfill the requirements for one and their properties do not fit any of the data types already implemented. In our project, part of the “DBPRO: Datenbankprojekt” course at the Technical University Berlin, we define and implement this new data type for mathematical expressions and formulae.

Introduction

The Wikidata project was launched in October 2012 by the Wikimedia Foundation with the goal of centralizing and unifying language-independent information, such as the population of a city, in a well-structured database. Many kinds of information share similar traits across different properties, so several data types were implemented. Examples are “monolingual text” for language-independent words like names, or “time” for birthdays and other specific dates. These data types enable more use cases for the data than raw information storage alone.

Our task was

  1. to get an overview of Wikidata and its technical implementation

  2. to describe why it is necessary to implement a data type for mathematical formulae

  3. to specify the sub-problems, the technical properties of the new data type and how to create data with the new data type in Wikidata

  4. to describe how the new data type improves Wikidata

  5. to plan the implementation

  6. to implement the new data type

This report gives a brief overview of Wikidata and of our research related to implementing a new data type for math formulae. In section 3, we describe the Wikidata structure, and in section 4, we analyze what kind of information we regard as a “math formula”. We then discuss the implementation choices we had to consider in section 5, the test environment we set up to check the output of our code in section 6, the implementation in section 7 and its review process in section 8. Finally, we analyze the community feedback in section 9 and outline plans for the future in section 10.

Wikidata structure

Wikidata is a key-value database in which the value consists of information or documents. In addition, it enables

  1. linking, to connect Wikipedia articles in different languages about the same topic

  2. info box data, as a summarized collection of information

  3. lists created via database queries, like “Return every item that has a property with ‘P2534’ as data type.”1

To illustrate, each Wikipedia article about Berlin2 links to the Wikidata page Berlin3. The Wikidata page of Berlin is referred to as an item. Items have values assigned to properties, which provide detailed additional information depending on the data type; each such assignment is known as a “claim”. Together, these form the key structures of Wikidata. Each piece of the structure is briefly explained in the following.

Item

Items are mappings of real-world objects into Wikidata. With respect to Wikipedia, this means that each Wikipedia article can be represented as an item in Wikidata. Items are unique and are therefore used to link different Wikipedia articles that refer to the same object.

Each item can have multiple properties with values that carry additional information. The item London, for example, has a number of inhabitants and geographic coordinates on earth as properties. Items can also serve as values of properties of other items. For example, London (value) is the capital (property) of England (item).

Property

A property is equivalent to an attribute of an entity. It is used to specify information belonging to an item. Each property has exactly one data type, which determines the structure of the values it accepts. The combination of a property and a value in an item is called a “claim”. A claim can be refined by qualifiers and backed by a source. A claim that references a source is called a “statement”. Items and their statements make up the core structure of data management in Wikidata.
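The relationship between items, claims and statements can be sketched as a small data model. This is a minimal illustration only: the class names, field names and the placeholder source string are assumptions made for this example and do not reflect Wikidata's actual storage format.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """A property paired with a value, optionally refined by qualifiers."""
    property_label: str
    value: object
    qualifiers: dict = field(default_factory=dict)


@dataclass
class Statement:
    """A claim that is backed by a source."""
    claim: Claim
    source: str


@dataclass
class Item:
    """A real-world object mapped into the database."""
    label: str
    statements: list = field(default_factory=list)


london = Item("London")
england = Item("England")

# England (item) has the capital (property) London (value).
england.statements.append(
    Statement(Claim("capital", london), source="hypothetical reference")
)
```

Because the value of a claim may itself be an item, the same structure covers both plain values and links between items.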

The overall relation between all components is shown in Figure 3.

(Figure 3)

Data types

Wikidata allows data to be modeled with primitive or complex data types. For example, the data type “quantity” is made up of several decimals and a unit:

  • amount: the quantity’s main value

  • lowerBound: the quantity’s lower bound (1 by default)

  • upperBound: the quantity’s upper bound (1 by default)

  • unit: unit of measure item (empty for dimensionless values)

For the yearly cost of Berlin's new airport, for example, the bounds let us record not just the amount but also its uncertainty, such as “500.000.000 +/- 50.000.000 Euro”.
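As a rough sketch, the quantity from this example could be represented as follows. The field names follow the list above; the class itself and the helper `with_tolerance` are hypothetical and only illustrate how the bounds are derived from an amount and a tolerance.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Quantity:
    amount: Decimal       # the quantity's main value
    lower_bound: Decimal  # the quantity's lower bound
    upper_bound: Decimal  # the quantity's upper bound
    unit: str = ""        # empty for dimensionless values

    @classmethod
    def with_tolerance(cls, amount, tolerance, unit=""):
        """Hypothetical helper: build the bounds from amount +/- tolerance."""
        amount = Decimal(amount)
        tolerance = Decimal(tolerance)
        return cls(amount, amount - tolerance, amount + tolerance, unit)


cost = Quantity.with_tolerance("500000000", "50000000", unit="Euro")
print(cost.lower_bound, cost.upper_bound)  # prints: 450000000 550000000
```

Decimals are used instead of floats so that amounts and bounds are stored exactly as entered.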

Eight other data types are already implemented in Wikidata, which makes ten in total when quantity and mathematical expression are counted:

String is meant for language-independent sequences of characters. ISBNs and other kinds of codes belong here.

Monolingual text is used for names, which stay the same in each language.

URLs are links to external resources such as official web pages and e-mail addresses.

Time is used for all kinds of dates and their precision. Values are stored internally in the proleptic Gregorian calendar and can be converted and displayed in different formats.

Global coordinates are pairs of numbers that represent the degrees of latitude and longitude, or the position relative to a stellar object. The earth is used as the default reference body.

Item is also a data type: its values refer to other existing items in Wikidata and thereby connect different Wikipedia objects.

Property refers, similarly to Item, to other existing properties in Wikidata.

Commons media are links to files stored in Wikimedia Commons, like videos or maps.

These nine are divided into two kinds, depending on how they are stored internally.

Value type

The first kind of data types are value types, to which string, time, Commons media and global coordinates belong. They have their own data structure, which allows a high degree of customization and enables more advanced methods. The price is a large code maintenance overhead, especially for new contributors.

Property type

The rest are classified as property types. While they have their own use cases, they are based on, and stored as, already existing data types, namely value types. URL and monolingual text are property types based on the string value type. Because they share some methods with their parent data type, property types are easy to maintain. It is also possible to implement advanced methods for this kind of data type, but the difficulty and overhead of doing so are usually higher than when implementing the data type as a value type.
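The distinction can be pictured as a lookup from each data type to the value type it is stored as. The sketch below is a hypothetical registry that uses only the examples named in this section; it is not the registry Wikidata itself uses.

```python
# Value types have their own data structure and are stored as themselves.
VALUE_TYPES = {"string", "time", "commons media", "global coordinate"}

# Property types are stored as an existing value type (their parent).
PROPERTY_TYPES = {
    "url": "string",
    "monolingual text": "string",
}


def storage_type(data_type: str) -> str:
    """Return the value type under which a data type is stored."""
    if data_type in VALUE_TYPES:
        return data_type
    if data_type in PROPERTY_TYPES:
        return PROPERTY_TYPES[data_type]
    raise KeyError(f"unknown data type: {data_type}")


print(storage_type("url"))   # prints: string
print(storage_type("time"))  # prints: time
```

A property type reuses the storage and part of the behavior of its parent, which is why it is cheaper to maintain than a new value type.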

Datatype components

Every data type consists of three components:

- Parser: transforms the user's input into a data value

- Validator: checks the data value against additional constraints

- Formatter: transforms the data value into a fitting output
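The interplay of the three components can be sketched for a simple string-based data type. All names below are illustrative assumptions, and the length limit of 400 characters is an arbitrary value chosen for this example rather than a documented constraint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringValue:
    text: str


def parse(user_input: str) -> StringValue:
    """Parser: turn raw user input into a data value."""
    return StringValue(user_input.strip())


def validate(value: StringValue, max_length: int = 400) -> None:
    """Validator: check additional constraints on the data value."""
    if not value.text:
        raise ValueError("value must not be empty")
    if len(value.text) > max_length:
        raise ValueError("value is too long")


def format_plain(value: StringValue) -> str:
    """Formatter: turn the data value into a fitting output."""
    return value.text


value = parse("  E = mc^2 ")
validate(value)             # raises ValueError if a constraint is violated
print(format_plain(value))  # prints: E = mc^2
```

Keeping the three steps separate means that one data value can be paired with several formatters, for instance one for human readers and one for machines.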

Figure 4 displays the functionality of the components for a given data type:

(Figure 4)

Math formulae or mathematical expression

According to Wikipedia, a formula “refers to the general construct of a relationship between given quantities [while a math formula] is an entity constructed using the symbols and formation rules of a given logical language”4, while a mathematical expression is “a finite combination of symbols that is well-formed according to rules that depend on the context”5. Mathematical expressions are currently stored as TeX strings in the source of Wikipedia articles and presented to viewers as pictures. They are marked up and formatted with the <math> tag. This formatting from TeX string to picture is provided by a MediaWiki extension called the “Math extension”. It connects Wikipedia with MathML, which allows the expression to be rendered as readable output for users. The TeX string <math>r = \frac{1}{2} d</math> can be displayed as \(r = \frac{1}{2} d\).
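To make the markup concrete, the following sketch extracts the TeX strings from a piece of article source. It assumes plain <math> tags without attributes and is only an illustration; it is not how the Math extension itself processes articles.

```python
import re
from typing import List

# Simplifying assumption: tags carry no attributes and are not nested.
MATH_TAG = re.compile(r"<math>(.*?)</math>", re.DOTALL)


def extract_tex(wikitext: str) -> List[str]:
    """Return the TeX string of every <math>...</math> span in the text."""
    return [tex.strip() for tex in MATH_TAG.findall(wikitext)]


source = r"The radius is <math>r = \frac{1}{2} d</math>."
for tex in extract_tex(source):
    print(tex)
```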

So far, no data type in Wikidata could achieve the same display feature as the <math> tag. Each formula was embedded in the Wikipedia source, where it was not language independent and had to be added or edited separately in each language. The parts of a formula also had to be explained separately, and there was no fixed schema for doing so. A computer, and perhaps even a user, could not analyze and interpret the formula without reading and analyzing the surrounding text.

Necessity of the data type

Implementing a new data type for mathematical expressions would allow formulae to be

  • stored centrally and language independently. While the variable names might vary between languages, their meaning always stays the same, which makes a formula language independent.

  • validated upon input (via our Validator).

  • formatted according to the context in which they appear. On the one hand, users can see a formula in human-readable output, regardless of which browser they use or where they live. On the other hand, while Wikipedia should present a human-friendly format, Wikidata can provide an output intended for computers.

This opens up several options for extending the usage of formulae. It should be possible to

- link parts of a formula's TeX string to the items that specify them

- have a legend for each formula so that it is interpreted uniformly

- have queries like “Show me all formulae where mass is used.”
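A query of this kind could look roughly like the sketch below, assuming that each formula carries a legend mapping its symbols to item labels. The `Formula` class, the sample entries and the function `formulae_using` are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Formula:
    label: str
    tex: str
    legend: dict = field(default_factory=dict)  # symbol -> item label


# Illustrative sample data, not taken from Wikidata.
FORMULAE = [
    Formula("mass-energy equivalence", r"E = m c^2",
            {"E": "energy", "m": "mass", "c": "speed of light"}),
    Formula("radius from diameter", r"r = \frac{1}{2} d",
            {"r": "radius", "d": "diameter"}),
]


def formulae_using(concept, formulae):
    """Return all formulae whose legend links some symbol to the concept."""
    return [f for f in formulae if concept in f.legend.values()]


print([f.label for f in formulae_using("mass", FORMULAE)])
# prints: ['mass-energy equivalence']
```

The lookup runs over the legend rather than the TeX string, since a symbol such as m only means mass where the legend says so.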

Before adding the new data type, we had to determine whether it is useful enough on its own or whether an already existing data type meets our needs. Since a formula is language-independent information, an alternative to implementing a data type of its own could have been to build an add-on to monolingual text or string, as is already done for chemical formulae.

Once added to Wikipedia, a formula usually does not change frequently, because it has been proved by scientists beforehand and is checked by certain Wikipedia users afterwards. Because of this limited scope, the data type will only be relevant for a small portion of Wikipedia.

Mathematical formulae do offer broader use cases as a data type of their own, though. Unlike chemical formulae