# Abstract

Since 2001, Wikipedia has provided people around the world with information written by volunteers, making it the first free-as-in-freedom online encyclopedia available in more than 200 languages. Over time, Wikipedia has evolved to distribute this information more efficiently. One of these improvements is Wikidata, a database that unifies and centralizes language-independent information, which is needed because, at present, the articles about the same topic in different languages are independent of each other. To offer advanced options for similar kinds of information, such as distance calculation for geographic coordinates, data types were created for Wikidata. There is no data type for mathematical formulae yet, although formulae fulfill the requirements for one and their properties do not fit any of the already implemented data types. In our project, part of the “DBPRO: Datenbankprojekt” course at the Technical University of Berlin, we define and implement this new data type for mathematical expressions and formulae.

# Introduction

The Wikidata project was launched in October 2012 by the Wikimedia Foundation with the goal of centralizing and unifying language-independent information, such as the population of a city, in a well-structured database. Many kinds of information share similar traits across different properties, so several data types were implemented. Examples are “monolingual text” for language-independent words such as names, and “time” for birthdays or other specific dates. Data types open up more use cases for the data than raw information storage alone.

Our task was to

1. get an overview of Wikidata and its technical implementation

2. describe why it is necessary to implement a data type for mathematical formulae

3. specify the sub-problems, the technical properties of the new data type and how to create data with the new data type in Wikidata

4. describe how the new data type improves Wikidata

5. plan the implementation

6. implement the new data type

This report gives a brief overview of Wikidata and of our research related to the task of implementing a new data type for mathematical formulae. In section 3, we describe the structure of Wikidata, and in section 4, we analyze what kind of information we regard as a “math formula”. We then discuss the implementation choices we had to consider in section 5, the test environment we set up to check the output of our code in section 6, the implementation in section 7 and its review process in section 8. Finally, we analyze the community feedback in section 9 and outline plans for the future in section 10.

# Wikidata structure

Wikidata is a key-value database in which the values consist of information or documents. In addition, it enables

1. linking to connect different Wikipedia articles in different languages about the same topic

2. info box data as summarized collection of information

3. lists created via database queries, such as “Return every item that has a property with ‘P2534’ as data type.” [1]

To illustrate this with an example: each Wikipedia article about Berlin [2] links to the Wikidata page for Berlin [3]. The Wikidata page for Berlin is referred to as an item. An item has several values assigned to properties, which provide detailed additional information depending on the data type; each such assignment is known as a “claim”. Together, these form the key structures of Wikidata. Each piece of the structure is briefly explained below.

## Item

Items are mappings of real-world objects into Wikidata. With respect to Wikipedia, this means that each Wikipedia article can be represented as an item in Wikidata. Items are unique and are therefore used to link different Wikipedia articles that refer to the same object.

Each item can have multiple properties with values that carry additional information. The item London, for example, has the number of inhabitants, a capital city and the geographic coordinates on Earth as properties. Items can also be the values of properties of other items. For example, London (value) is the capital (property) of England (item).

## Property

A property is equivalent to an attribute of an entity. It is used to specify information belonging to an item. Each property has exactly one data type, which determines the structure of the values it accepts. The combination of a property and a value in an item is called a “claim”. A claim can be made more specific by qualifiers and can be backed by a source. A claim that references a source is called a “statement”. Items and their statements make up the core structure of data management in Wikidata.

The overall relation between all components is shown in Figure 3.
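The relation between items, claims and statements can be sketched in a few lines of Python. This is only an illustrative model, not Wikidata's actual storage format; the class names, the field names and identifiers such as `"capital"` are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Claim:
    """A property paired with a value, optionally refined by qualifiers."""
    property_id: str   # hypothetical identifier, e.g. "capital"
    value: str         # a plain value or the identifier of another item
    qualifiers: dict = field(default_factory=dict)


@dataclass
class Statement:
    """A claim together with the source that backs it."""
    claim: Claim
    source: Optional[str] = None


@dataclass
class Item:
    """A real-world object mapped into the database."""
    item_id: str
    label: str
    statements: List[Statement] = field(default_factory=list)

    def values_of(self, property_id: str) -> List[str]:
        """Return all values stored for the given property."""
        return [s.claim.value for s in self.statements
                if s.claim.property_id == property_id]


# London (value) is the capital (property) of England (item).
england = Item("england", "England")
england.statements.append(
    Statement(Claim("capital", "london"), source="hypothetical reference")
)
print(england.values_of("capital"))
```

In this sketch an item used as a value is referenced by its identifier, which mirrors the idea that items can be values of properties of other items.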

(Figure 3)

## Data types

Wikidata allows data to be modeled with primitive or complex data types. For example, the data type “quantity” is made up of several decimal numbers and a unit:

• amount: the quantity’s main value

• lowerBound: the quantity’s lower bound (1 by default)

• upperBound: the quantity’s upper bound (1 by default)

• unit: unit of measure item (empty for dimensionless values)

Thanks to the bounds, we can store not only the yearly costs of Berlin’s new airport but also their uncertainty, for example “500,000,000 ± 50,000,000 euros”.

Besides quantity, eight other data types are already implemented in Wikidata, which makes ten in total once mathematical expression is added:
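The four parts of a quantity can be illustrated with a small Python sketch. It mirrors the fields listed above, but the class itself, the helper `with_uncertainty` and the unit name `"euro"` are assumptions for illustration and not Wikibase code.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Quantity:
    amount: Decimal               # the quantity's main value
    lower_bound: Decimal          # the quantity's lower bound
    upper_bound: Decimal          # the quantity's upper bound
    unit: Optional[str] = None    # None stands for a dimensionless value

    @classmethod
    def with_uncertainty(cls, amount, uncertainty, unit=None):
        """Build a quantity from a main value and a symmetric uncertainty."""
        amount = Decimal(amount)
        uncertainty = Decimal(uncertainty)
        return cls(amount, amount - uncertainty, amount + uncertainty, unit)

    def contains(self, value) -> bool:
        """Check whether a value lies within the stored bounds."""
        return self.lower_bound <= Decimal(value) <= self.upper_bound


# The airport example: 500,000,000 plus or minus 50,000,000 euros.
cost = Quantity.with_uncertainty("500000000", "50000000", unit="euro")
print(cost.lower_bound, cost.upper_bound)
```

Storing explicit bounds rather than a single uncertainty keeps the model open to asymmetric ranges, at the cost of one extra field.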

String is meant for language-independent sequences of characters. ISBNs and other kinds of codes belong here.

Monolingual text is used for names, which stay the same in every language.

URL values are links to external resources such as official web pages and e-mail addresses.

Time is used for all kinds of dates and their precision. Dates are stored internally in the proleptic Gregorian calendar and can be converted and displayed in different formats.

Global coordinate values are pairs of numbers that represent the degrees of latitude and longitude, or the position relative to a stellar object. The Earth is used as the default reference body.

Item refers to other existing items in Wikidata and thereby connects different Wikipedia objects.

Property refers, similarly to Item, to other existing properties in Wikidata.

Commons media values are links to files stored in Wikimedia Commons, such as videos or maps.

These nine fall into two groups, depending on how they are stored internally.

### Value type

The first group consists of value types, to which string, time, Commons media and global coordinate belong. They have their own data structure, which allows a high degree of customization and enables more advanced methods. The price is a considerable code maintenance overhead, especially for new contributors.

### Property type

The remaining data types are classified as property types. While they have their own use cases, they are based on, and stored as, already existing data types, namely value types. URL and monolingual text are property types based on the string value type. Because they share some methods with their parent data type, property types are easy to maintain. It is also possible to implement advanced methods for this kind of data type, but the difficulty and overhead of doing so is usually higher than implementing the data type as a value type.

## Datatype components

Every data type consists of three components:

- Parser: transforms the user’s input into a data value

- Validator: checks the data value against additional constraints

- Formatter: transforms the data value into a fitting output

Figure 4 shows how the components work together for a given data type:

(Figure 4)
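The interplay of the three components can be sketched as a small pipeline. The following Python example is a simplified illustration for a string-based data type; the function names, the `StringValue` class and the length limit of 400 characters are assumptions for this sketch and not taken from Wikibase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringValue:
    """Data value produced by the parser."""
    text: str


def parse(user_input: str) -> StringValue:
    """Parser: turn raw user input into a data value."""
    return StringValue(user_input.strip())


def validate(value: StringValue, max_length: int = 400) -> list:
    """Validator: return a list of constraint violations (empty if valid)."""
    errors = []
    if not value.text:
        errors.append("value must not be empty")
    if len(value.text) > max_length:
        errors.append("value is too long")
    return errors


def format_value(value: StringValue) -> str:
    """Formatter: turn the data value into output for display."""
    return value.text


def process(user_input: str) -> str:
    """Run the input through parser, validator and formatter in order."""
    value = parse(user_input)
    errors = validate(value)
    if errors:
        raise ValueError("; ".join(errors))
    return format_value(value)


print(process("  Berlin "))
```

Keeping the three steps separate means that a new data type can reuse an existing parser and only supply its own validator and formatter.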

# Math formulae or mathematical expression

According to Wikipedia, a formula “refers to the general construct of a relationship between given quantities”, [while a math formula] “is an entity constructed using the symbols and formation rules of a given logical language” [4], while a mathematical expression is “a finite combination of symbols that is well-formed according to rules that depend on the context” [5]. Mathematical expressions are currently stored as TeX strings in the source of Wikipedia articles and shown to viewers as pictures. They are marked and formatted with the `<math>` tag. This formatting from a TeX string to a picture is done by a MediaWiki extension called the “Math extension”. It provides the connection between Wikipedia and MathML, which allows the expression to be formatted as readable output for the users. The TeX string `<math>r = \frac{1}{2} d</math>` can be displayed as $$r = \frac{1}{2} d$$.

Until now, Wikidata had no data type that could achieve the same display feature as the `<math>` tag. Each formula was embedded in the Wikipedia source, where it was not language independent and had to be added or edited in each language separately. The parts of a formula also had to be explained separately in each language, and there was no fixed schema for doing so. A computer, and perhaps even a user, could not analyze and interpret the formula without reading and analyzing the surrounding text.

## Necessity of the data type

Implementing a new data type for mathematical expressions would allow formulae to be

• saved centrally and independently of language. While the variable names might differ between languages, their meaning always stays the same, which makes a formula language independent.

• validated upon input (via our Validator).

• formatted according to the context in which they appear. On the one hand, users can see a formula in human-readable output, regardless of which browser they use or where they live. On the other hand, while Wikipedia should present a human-friendly format, Wikidata can provide output intended for computers.

This opens up several options for extending the usage of formulae. It should be possible to

- identify parts inside the TeX string of a formula with the specifications of an item

- have a legend for each formula so that it is interpreted uniformly

- run queries like “Show me all formulae in which mass is used.”
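A query of the kind listed above can be illustrated with a small in-memory example in Python. The three entries are made-up sample data rather than actual Wikidata content, and the dictionary layout with a `legend` that maps each symbol to a concept is an assumption for this sketch.

```python
# Sample formulae, each with a legend that links symbols to concepts.
formulae = [
    {"label": "Newton's second law", "tex": r"F = m a",
     "legend": {"F": "force", "m": "mass", "a": "acceleration"}},
    {"label": "mass-energy equivalence", "tex": r"E = m c^2",
     "legend": {"E": "energy", "m": "mass", "c": "speed of light"}},
    {"label": "radius from diameter", "tex": r"r = \frac{1}{2} d",
     "legend": {"r": "radius", "d": "diameter"}},
]


def formulae_using(concept, entries):
    """Return the labels of all formulae whose legend mentions the concept."""
    return [e["label"] for e in entries if concept in e["legend"].values()]


print(formulae_using("mass", formulae))
```

The query works on the legend and not on the TeX string, because a symbol such as `m` has no fixed meaning on its own.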

Before adding the new data type, we had to determine whether it is useful enough on its own or whether an already existing data type meets our needs. Since a formula is language-independent information, an alternative to implementing a separate data type would have been to build an add-on to monolingual text or string, which is already used for chemical formulae.

Once added to Wikipedia, a formula usually does not change frequently, because it was proved by scientists beforehand and is checked by Wikipedia users afterwards. Because of its limited scope, the data type will only be relevant for a small portion of Wikipedia.

Mathematical formulae do offer broader use cases as a separate data type, though. Unlike chemical formulae, mathematical ones have variables and are used for calculations, support for which can be added to the data type in the future. Something similar is already possible with WolframAlpha [6], a computational knowledge engine that lets the user enter a formula and returns information such as its graph or its zeros. WolframAlpha can also be used as a calculator with a TeX string as input, a feature we also envision for future work on our data type. While this might also be possible with the add-on approach, it would create a huge overhead for every other use case of the underlying data type.

# Implementation choices

Before we could start implementing the data type, we had to make several implementation choices. Because formulae are already displayed in Wikipedia as formatted TeX strings, we decided to keep it that way, so that the users who already edit the pages keep a familiar way of editing the information. In general, we had two main options for implementing the data type: either creating a new value type “math formulae” or a property type “math formulae” with “string” as its value type. Because it would be difficult to change this afterwards without breaking already existing formulae, we had to decide very early.

## Math formula as value type

A new value type would work similarly to geographic coordinates. It would be modeled as a tuple with the following elements:

- The TeX string for displaying the formula. The general formula for polynomials would be written as f(x) = \sum\limits_{i = 0}^{n} a_i x^i and be displayed in Wikipedia as $$f(x) = \sum\limits_{i = 0}^{n} a_i x^i$$

- An array of variables, each with a link to the respective item it belongs to. In this example, $$a_i$$ would get a link to the item “coefficient”.

- Other meta information about the formula, stored as a string. An example would be that for a polynomial of degree $$n$$, the coefficient $$a_n$$ of $$x^n$$ must not be $$0$$, and $$n$$ is an element of $$\mathds{N}$$.

Combined, the formula would look like this in Wikipedia:
$$f(x) = \sum\limits_{i = 0}^{n} a_i x^i, n \in \mathds{N}, a_n \neq 0$$
$$a_i$$ : coefficient

The input for this in Wikidata could be structured analogously to that of geographic coordinates:
f(x) = \sum\limits_{i = 0}^{n} a_i x^i;
a_i : coefficient ;
n \in \mathds{N}, a_n \neq 0
Each row would represent an element of the tuple.
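As a rough illustration of this option, the tuple and the three-row input could be modeled as follows in Python. The class `MathFormulaValue`, the function `parse_formula_input` and the convention of separating several variables by commas are assumptions made for this sketch; they were not part of any implementation.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MathFormulaValue:
    tex: str                                   # TeX string for display
    variables: Dict[str, str] = field(default_factory=dict)  # symbol -> item
    meta: str = ""                             # further constraints as text


def parse_formula_input(raw: str) -> MathFormulaValue:
    """Split the input into its three rows, separated by ';'."""
    rows = [row.strip() for row in raw.split(";")]
    if len(rows) != 3:
        raise ValueError("expected three rows separated by ';'")
    tex, variable_row, meta = rows
    variables = {}
    # Each entry has the form "symbol : item".
    for entry in filter(None, (e.strip() for e in variable_row.split(","))):
        symbol, _, item = entry.partition(":")
        variables[symbol.strip()] = item.strip()
    return MathFormulaValue(tex, variables, meta)


raw = r"""f(x) = \sum\limits_{i = 0}^{n} a_i x^i;
a_i : coefficient ;
n \in \mathds{N}, a_n \neq 0"""

value = parse_formula_input(raw)
print(value.variables)
```

The sketch also shows a weakness of this input format: a TeX string that itself contains a semicolon would be split incorrectly, so a real parser would need an escaping rule.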

## Math formula as property type

Implementing the data type as a property type with string as its value type would change nothing with regard to the output. Instead of storing the information separately in a tuple, however, all the information would be contained in the TeX string. For the example above, a possible notation would be
f(x) = \sum\limits_{i = 0}^{n} #a_i|coefficient# x^i,
n \in \mathds{N}, a_n \neq 0
On the one hand, we wanted our data type to handle complex tasks such as calculation and linking variables and sub-formulae to their respective items; on the other hand, we had to make sure that our data type would be usable for Wikipedia users. After a discussion with Lydia Pintscher, the product manager of Wikidata, and Daniel, a developer from the Wikimedia Foundation who works on data types in Wikidata, we decided on the property type version. Maintenance by outsiders is much easier this way, and we did not find enough features to justify the overhead that the value type implementation would bring. Before we could start implementing the data type, though, we needed a way to see how it would look when deployed in Wikipedia.
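The inline notation `#a_i|coefficient#` from the example above could be processed as sketched below. This Python snippet is only an illustration of the idea; the function name and the regular expression are assumptions, and the notation itself was a proposal at this stage.

```python
import re

# Matches "#symbol|item#"; neither part may contain '#' or '|'.
ANNOTATION = re.compile(r"#([^#|]+)\|([^#|]+)#")


def split_annotations(annotated_tex: str):
    """Return the plain TeX string and a mapping from symbol to item."""
    links = {m.group(1).strip(): m.group(2).strip()
             for m in ANNOTATION.finditer(annotated_tex)}
    # Replace every annotation by its symbol only. A function is used as
    # replacement so that backslashes in the symbol are kept literally.
    plain_tex = ANNOTATION.sub(lambda m: m.group(1), annotated_tex)
    return plain_tex, links


plain, links = split_annotations(
    r"f(x) = \sum\limits_{i = 0}^{n} #a_i|coefficient# x^i"
)
print(plain)
print(links)
```

Because `#` and `|` also have a meaning in TeX, a notation like this would need either different delimiters or an escaping rule before it could be used reliably.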

# Test environment

To check how our implementation would behave when deployed in the production environment, we needed something to simulate Wikidata and Wikipedia, as we cannot work in the actual Wikidata. We use a Vagrant environment [7] with the Wikibase [8] and Math extensions. Vagrant allows us to create our own Wikidata environment on a server, running the same code as the actual Wikidata in addition to our own. It provided all the resources we needed, and we could all work on the same server instead of each maintaining a local setup. Because the extensions are also used in Wikidata, our implementation will very likely behave the same way in the production environment. To check how our code would behave in Wikidata, we set up a proxy to monitor our outputs [9]. After we had finished setting up our test environment (see the Vagrant figure below), we could start working on our task.

(Vagrant Figure)

# Basic implementation

The implementation process started on 14 December. In line with the rest of the Wikidata project, we used PHP to implement the components of our data type. Unlike the other data types, the files for our data type are located in the Math extension. Our implementation contains three parts: the hook, the Validator and the Formatter. Usually, one has to implement a Parser too. Since we use strings as input and string as the value type of our property type implementation, however, we can simply use the predefined string Parser to transform the user’s input into data values for our Validator and Formatter to process.

The hook (see the hook figure at the end of the report) is the bridge between the Math extension, via the extension.json [10], and the other data types in Wikidata. It also links the components of the data type and is built similarly to the other declarations, calling the respective functions to process the data. There are two declarations, one for the client, which belongs to Wikipedia, and one for the repository, which belongs to Wikidata. Unlike the repository, where the information is processed and saved, the client only needs to be able to display the formulae.

The RDF builder, which is responsible for additional storage management, was added by another developer.
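The role of the hook can be illustrated with a simplified registry in Python. The real hook is written in PHP and uses the interfaces of Wikibase; every name in this sketch, including the registry keys and both registration functions, is hypothetical and only serves to show the difference between the repository and the client declaration.

```python
def validate_math(value):
    """Placeholder validator: accepts any string."""
    return isinstance(value, str)


def format_math(value):
    """Placeholder formatter: returns the value unchanged."""
    return value


def register_repo_data_type(registry):
    """Repository side: values are validated, stored and displayed."""
    registry["math"] = {
        "value-type": "string",
        "validator-factory": lambda: validate_math,
        "formatter-factory": lambda: format_math,
    }


def register_client_data_type(registry):
    """Client side: values only need to be displayed."""
    registry["math"] = {
        "value-type": "string",
        "formatter-factory": lambda: format_math,
    }


repo_types, client_types = {}, {}
register_repo_data_type(repo_types)
register_client_data_type(client_types)
print(sorted(repo_types["math"]))
print(sorted(client_types["math"]))
```

Registering factories instead of ready-made objects lets the host system decide when, and with which options, a validator or formatter is created.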

(Validate Code)

The Validator checks whether the input meets our constraints. First, we check whether the given input is a string value, so that it can be transformed into a proper data value. The second step is to check whether the input is a valid TeX string. This is necessary to prevent errors while formatting the string. In future development, the Validator will additionally have to check whether the meta data are properly inserted and match the related TeX string. Once the input passes the Validator, it can be saved in Wikidata and later be queried and returned via the Formatter.

The Formatter transforms our TeX string into an output that depends on the use case. It consists of two main functions, construct and format, both listed below.
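The two validation steps can be sketched as follows in Python. The real Validator is written in PHP and checks whether the input is valid TeX; in this sketch, a simple check for balanced braces stands in for that step, which is an assumption made purely for illustration.

```python
def validate_formula(value) -> list:
    """Return a list of problems; an empty list means the input is accepted."""
    # Step 1: the input has to be a string value.
    if not isinstance(value, str):
        return ["value must be a string"]

    errors = []
    if not value.strip():
        errors.append("formula must not be empty")

    # Step 2 (stand-in for a real TeX check): braces must be balanced.
    depth = 0
    escaped = False
    for ch in value:
        if escaped:              # character after a backslash, e.g. "\{"
            escaped = False
            continue
        if ch == "\\":
            escaped = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:        # closing brace without an opening one
                break
    if depth != 0:
        errors.append("unbalanced braces")
    return errors


print(validate_formula(r"r = \frac{1}{2} d"))
print(validate_formula(r"\frac{1}{2"))
```

Returning a list of problems rather than a single boolean makes it possible to report all issues to the user at once.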

(construct Code)

In Wikipedia and Wikidata, there are five different output types in total. So that our format function can distinguish which kind of formatting to use, we have to load and store that information in our constructor (see the construct listing above). Initially, we attempted to pass that information directly to the format function, but the interface did not allow it. We chose to use Wikibase’s SnakFormatter formats, which can be viewed as a list of every type of output environment for our Formatter, in order to have results consistent with the rest of Wikidata. In return, we accept a higher risk of unpredictable errors in case the SnakFormatter changes.

(format Code)

Depending on the format loaded in the constructor, the format function (see the format listing above) turns the TeX string into a suitable output, for example a human-friendly rendering. The plain-text format is used for input boxes. It passes the input through as it is and must make sure that the input is not changed.
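The division of work between constructor and format function can be sketched in Python as follows. The actual Formatter is written in PHP; the class below, the three format identifiers and the HTML output are assumptions for illustration, and only three of the five output types are covered.

```python
import html

# Hypothetical identifiers for three output types.
FORMAT_PLAIN = "text/plain"
FORMAT_WIKI = "text/x-wiki"
FORMAT_HTML = "text/html"


class MathFormatter:
    def __init__(self, output_format: str):
        """Store the output format, since format() only receives the value."""
        if output_format not in (FORMAT_PLAIN, FORMAT_WIKI, FORMAT_HTML):
            raise ValueError("unsupported output format: " + output_format)
        self.output_format = output_format

    def format(self, tex: str) -> str:
        """Turn the TeX string into output for the stored format."""
        if self.output_format == FORMAT_PLAIN:
            return tex                      # input boxes: leave unchanged
        if self.output_format == FORMAT_WIKI:
            return "<math>" + tex + "</math>"
        # Stand-in for real rendering: escape the TeX and wrap it in a span.
        return '<span class="math">' + html.escape(tex) + "</span>"


print(MathFormatter(FORMAT_PLAIN).format(r"a < b"))
print(MathFormatter(FORMAT_WIKI).format(r"a < b"))
print(MathFormatter(FORMAT_HTML).format(r"a < b"))
```

Rejecting unknown formats in the constructor surfaces a change in the list of formats early, instead of producing wrong output later.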