Julian edited untitled.tex  about 8 years ago

Commit id: 1aa155ee72262eae03dae8d8d87bae8e1bed297c


\section{Abstract}

Wikidata is essentially a freely editable collection of structured data held in a database. One of its goals is to support Wikipedia, as well as other projects, by storing language-independent data, such as the population of Berlin, centrally so that it can be retrieved from one place.

The underlying structure maps things from the real world to so-called items. Items are unique and ideally refer to an article of a Wikimedia project. An item can be linked to any number of claims, each of which assigns a value to a property of the item and thereby expresses a piece of information. Claims can be rated by a rank, which orders them by quality, and can be described and supplemented by qualifiers. A claim that is verifiably backed by sources is called a statement. Items and their claims and statements form the core structure of data storage in Wikidata.

Figure 1.1 illustrates this relationship: \url{https://de.wikipedia.org/wiki/Wikidata#/media/File:Wikidata_statement_de.svg}

From Figure 1.1 one might conclude that values are always primitive data types. In fact, an important feature of Wikidata is that data is often modeled with composite data types. The data type ``geographic locations'', for example, consists of several decimal values and the underlying coordinate system.

The goal of this project is to lay the foundation for a new data type with which mathematical formulae can be modeled.

\section{Introduction}

The Wikidata project was launched in October 2012 by the Wikimedia Foundation with the goal of centralizing and unifying language-independent information, such as the population of a city, in a well-structured database.\cite{WikidataIntro}
%It is mainly supervised by Wikimedia Deutschland. \cite{wikiDataIntro}
%TODO: citation needed @@@Julian: permanent link to wikipedia article https://en.wikipedia.org/w/index.php?title=Wikidata&oldid=706179445

Many kinds of information share similar traits across different properties, so several data types were implemented. Examples are ``monolingual text'' for language-independent words such as names, and ``time'' for birthdays or other specific dates. Data types open up more use cases for the data than raw information storage.
%TODO: call that datatypes that will make it clear to the target audience

Mathematical formulae are one such kind of information: they need a special kind of display in Wikipedia and could be used for actual calculations\footnote{Bug tracker: \url{https://phabricator.wikimedia.org/T67397}}.

%@@@Julian: I am not sure what you mean. But just go ahead.
Our task was
\begin{enumerate}
\item to get an overview of Wikidata and its technical implementation,
\item to describe why a data type for mathematical formulae is necessary,
\item to specify the subproblems, the technical properties of the new data type, and how data is created with the new data type in Wikidata,
\item to describe how the new data type improves Wikidata,
\item to plan the implementation, and
\item to implement the new data type.
\end{enumerate}

This report gives a brief overview of Wikidata and of our research related to the task, the plans we made before the implementation, the steps we took while implementing, and our concepts for future development.

\section{Wikidata structure}

Wikidata is a key-value database in which the value consists of information or documents. In addition, it enables
\begin{itemize}
\item linking, to connect Wikipedia articles in different languages about the same topic,
\item info box data, as a summarized collection of information,
\item lists created via database queries, such as ``Return every item which has a property with `P2534' as data type.''\footnote{Actual query: \url{http://tinyurl.com/gljaylt}}
\end{itemize}

To give an example, each Wikipedia article about Berlin\footnote{There are 229 entries for Berlin} links to the Wikidata page of Berlin\footnote{Wikidata page of Berlin: \url{https://www.Wikidata.org/wiki/Q64}}. This Wikidata page is referred to as an item. An item has several values assigned to properties, which provide detailed additional information depending on the data type; each such assignment is known as a ``claim''. In the following, each piece of the structure is briefly explained.

\subsection{Item}

Items map objects of the real world into Wikidata and are one of its key structures. In relation to Wikipedia, this means that each Wikipedia article representing an object corresponds to an item. Items are unique and are therefore used to link the different Wikipedia articles that refer to the same object.

Each item can have multiple properties with values for additional information.

\subsection{Property}

A property is equivalent to an attribute of an entity and is used to specify information belonging to an item. Each property has exactly one data type, which fixes the structure of the values that can be assigned to it.

The combination of a property and a value in an item is called a ``claim'', and an item can have several claims. Figure~\ref{pic1} shows an example: the item ``Deutschland'' has a claim with the value ``Angela Merkel'' for the property ``head of government''.

(Figure 1)

Claims can be made more specific by qualifiers and backed by a source.
A claim that is backed by a source is called a ``statement'', as shown in Figure~\ref{pic2}.

(Figure 2)

Items and their statements make up the core structure of data management in Wikidata. The overall relation between all components is shown in Figure~\ref{pic3}.

(Figure 3)

\subsection{Data types}

Wikidata allows data to be modeled with primitive or complex data types. For example, the data type ``quantity'' is made up of several decimals and a unit\footnote{Source: \url{https://www.mediawiki.org/wiki/Wikibase/DataModel\#Quantities}}:
\begin{itemize}
\item amount: the quantity's main value
\item lowerBound: the quantity's lower bound (1 by default)
\item upperBound: the quantity's upper bound (1 by default)
\item unit: unit of measure item (empty for dimensionless values)
\end{itemize}

Thanks to the bounds, we can state not only the yearly costs of Berlin's new airport but also their uncertainty, for example ``500,000,000 $\pm$ 50,000,000 euros''.

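The four parts of a quantity can be illustrated with a small data structure. The following is only a minimal Python sketch, not Wikibase code: the class \verb|Quantity|, the helper \verb|with_margin| and the unit placeholder \verb|Q_EURO| are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Quantity:
    """Hypothetical stand-in for the four-part quantity value described above."""
    amount: Decimal               # the quantity's main value
    lower_bound: Decimal          # lowest plausible value
    upper_bound: Decimal          # highest plausible value
    unit: Optional[str] = None    # ID of the unit item; None for dimensionless values

    def __post_init__(self):
        # The main value has to lie inside the interval given by the bounds.
        if not (self.lower_bound <= self.amount <= self.upper_bound):
            raise ValueError("amount must lie between lower_bound and upper_bound")

    @classmethod
    def with_margin(cls, amount, margin, unit=None):
        """Build a quantity from a value and a symmetric +/- margin."""
        amount, margin = Decimal(amount), Decimal(margin)
        return cls(amount, amount - margin, amount + margin, unit)


# "Q_EURO" is a placeholder, not a real Wikidata item ID.
cost = Quantity.with_margin("500000000", "50000000", unit="Q_EURO")
print(cost.lower_bound, cost.upper_bound)
```

Storing the bounds next to the main value keeps the uncertainty machine-readable instead of hiding it in a display string.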
Besides quantity, eight other data types are already implemented in Wikidata\cite{wikiDataList}, which makes ten in total together with mathematical expression:
\begin{itemize}
\item \textbf{Strings} are language-independent sequences of characters. ISBNs and other kinds of codes belong here.
\item \textbf{Monolingual texts} are used for names that stay the same in every language.
\item \textbf{URLs} are links to external pages such as official web pages and e-mail addresses.
\item \textbf{Time} is used for all kinds of dates and their accuracy. Values are stored internally in the proleptic Gregorian calendar and can be converted and displayed in other formats.
\item \textbf{Global coordinates} can be pairs of numbers representing the degrees of latitude and longitude, or a position relative to a stellar object. The Earth is used as the default reference.
\item \textbf{Items} refer to other existing items in Wikidata.
\item \textbf{Properties} refer to other existing properties in Wikidata.
\item \textbf{Common media} are links to files stored in Wikimedia Commons, such as videos or maps.
\end{itemize}

These nine data types fall into two kinds, depending on how they are stored internally.

\subsubsection{Value type}

The first kind of data types are value types, to which string, time, common media and global coordinates belong. They have their own data structure, which allows a high degree of customization and enables more advanced methods. The price is a large code maintenance overhead, especially for new contributors.

\subsubsection{Property type}

The remaining data types are classified as property types. While they have their own use cases, they are based on and stored as already existing data types, namely value types. URL and monolingual text are property types based on the string value type. Because they share some methods with their parent data type, property types are easy to maintain. It is also possible to implement advanced methods for this kind of data type, but the difficulty and overhead of doing so are usually higher than when implementing the data type as a value type.

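The relation between the two kinds can be sketched as a lookup in which a property type falls back to the methods of its parent value type. This is a simplified Python illustration that follows the classification given above. All names (\verb|parse_string|, \verb|VALUE_TYPE_PARSERS|, \verb|PROPERTY_TYPE_PARENT|, \verb|parser_for|) and the type identifiers are assumptions made for this example and do not come from Wikibase.

```python
from typing import Callable, Dict


def parse_string(raw: str) -> str:
    """Parser of the hypothetical 'string' value type: trims surrounding whitespace."""
    return raw.strip()


# Value types bring their own parser.
VALUE_TYPE_PARSERS: Dict[str, Callable[[str], str]] = {
    "string": parse_string,
}

# Property types only name the value type they are stored as (per the text above).
PROPERTY_TYPE_PARENT: Dict[str, str] = {
    "url": "string",
    "monolingual-text": "string",
}


def parser_for(data_type: str) -> Callable[[str], str]:
    """Look up the parser of a data type, falling back to its parent value type."""
    value_type = PROPERTY_TYPE_PARENT.get(data_type, data_type)
    return VALUE_TYPE_PARSERS[value_type]


print(parser_for("url")("  https://www.wikidata.org  "))
```

In this picture, adding a property type costs one table entry, whereas a new value type needs its own parser. This mirrors the maintenance argument made above.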
\subsection{Data type components}

Every data type is made of three components:
\begin{itemize}
\item Parser: transforms the input into a data value
\item Validator: checks the data value against additional constraints
\item Formatter: transforms the data value into a fitting output
\end{itemize}

Figure~\ref{pic4} shows how the components work together for a given data type:

(Figure 4)

\section{Math formulae}

A formula ``refers to the general construct of a relationship between given quantities''\cite{formulaWikipedia}, while a math formula ``is an entity constructed using the symbols and formation rules of a given logical language.''\cite{formulaWikipedia}

Mathematical expressions are currently stored in Wikipedia as TeX strings and displayed as pictures. They are marked up and formatted with the \verb|<math>| tag. The tag is part of a MediaWiki extension called ``Math extension'', which connects Wikipedia with MathML and thereby formats the expression as readable output for users. The TeX string \verb|r = \frac{1}{2} d| can be displayed as $r = \frac{1}{2} d$.

Until now, Wikidata had no data type that could achieve the same display as the \verb|<math>| tag. Each formula was embedded in the Wikipedia source, where it was not language independent and had to be added or edited in each language separately. The explanation of the parts of a formula also had to be written separately, and there was no fixed schema for doing so. A computer, and perhaps even a user, could not analyze the formula without analyzing the surrounding text.

\subsection{Necessity of the data type}

A new data type for mathematical expressions would allow formulae to be
\begin{itemize}
\item stored centrally and language independently,
\item validated upon input (via our Validator),
\item formatted according to the context in which they appear.
\end{itemize}

This opens up several options for extending the use of formulae.
It should be possible to
\begin{itemize}
\item identify parts inside the TeX string of a formula and link them to the item that specifies them,
\item have a legend for each formula so that it is interpreted uniformly,
\item run queries like ``Show me all formulae in which mass is used.''
\end{itemize}

Before adding the new data type, we had to check whether it is useful enough on its own or whether an already existing data type matches our needs.

Because a formula is language-independent information, another option besides implementing a data type of its own would have been to build an add-on to monolingual text or string, as is already done for chemical formulae.
%TODO: Maybe you can add a table with pros and cons of a new data type vs. an existing data type. I like the chemical formula example very much.

Once added to Wikipedia, a formula usually does not change frequently, because it was proved by scientists beforehand and is checked by certain Wikipedia users afterwards.

Because of this limited scope, the data type will only be relevant for a small portion of Wikipedia.

Nevertheless, mathematical formulae offer a broader range of use cases as a data type of their own.

Unlike chemical formulae, mathematical ones have variables and are used for calculations. These calculations could be implemented in the data type\footnote{\url{http://www.wolframalpha.com/} enables queries resulting in complex information about a formula and is also able to calculate with a TeX string as input}. While this might also be possible with the add-on idea, it would create a huge overhead for every other use case.

There is no freely accessible database containing all kinds of mathematical formulae yet. A separate data type that stores metadata in addition to the formula makes such a database possible. Computers in particular can then look up all kinds of formulae much more easily.

\section{Implementation choices}

Before we could actually start to implement the data type, we had to think about several implementation choices.
Because formulae are already displayed in Wikipedia as formatted TeX strings, we decided to keep this representation, so that the users who already edit the pages keep a familiar way of editing the information. In general, we had two major options for implementing the data type: either a new value type ``math formulae'' or a property type ``math formulae'' with ``string'' as its value type. Because it would be difficult to change this afterwards without breaking already existing formulae, we had to decide very early.

\subsection{Math formula as value type}

A new value type would work similarly to geographic coordinates.

It would be modeled as a tuple with the following elements:
\begin{itemize}
\item The TeX string for displaying the formula. The general formula for polynomials would be written as \verb|f(x) = \sum\limits_{i = 0}^{n} a_i x^i| and displayed in Wikipedia as $f(x) = \sum\limits_{i = 0}^{n} a_i x^i$.
\item An array of variables, each with a link to the respective item it belongs to. In this example, $a_i$ would get a link to the item ``coefficient''.
\item Other meta information about the formula, as a string. For a polynomial of degree $n$, an example would be that the coefficient $a_n$ of $x^n$ must not be $0$ and that $n$ is an element of $\mathds{N}$.
\end{itemize}

Combined, the formula would look like this in Wikipedia:\\
$f(x) = \sum\limits_{i = 0}^{n} a_i x^i, n \in \mathds{N}, a_n \neq 0$\\
$a_i$ : coefficient\\
\\
The input for this in Wikidata could be analogous to that of geographic coordinates:\\
\verb!f(x) = \sum\limits_{i = 0}^{n} a_i x^i;!\\
\verb!a_i : coefficient ;!\\
\verb!n \in \mathds{N}, a_n \neq 0!\\
Each row would represent one element of the tuple.

\subsection{Math formula as property type}

Implementing the data type as a property type with string as its value type would change nothing with regard to the output. Instead of being stored separately in a tuple, however, all the information would be contained in the TeX string.
For the example from above, a possible notation would be\\
\verb!f(x) = \sum\limits_{i = 0}^{n} #a_i|coefficient# x^i,!\\
\verb!n \in \mathds{N}, a_n \neq 0!\\
On the one hand, we wanted our data type to handle complex tasks such as simple calculations and linking variables and sub-formulae to their respective items. On the other hand, we had to make sure that our data type would be usable for Wikipedia users. After a discussion with Lydia and Daniel from the Wikimedia Foundation, we chose the property type version. Maintenance by outsiders is much easier this way, and we did not find enough features to justify the overhead that the value type implementation would bring.

Before we could start to implement the data type, though, we needed a way to see what it would look like when deployed in Wikipedia.

\section{Test environment}

To check whether our implementation works, we needed a way to simulate Wikidata and Wikipedia.

We used a Vagrant environment\footnote{\url{https://www.vagrantup.com/}} with the Wikibase and Math extensions.

It already gave us all the resources we needed, and we could all work on the same server instead of each of us using a local setup. Because the extensions are also used in Wikidata, our implementation will very likely behave the same way in the production environment. To check how our code would behave in Wikidata, we set up a proxy to monitor our outputs\footnote{Link to our main example in our proxy: \url{http://Wikidata-math-de.wmflabs.org/wiki/Q10}}.

After we had finished setting up our test environment (Figure~\ref{vagrant}), we could start working on our task.

(Vagrant Figure)

\section{Basic implementation}

%TODO: Add project Gantt chart from the presentation.

The implementation started on 14 December.

As in the rest of the project, we used PHP to implement everything.

Unlike the other data types, the files for this data type are located in the Math extension.

Our implementation contains three parts: the hook, the Validator and the Formatter. Usually one has to implement a Parser, too. Since we use strings as the input method and string as the value type of our property type implementation, we can simply use the predefined string Parser.

(Validate Code)

The Validator checks whether our input meets our constraints. First, we check whether the given input is a string value, which prevents harmful code injections.
%TODO List those

The second step is to check whether the input is a valid TeX string. This is necessary to prevent errors while formatting the string. In future development, the Validator additionally has to check whether the metadata is properly inserted and matches the related TeX string.

Once the input passes the Validator, it can be saved in Wikidata and later be queried and returned via the Formatter.

The Formatter transforms our TeX string into an output that depends on the use case. It consists of two main functions: construct~\ref{construct} and format~\ref{format}.

(construct Code)

In Wikipedia and Wikidata, there are five different output types in total. To ensure that our format function can tell which kind of formatting to use, we load and store that information in our constructor (Figure~\ref{construct}). Initially, we attempted to pass that information directly to the format function, but the interface did not allow it. We chose to use Wikibase's SnakFormatter formats, which can be viewed as a list of every output environment for our Formatter. This gives results consistent with the rest of Wikidata, although we accept a higher risk of unpredictable errors in case the SnakFormatter changes.

(format Code)

Depending on the format loaded in the construct function, the format function (Figure~\ref{format}) processes the TeX string input and turns it into human-friendly output.

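The interplay of Validator, constructor and format function can be sketched as follows. This is a simplified Python illustration, not the PHP code of the Math extension. The format strings are assumptions modeled on the five output types mentioned above, the brace check only stands in for a real TeX validity check, and the HTML branch escapes the string instead of rendering it.

```python
import html

# Five output formats; the MIME-style strings are assumptions for this sketch.
FORMAT_PLAIN = "text/plain"
FORMAT_WIKI = "text/x-wiki"
FORMAT_HTML = "text/html"
FORMAT_HTML_DIFF = "text/html; disposition=diff"
FORMAT_HTML_WIDGET = "text/html; disposition=widget"
FORMATS = {FORMAT_PLAIN, FORMAT_WIKI, FORMAT_HTML,
           FORMAT_HTML_DIFF, FORMAT_HTML_WIDGET}


def validate(value):
    """Return a list of error messages; an empty list means the value is accepted."""
    if not isinstance(value, str):
        return ["value is not a string"]
    # Placeholder for a real TeX check: only compares the number of braces.
    if value.count("{") != value.count("}"):
        return ["unbalanced braces in TeX string"]
    return []


class MathFormatter:
    """Remembers the output format at construction time and formats accordingly."""

    def __init__(self, output_format):
        if output_format not in FORMATS:
            raise ValueError("unsupported format: " + output_format)
        self.output_format = output_format

    def format(self, tex):
        if self.output_format == FORMAT_PLAIN:
            return tex  # returned unchanged, e.g. for input boxes
        if self.output_format == FORMAT_WIKI:
            return "<math>" + tex + "</math>"  # leave rendering to the wiki
        # All three HTML variants are treated alike in this sketch.
        return '<span class="math">' + html.escape(tex) + "</span>"


tex = r"r = \frac{1}{2} d"
if not validate(tex):
    print(MathFormatter(FORMAT_WIKI).format(tex))
```

Keeping the format in the constructor means that \verb|format| needs only the value as its argument, which matches the interface restriction described above.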
The plain text format is used for input boxes. It returns the input as it is and has to make sure that the input is not changed.

(Value Add in Wikidata Figure)

To display the value in Wikidata, the Formatter uses HTML output. Although there are three different kinds of HTML output, we have not noticed any difference between them. We decided to return the processed string, so that users see the result they expect for the Wikipedia article.

(Value Output in Wikidata Figure)

For Wikipedia articles, we use the same output as for HTML, but we had to change the way the TeX string is processed, because the text in these articles uses another format called x-wiki.

In the future, the format function also needs to separate the TeX string from the meta information to create a fitting output.

While implementing our code, we always had to keep backwards compatibility in mind: it must be possible to upgrade our implementation while information added in an early version still produces the same output in future versions.

Our implementation was supposed to be merged into the production environment on 14 January. Because of the review process, which is explained in the next section, this was postponed to 9 February.

\section{Testing and developer review}

The whole review process for our implementation took place on two platforms, Gerrit and Phabricator. While Gerrit focuses on the code we produced, we used Phabricator to discuss everything around the implementation, from the idea to the final product\footnote{The whole discussion and everything related to that can be found at \url{https://phabricator.wikimedia.org/T67397}}.

To ensure that we did not stray onto a wrong path midway, our code was reviewed several times while we worked on it.

Whenever we implemented something, we uploaded it to Gerrit\footnote{\url{https://gerrit.wikimedia.org/r/\#/c/259167/}}, where our advisor and other Wikimedia developers could review our code and give us feedback.
Over several patch sets, our code was refined, and we were given ideas on how to improve it to match the other services in Wikidata. One of the comments led to a discussion about whether we wanted one format function that internally chooses the appropriate output format, or five different Formatter files to keep the code simpler and well separated. After some thought, we chose the single function, as the content was not complex enough to justify separating it.

For Wikipedia users, this would not have changed anything, and if we later see the need to split the format function into five, users must not notice any of this either.

While working on our basic implementation, we also implemented our own tests to check whether our functions produce the output or error we expect.

Before our implementation was merged into the production environment, it was tested in the beta cluster\footnote{Mainly tested on \url{http://Wikidata.beta.wmflabs.org/wiki/Q117940}} of Wikipedia.

In the beta cluster, multiple users and developers can check our implementation in an environment closer to the actual Wikipedia and Wikidata. At that stage, we had a discussion about how to name the new data type, as some developers thought that ``math formula'' was too restrictive. The discussion was triggered by a missing message that we had forgotten to implement because it was not necessary in our Vagrant environment. In the end, everyone agreed on ``Mathematical expression'', which suggests more than just relations between entities. The name change, however, caused a bug related to JavaScript caches for some Chrome users, which delayed the deployment. After we found a solution to the bug, we also had to make sure that no user would be affected by it anymore. Our implementation became available in the production environment on 9 February.

\section{Community impact}

Before the community can start to replace the formulae in the articles with Wikidata values, at least one property with the data type Mathematical expression is needed. Because multiple items depend on one property, we cannot simply add one ourselves, as Wikidata has to make sure that everything stays clean and working. On the one hand, we proposed to change the existing property TeX String\footnote{\url{https://www.Wikidata.org/wiki/Property:P1993}} from data type String to data type Mathematical expression. On the other hand, we proposed some other properties such as defining formula and probability density function\footnote{Both can be found in \url{https://www.Wikidata.org/wiki/Wikidata:Property_proposal/Natural_science}}.

Both proposals were discussed within the community before they were actually implemented in Wikidata by users with the appropriate rights.

On 15 February, the first property with data type ``Mathematical expression'' was approved and enabled.

\section{Future development}

One question we are facing is where to add the meta information in our string.

Adding it inline results in an intuitive mapping. However, it makes the string confusing if there are many annotations or if they are nested.

If the meta information is added separately, outside the formula, formatting becomes easier, as the user can easily see what the formula will look like in the end. The downside of that approach is the obstacle the user faces when trying to edit the formula, depending on how the mapping is implemented.
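To make the trade-off more concrete, the inline variant can be converted into the separate variant mechanically, at least for annotations that are not nested. The following Python sketch assumes the \verb!#symbol|item#! notation from the property type example. The names \verb|ANNOTATION| and \verb|split_inline| are hypothetical, and nested annotations are deliberately not handled.

```python
import re

# Matches "#<TeX fragment>|<item label>#"; nesting is not supported.
ANNOTATION = re.compile(r"#([^#|]+)\|([^#|]+)#")


def split_inline(annotated):
    """Separate an inline-annotated string into plain TeX and a symbol-to-item legend."""
    legend = {}

    def strip_annotation(match):
        symbol = match.group(1).strip()
        item = match.group(2).strip()
        legend[symbol] = item
        return symbol  # only the TeX fragment stays in the formula

    tex = ANNOTATION.sub(strip_annotation, annotated)
    return tex, legend


tex, legend = split_inline(r"f(x) = \sum\limits_{i = 0}^{n} #a_i|coefficient# x^i")
print(tex)
print(legend)
```

A conversion like this would let editors type annotations inline while the Formatter works on the separated form. Whether this is practical depends on how nesting and the TeX macro character \verb|#| are handled.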