ROUGH DRAFT authorea.com/6440

# Introduction

This is a hand-in written by Group 60 (FIRE) for the weekly module #5 in the mathematical modelling course (DAT026) at Chalmers University of Technology.

Group 60 consists of:

• 901011-0279

• twingoow@gmail.com

• Program: IT

• Time spent: 19 hours, 53 minutes

• 920203-0111

• nicale@student.chalmers.se

• Program: IT

• Time spent: 27 hours, 19 minutes

We hereby declare that we have both actively participated in solving every exercise, and all solutions are entirely our own work.

Disclaimer: Please note that the numbering of the sections in the TOC and elsewhere is not intended to follow the numbering of the problems. The problem referred to is always explicitly stated in the title itself (e.g. Problem 1...).

# Problem 1 - Model types of common programs & systems

Here we briefly discuss what model types the following common programs or systems are based on.

• A text model in a word processor can be modelled using semi-structured data (SSD) such as XML, for example as a tree of string sequences. To process the paragraphs, sentences, and words, the concept of regular languages (and regular expressions for searching in the model) and the mathematical machinery it entails can be used, as well as statistical analysis for checking whether a word is particularly “non-Swedish” or whether a word combination is statistically unlikely to be grammatical. The statistical analysis can be done using Markov chains.

• An outliner (structured text processor) can at the most basic level be modelled as a tree of items. Depending on its type, each item can carry various states, such as “enabled/disabled”. More advanced outliners that display the states of items in a tabular format also need to be able to represent the same data as a grid.

• The web, assuming it is not the same thing as the internet but primarily refers to the content of webpages on the internet, can be modelled by taking advantage of the fact that hyperlinks are a form of pointers that together form a directed graph with websites as nodes. The edges of this graph can be weighted by how many times one page links to another, and the more incoming edges a page has, the more important that page is. This is analogous to the friend network on e.g. Facebook, or to life in general: you can know someone without that person knowing you back (algorithms can be written for this “celebrity problem”), and you can also know someone more or less well. Google uses this type of modelling in its PageRank algorithm to enable searching the web.

• An image model in a graphics program is usually modelled either with raster graphics, which use a matrix/grid of pixels and their colors, or with vector graphics, which have no concept of pixels and can therefore be scaled without losing any information. In vector graphics the data can be paths and various shapes, including polygons; to model complex paths, splines can be used. Compared to raster graphics, vector graphics have a very high information density.

• JPEG is a group of lossy compression algorithms for bitmap images that is particularly suitable for photographs, because they contain a lot of blurred, smoothly varying regions. Many lossy compression algorithms use blurring to increase similarities between parts of an image and thus increase the compression rate. A lossy compression is one where information is lost in the process. Whatever the technical details of JPEG may be (which we do not have expertise in), it is certain that the image either is translatable to a grid/matrix or is stored as such. The art of compression is related to how surprised we are, i.e. the surprisal $$s(p) = -\log{p}$$ and statistical analysis.

• A database can be modelled in various ways depending on what type of database we are dealing with. If the database is relational (which most are), it can be modelled with relations between tuples and attributes of data using relational algebra, which extends set theory and provides the mathematical basis for relational databases. In terms of data structures, grids/tables/matrices are another useful way of looking at an RDBMS: in this light, rows are tuples and columns are attributes. Using functional dependencies and the concept of keys, constraints can be added to the database, and closures of attribute sets can be computed mathematically. Not all databases are relational, however: some newer databases store semi-structured data such as JSON or XML, which is modelled using graphs or trees where the data describes itself, for example via tags.

• A travel search (e.g. Västtrafik) can be modelled as a directed graph with the existing terminals (bus stops, etc.) as nodes and the travel times (walking, bus, tram, etc.) as the weights of the edges between them. Finding a route is then a shortest-path problem that can be solved using Dijkstra’s algorithm. A real model needs to take more things into account, but this is the basic idea.
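As a minimal sketch of the travel-search idea (the stop names and travel times below are made up for illustration), Dijkstra’s algorithm over a small weighted graph of stops could look like this:

```python
import heapq

def dijkstra(graph, start):
    # graph: {node: [(neighbour, travel_time), ...]}
    # Returns the shortest travel time from `start` to every reachable node.
    dist = {start: 0}
    queue = [(0, start)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbour, weight in graph[node]:
            nd = d + weight
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                heapq.heappush(queue, (nd, neighbour))
    return dist

# Hypothetical stops with travel times in minutes
stops = {
    "Chalmers":         [("Korsvagen", 3), ("Kapellplatsen", 2)],
    "Kapellplatsen":    [("Chalmers", 2), ("Vasaplatsen", 4)],
    "Korsvagen":        [("Chalmers", 3), ("Centralstationen", 10)],
    "Vasaplatsen":      [("Kapellplatsen", 4), ("Centralstationen", 6)],
    "Centralstationen": [],
}
print(dijkstra(stops, "Chalmers")["Centralstationen"])  # → 12
```

A real journey planner would also need timetables, transfer penalties and multiple modes of transport, but the core of the model is exactly this weighted shortest-path search.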
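The link-graph idea behind web search can likewise be sketched. Below is a toy power-iteration version of a PageRank-style ranking over a made-up three-page web; the page names, damping factor and iteration count are illustrative assumptions, not Google’s actual implementation:

```python
def pagerank(links, damping=0.85, iterations=50):
    # links: {page: [pages it links to]}
    # Repeatedly redistribute rank along outgoing edges until it stabilises.
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                # Dangling page: spread its rank evenly over all pages
                for p in pages:
                    new[p] += damping * rank[page] / len(pages)
            else:
                for target in outgoing:
                    new[target] += damping * rank[page] / len(outgoing)
        rank = new
    return rank

# Hypothetical three-page web: c.se has the most incoming links
web = {
    "a.se": ["b.se", "c.se"],
    "b.se": ["c.se"],
    "c.se": ["a.se"],
}
ranks = pagerank(web)
```

Here `c.se` ends up with the highest rank, matching the intuition above: the more pages that point at you, the more important you are.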

# Problem 2 - Rules or probability

Using Markov chains, we previously saw how we could first learn characteristic properties of a given language by analysing texts given to us, e.g. “How common is the letter A in English, and how often is an A followed by a Z?”, and then probabilistically assess which language a given word is written in.

Nothing should stop us from doing the same at the word level. We can ask how likely it is that a given word is followed by another, and establish a form of “probabilistic grammar” by searching for patterns and comparing languages, or, for example, predict which word the user will type next on a smartphone and save her the hassle of typing it herself (keyboard applications for Android such as SwiftKey and Fleksy do this all the time). We could also use the poem by Edgar Allan Poe we previously used as input and, with this method, get a much more natural-language-like output.
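As a minimal sketch of this word-level prediction (the toy corpus below is made up and far too small for real use), a first-order Markov model can be trained simply by counting word bigrams and suggesting the most frequent successor:

```python
from collections import defaultdict

def train(text):
    # Count word bigrams: transitions[w1][w2] = how often w2 follows w1
    transitions = defaultdict(lambda: defaultdict(int))
    words = text.split()
    for w1, w2 in zip(words, words[1:]):
        transitions[w1][w2] += 1
    return transitions

def predict_next(transitions, word):
    # Suggest the most frequent successor, like a keyboard app would
    followers = transitions.get(word)
    if not followers:
        return None
    return max(followers, key=followers.get)

corpus = "the raven sat on the chamber door and the raven said nevermore"
model = train(corpus)
print(predict_next(model, "the"))  # "raven" follows "the" most often here
```

A real keyboard application would train on far larger corpora and on the user’s own typing history, and would typically offer the top few candidates rather than a single word, but the underlying model is the same transition-counting idea.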

Now let’s consider a probabilistic programming language based on Markov chains. How would such a program behave, if it ran at all? The technical support department’s phones would be ringing non-stop with complaints about bugs. Probabilistic models using statistical analysis are not deterministic (in the sense that we always get the same output for the same input), and this is the essential difference: a code compiler or interpreter must be able to guarantee deterministic behavior.

For a programming language, a rule/syntax/grammar-based model is fine, since the rules are few and well defined; they can easily be represented with an Abstract Syntax Tree (AST). Human languages are a different matter: they are full of idiomatic expressions, errors, etc. that are infeasible to capture with a formal grammar.

What about cases where a programmer makes a mistake and writes “whil” instead of “while”? A probabilistic model could decide that the two are so similar that the compiler should silently rectify the error, but this would lead down a slippery slope of bad practices and ambiguity. However, an IDE that is aware of the specific grammar could and should realize that the programmer mistyped and either autocorrect it or suggest a correction to the writer.
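A minimal sketch of how such an IDE suggestion could work, assuming a hypothetical keyword list and using the standard Levenshtein edit distance (minimum number of insertions, deletions and substitutions):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical keyword list for a tiny language
keywords = ["while", "for", "if", "else", "return"]

def suggest(token, max_distance=2):
    # Suggest the closest keyword instead of silently "fixing" the code
    best = min(keywords, key=lambda kw: levenshtein(token, kw))
    return best if levenshtein(token, best) <= max_distance else None

print(suggest("whil"))  # → "while" (edit distance 1)
```

The `max_distance` cutoff is the crucial design choice: it keeps the tool on the safe side of the slippery slope, offering a suggestion only when the typo is unambiguously close to a known keyword.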