Introduction
The primary purpose of this research project is to identify and analyse the factors that determined the quality of high school education in Medellin between 2004 and 2014. Its anticipated results will enable policymakers to formulate policies that improve the quality of education and of vocational training in science, technology and innovation.
This report presents the work assigned to Polymaths. The Company was tasked with designing a computational model that allows policymakers to simulate scenarios affecting the quality of high school education and its formative relationship with science, technology and innovation.
There has been much discussion about how Big Data can realise the ideal of evidence-based policy. However, achieving this is not as straightforward as it seems. While governments can generate and collate vast amounts of data, making use of it requires more than simplistic data analysis; it requires sophisticated mathematical and statistical methods.
The mathematical and statistical methods we choose determine the kinds of questions we can ask and answer. If we just wanted to know the average age of the student population, standard data analysis and descriptive statistics would suffice. But if we wanted to ask whether a particular section of the student population will attain educational success, this approach could not answer the question, and we would need a different way to ask such probing questions.
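To make the contrast concrete, the following minimal sketch (in Python, with invented data and column names) shows the difference: the descriptive question is answered by a single summary statistic, while the predictive question needs a model relating the available variables to the outcome.

```python
import pandas as pd

# Hypothetical student records; the values and column names are invented.
students = pd.DataFrame({
    "age": [15, 16, 16, 17, 15, 18],
    "passed_final_exam": [1, 1, 0, 1, 0, 1],
})

# Descriptive question: the average age of the student population.
print("Average age:", students["age"].mean())

# Predictive question: "will a given group of students attain educational
# success?" No summary statistic answers this; it requires a model of how
# the available variables relate to the outcome of interest.
```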
This difficulty is one of the main challenges Big Data presents to policymakers, and it arises for three reasons:
- The data is broad but shallow. It covers many categories, such as births, tax records and marriages, but there are few observations per category. For example, we may have a citizen's birth record, but not know the parents' occupations or the baby's birth weight.
- Data is in silos
- Statistical correlations don’t explain why something is happening
We have a breadth of data about citizens derived from governmental records such as births, tax records and marriages. However, this data often comes in a raw form that is difficult to interpret, and is therefore of limited use in the decision-making process. Simple exploratory analysis, such as plotting the data on maps, highlighting significant correlations, or just examining the trend of some variables, can be very useful for understanding the big picture and can help in working out more informed policies.
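As an illustration of this kind of exploratory analysis, the sketch below (the file and column names are hypothetical) ranks the strongest pairwise correlations and tracks the trend of a single indicator over time.

```python
import pandas as pd

# Hypothetical merged data set of municipal records; names are illustrative.
df = pd.read_csv("municipal_records.csv")

# Rank the strongest pairwise correlations between numeric variables
# (the 1.0 self-correlations at the top of the list can be ignored).
corr = df.corr(numeric_only=True).abs()
print(corr.unstack().sort_values(ascending=False).head(10))

# Examine the trend of a single variable over time.
print(df.groupby("year")["exam_score"].mean())
```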
What would be even more useful is to understand the interplay between all of these variables: how they influence each other, and what happens to one variable when another is changed. This would enable us to develop better policies, and also to influence the behaviour of citizens.
A Solution: Tackling the Problem with Bayesian Networks
One of the tools that has recently attracted a great deal of attention for helping policymakers do their job is the Bayesian Network, also known as a Belief Network. These models provide a powerful way to derive insights about human behaviour from expert knowledge and from data that is both wide-ranging and incomplete.
In the era of Big Data we have multiple heterogeneous data sources, but the information they contain is often incomplete. Data about individual citizens illustrates this: we have limited information about each individual, despite having a vast breadth of data about the population of a city or country as a whole. This lack of depth means we cannot be certain from the available data how different citizens will behave, since no two citizens are exactly alike.
Bayesian Networks, however, allow us to integrate expert knowledge with data from multiple heterogeneous sources; they therefore allow us to model the heterogeneity of citizen behaviour and to update our knowledge (or belief) about each variable. As a result, they can capture the complexity of human behaviour, and of the behaviour of societies, in ways that cannot be achieved with classical statistical methods.
Bayesian models also enable us to update our knowledge about each variable as we gather more evidence. This means we can update models in real time as we learn more about each variable in the system; using machine-learning techniques, we can build Bayesian Networks that automatically estimate all of the parameters in the system.
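A minimal sketch of this idea, using the open-source pgmpy library as one possible tool (the network structure, variable names and data are invented for illustration): the conditional probabilities are estimated from data, and the belief about the outcome is then updated as evidence about another variable arrives.

```python
import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

# Hypothetical discretised observations; variable names are illustrative only.
data = pd.DataFrame({
    "parental_education": [0, 1, 1, 0, 1, 0, 1, 1],
    "school_funding":     [1, 1, 0, 0, 1, 0, 1, 0],
    "exam_result":        [1, 1, 0, 0, 1, 0, 1, 1],
})

# Fix a simple structure and estimate every conditional probability from data.
model = BayesianNetwork([("parental_education", "exam_result"),
                         ("school_funding", "exam_result")])
model.fit(data, estimator=MaximumLikelihoodEstimator)

# Update our belief about the exam outcome once evidence is observed.
infer = VariableElimination(model)
print(infer.query(["exam_result"], evidence={"school_funding": 1}))
```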
The process of building this kind of model is very simple:
- Using the available data, we run algorithms to discover possible correlations between the variables. This will be the skeleton of our Bayesian Network.
- Then we use machine-learning algorithms to try and extract a possible set of causal relations.
- This network represents the set of beliefs extracted from the data. It is then checked by an expert, who fixes the direction of links known to run one way or the other, adds obvious links and eliminates impossible ones. This is the most important phase: the one in which expert knowledge is combined with the knowledge extracted from the data (see the sketch after this list).
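As a rough sketch of these steps, under the same assumptions as above (pgmpy, with hypothetical file and variable names), a score-based structure search proposes a skeleton from the data, and the expert's review can be encoded as lists of forbidden or required edges passed back to the search.

```python
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore

# Hypothetical discretised data set; file and column names are placeholders.
data = pd.read_csv("education_indicators.csv")

# Steps 1-2: search for a plausible network structure directly from the data.
search = HillClimbSearch(data)
learned = search.estimate(scoring_method=BicScore(data))
print("Edges suggested by the data:", list(learned.edges()))

# Step 3: expert review, expressed here as constraints on a second search.
# The black list forbids edges the expert knows to be impossible; a white
# list could likewise restrict the search to edges the expert allows.
revised = search.estimate(
    scoring_method=BicScore(data),
    black_list=[("exam_result", "parental_education")],
)
print("Edges after expert constraints:", list(revised.edges()))
```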
Given this network, we can run simulations, changing one variable and observing what happens to the others.
We can evaluate the effectiveness of public policies by predicting their effects and identifying the variables most likely to be involved in producing them.
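As a final, self-contained sketch of such a simulation (again with pgmpy; every name and probability below is invented purely for illustration), we fix a policy-related variable at two different values and compare the predicted distribution of the outcome.

```python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# A deliberately tiny, hand-specified network; all probabilities are made up.
model = BayesianNetwork([("teacher_training", "exam_result")])
model.add_cpds(
    TabularCPD("teacher_training", 2, [[0.7], [0.3]]),
    TabularCPD("exam_result", 2,
               [[0.6, 0.3],    # P(fail | training = 0, 1)
                [0.4, 0.7]],   # P(pass | training = 0, 1)
               evidence=["teacher_training"], evidence_card=[2]),
)
model.check_model()

# Simulate the two scenarios by fixing the policy variable and comparing
# the resulting belief about the exam outcome.
infer = VariableElimination(model)
print(infer.query(["exam_result"], evidence={"teacher_training": 0}))
print(infer.query(["exam_result"], evidence={"teacher_training": 1}))
```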
Data
We have been provided with a very wide range of data. On the one hand, this is a great opportunity to discover interesting correlations and connections between variables that belong to different data sets, perhaps owned by different institutions, that have never been analysed together.