On the other hand, this kind of data required a long cleaning effort that consumed most of the time available for this project, even though we did this work with the help of the Thot Group.
A thorough description of the data sets that were given to us is available in Spanish in the Thot Group report "Informe estadístico descriptivo".
Here we limit ourselves to highlighting the major issues, to make you aware of the data analysis pipeline for the project. We hope that this will make the required steps clear and help you plan accordingly for the continuation of this project.

Cleaning Data

One of the first steps of a data-driven project is always cleaning the data. Cleaning the data means putting it into a format that is manageable with the models and software that will be used. This is usually the most time-consuming part of a project, for several reasons: it is often said that, on average, a data scientist spends around 90% of their time collating and cleaning data, and this percentage depends strongly on the quality of the data at the outset.
In this project, data cleaning was a long process, given the huge variety of data formats and other problems. First, the Thot Group cleaned the data and put it into a simpler text format; then we had to do a further round of cleaning to put it into a format tractable by the model we built.
This is a point we want to make very clear: data cleaning is an essential part of the workflow, and it is not done within the model. This means that, to use the model, the data must be provided as input in the right format:
a .csv file, comma-separated, with headers.
In practice, it should be one large matrix whose first line contains the names of the variables.
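As a minimal sketch of what this input looks like and how it would be loaded (the file name and column names here are hypothetical):

```python
import pandas as pd

# Expected input: a comma-separated .csv file whose first line holds the variable names, e.g.
#   dane_code,year,socio_economic_status,saber11_math,saber11_reading
df = pd.read_csv("clean_dataset.csv")

print(df.columns.tolist())  # the variable names, taken from the header row
print(df.shape)             # (number of records, number of variables)
```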

Issues of data quality

The main issues with the data we received are illustrated here. We report them because data quality is a very important factor in any project involving data, and because knowing what was wrong with these data should inform the choices you will make in the future with regard to data selection and preparation. As explained before, having good data will significantly reduce the time needed to clean it. We also want to remind you that the model is not usable unless it is provided with data in the correct format.

Data integrity

Data integrity is very important, because it is crucial in deciding whether or not to use a particular data set. Several problems may preclude us from using a data set:
  1. There might be missing data. In general, missing data is not a big problem, because there are techniques to deal with it, but only up to a certain point. Many of the data sets we received had variables with a significant amount of missing values, sometimes in the order of 50-70%. This can happen when no planning was done before starting to collect the data: at some point in time, i.e. starting from a certain year, new variables appear that were not there before. If we wanted to use these variables, we could only use the records (or rows) for which all the other variables are also available. If you consider that this can happen for many different variables, you can see that it quickly reduces the amount of data we can actually use, as the sketch after this list illustrates.
  2. Names of variables. Sometimes variable names do not make sense or are unintelligible, and therefore cannot be used. We are trying to build a model that should be human-readable and intelligible, and if variable names do not make sense we have to exclude those variables. Sometimes two or more variables are merged into a single one with a new name, possibly even containing new, different data. Without information about how this process was done and how these choices were made, we again have to exclude such variables.
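A minimal sketch of how missing values shrink the usable data, assuming a hypothetical file and hypothetical variable names:

```python
import pandas as pd

df = pd.read_csv("clean_dataset.csv")  # hypothetical cleaned data set

# Fraction of missing values per variable: columns that only start being
# collected in a later year show up here with very high percentages.
print(df.isna().mean().sort_values(ascending=False))

# Keeping only the records that are complete for every variable we want to
# use can drastically reduce the number of usable rows.
complete = df.dropna(subset=["socio_economic_status", "saber11_math"])
print(f"{len(df)} records before, {len(complete)} after dropping incomplete rows")
```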

Data coherence

These problems are much more common, but they are still significant, and they can be reduced. Here the focus is on how we can put different data sets together in order to discover relationships that the single data sets could not possibly show.
Data sets are not coherent across different files for the following reasons:
  1. Data formats. Data may have been stored in different formats such as Excel spreadsheets, csv files and text files, and may use features specific to the particular software that was used to produce them;
  2. Time. Some data sets cover a certain period of time, and other data sets cover another. If they overlap this is good, but often the overlap is very small, of just one or a few years. This, in general, means a very big reduction in the number of records (or rows) that we can use, limiting the power of the model or making it impossible to train.
  3. Granularity. Some of the data sets we were provided have information at the student level, some at the school level, and some at the comuna level. Data at the school level can be merged with data at the student level, provided we know which school each student went to. This merge is feasible, even though, because of the difference in granularity, all the students of a given school will share the same values for the school-level variables. The same goes for comunas. Merging can be used to discover new relationships among variables that were not contained in the original data set, but moving to a coarser level means losing accuracy. This is a tradeoff open to discussion. Our recommendation is that going from the student level to the school level is acceptable, but not much further, otherwise correlations could be very strong for an artificial reason.
  4. Key variables. Following from the previous point, to merge two different data sets we need a key variable, i.e. a variable whose value allows us to link the two data sets. For example, if we want to merge a data set at the student level with one at the school level, the key variable is the school. The name of the school is not suitable because of the way the data are recorded: it is very likely that the same school is not written in the same way in two different data sets, because of typos or abbreviations. A better key variable is the DANE code. We found during the cleaning process that there are different kinds of DANE codes and that some data sets contain one type and others a different one, and it took some time to merge these data sets. In that case the process was only difficult; in some cases it is not even possible. This is the case, for example, when we have two data sets at the student level: in general each of them contains a sample of the students and, even if they contain all of the students, they usually do not share an identifier, a key variable, that allows us to link the data about the same student across the two data sets. To do that, a unique identifier would have had to be used consistently when collating those data sets. A sketch of the kind of merge we could perform is given after this list.
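As a sketch of the student-to-school merge described above (file and column names are hypothetical, and we assume both files carry the same type of DANE code after cleaning):

```python
import pandas as pd

students = pd.read_csv("students.csv")  # one row per student, including the school's DANE code
schools = pd.read_csv("schools.csv")    # one row per school, keyed by DANE code

# Merge on the key variable: every student inherits the variables of their school,
# so all students of the same school share the same school-level values.
merged = students.merge(schools, on="dane_code", how="inner", validate="many_to_one")
print(merged.shape)
```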

The need for planning

The past decade has seen us produce an enormous amount of data, more than was produced from the beginning of history until ten years ago. Of course, most of the time these data were not collected for a particular purpose, or each data set was collected for its own purpose, and nobody planned to use it together with other data sets.
Now, in a data-driven society, in order to make informed data-driven choices, we can no longer afford not to know why we are collecting data. While historically the process was reversed, to make it efficient we must first know what we need and what we want to do with it. Only then should we collect the data and use it for our purposes.
Deciding what data to collect, then, requires careful planning. It also requires expertise both from people with expert knowledge in the field of policymaking, in your case in Medellin and Colombia, and from people with a more data-oriented skill set, who can see connections and technical problems or advantages where others cannot.
Had we had more time, this is something we would have wanted to explore, and it might form the basis of a future project. Everything contained in this chapter is in fact a recommendation on how to act in the future with regard to this particular aspect, in order to make your work more efficient and produce more in less time.

Methods

In this section we give a conceptual understanding of Bayesian networks without going into a detailed explanation of the mathematics: this is not your area of expertise, it is not required here, and the mathematical details can be found in a good textbook. Rather, we aim to give an explanation that can be understood easily.

Correlation Matrix

The correlation matrix is a very simple, visual tool that allows us to understand at a glance the correlations among the variables in our system. An example is shown in Fig X.
The rows and columns list the variables in our data set, in the same order. Since each variable is, trivially, perfectly correlated with itself, the diagonal shows perfect correlation; the off-diagonal entries are the ones we are actually interested in.
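A minimal sketch of how such a matrix can be computed and visualised in Python (the file name is hypothetical; the plotting details are just one possible choice):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("clean_dataset.csv")    # hypothetical cleaned data set
corr = df.corr(numeric_only=True)        # pairwise correlations; the diagonal is exactly 1

fig, ax = plt.subplots(figsize=(8, 8))
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="correlation")
plt.show()
```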
A correlation between two variables does not by any means tell us which one is the cause and which one is the consequence. For example, it does not tell us that socio-economic status has a causal relationship with the scores on the Saber 11 exam; it only tells us that the two variables change together. If the correlation is positive, we can say a little more: when the value of one variable increases, so, on average, does the value of the other.
There are algorithms that reorder the matrix so that variables that are strongly correlated with each other end up close together, which makes it much easier to see visually where the correlations are. This is very helpful because it shows that there are groups of variables that are strongly interconnected. For example, when we look at the variables representing the performance of students in the various disciplines, we see clearly that they are all strongly correlated with each other. This makes sense, but in terms of running more complex algorithms it gives us no more information than a single aggregate of these variables would, while significantly increasing the complexity of our model.
It is also possible, with more complex clustering algorithms, to define a threshold for how variables are grouped, so that we can automatically extract which variables we should use for further analysis.
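As a sketch, one common approach (not necessarily the one used here) is to cluster the variables hierarchically using 1 - |correlation| as a distance and cut the tree at a chosen threshold; the threshold value below is purely illustrative:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# corr is the correlation matrix computed above; 1 - |corr| behaves like a distance
dist = 1 - corr.abs()
condensed = squareform(dist.values, checks=False)  # condensed distance vector for linkage
Z = linkage(condensed, method="average")

# Variables whose mutual distance falls below the threshold end up in the same group;
# each group could then be represented by a single aggregate variable.
labels = fcluster(Z, t=0.3, criterion="distance")
for name, group in zip(corr.columns, labels):
    print(group, name)
```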
Looking at the correlation matrix is only a first step. It tells us a good deal, but does not let us do much with that information apart from acknowledging it and keeping it in mind. More advanced algorithms allow us to discover more hidden and interesting relationships between the variables.

Bayesian Networks

Bayesian networks are a very interesting tool and constitute the main portion of the analysis we performed. We want to emphasise how powerful this tool is for helping policymakers take data-driven decisions and make an impact in education and innovation.

How the model is trained

Bayesian networks are an exceptional tool because they are not completely automatic: the user can and should intervene in the process of their construction. The basic steps of the training process are as follows.
  1. Independence tests are performed between each pair of variables. This means calculating correlations and then putting a link between two variables if they are sufficiently correlated.
  2. Then, algorithms are used to infer causality. Without going too deep into the details, a different kind of independence is tested, called conditional independence. This allows us to establish the direction of causality for most links; the remaining ones are assigned with the requirement that the network does not contain loops. This step is very important to understand, because it is the one in which major insights can be found. In fact, two variables may not be significantly correlated according to, for example, the correlation matrix, yet still be correlated through another variable. This tells the user that those two variables must be considered carefully: even if they appear independent, they can actually influence each other by means of a third variable.
  3. Given the unsupervised nature of these first two steps, this is where interesting insight can be obtained, but also where expert knowledge is required. Depending on data quality and quantity, there may be links that really do not make sense, and there may also be links that should be there but are not. In this case, it is the responsibility of the expert in the particular field, here someone knowledgeable about education policymaking in Colombia, to adjust these links so that they make sense. This must not be extensive surgery, otherwise the whole process would lose its meaning, and the more adjustments one makes, the easier it is to break the mathematics of the model. But if a model trained on the data did not show a causal relationship going from socio-economic status to student performance, the expert should definitely impose it on the model. This step is optional, but highly recommended, because it is the moment at which the machine learning algorithm can be taught things it cannot discover from the data.
  4. Finally, once we are happy with the network, the parameters are calculated, which allows us to run simulations. A sketch of this training workflow is given after this list.
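A sketch of the training workflow, assuming the pgmpy library, discretised data and hypothetical variable names (class names may differ between pgmpy versions, and hill climbing with a BIC score is just one common structure-learning choice):

```python
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore, BayesianEstimator
from pgmpy.models import BayesianNetwork

df = pd.read_csv("clean_dataset.csv")  # hypothetical cleaned, discretised data set

# Steps 1-2: learn the structure (which links exist and their direction) from the data.
dag = HillClimbSearch(df).estimate(scoring_method=BicScore(df))
model = BayesianNetwork(dag.edges())

# Step 3: expert adjustment, e.g. impose a link that the data alone did not reveal
# (the variable names here are purely illustrative).
if not model.has_edge("socio_economic_status", "saber11_score"):
    model.add_edge("socio_economic_status", "saber11_score")

# Step 4: estimate the parameters (conditional probability tables) for simulation.
model.fit(df, estimator=BayesianEstimator, prior_type="BDeu")
```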
Once the model is trained, we can use it to make predictions. These predictions must not be thought of as completely accurate. We know for a fact that, even if we had planned the whole study ourselves, there would still be many variables hidden from us, simply because we do not have them. When we make a prediction, we can operate on the variables we have, but not on those we do not have. This means that when we slightly change the model, the results we obtain are correct only as a first-order approximation, and they might be significantly different in quantitative terms. We think, though, that the power of this tool is that it offers ideas and insight that the user could not possibly come up with without putting all these data together.

How the model is used

There are two major ways of using this model. They basically do the same thing, but in terms of human thinking they are radically different. Understanding these two ways of probing the Bayesian network is crucial with regard to study planning, and also with regard to actually using the model.
  1. The straight way: setting the causes and looking at the consequences. This is the way these models have traditionally been used. To make it clear we will use an example. Imagine you have enacted a policy that gives funds to schools to improve the school environment in terms of buildings, labs, libraries, etc. This method allows you to check how the other variables will change if you enact this policy. Maybe it will show that the quality of teaching improves and that the performance of students in the exams improves. The answer will be quantitative but, as said before, we should take it as just a guide: if we make a certain change in the school environment, will the change in student performance be comparably significant, less significant or more significant?
  2. The reverse way: setting the consequences and looking at the causes. The previous mode is already useful on its own, but we believe that the possibility of operating the model in this way makes it even more interesting, because it is more focused on the impact we want to deliver and on understanding how to deliver it. In the reverse way, we first set what we would like the results of our policy to be. Working by example again, imagine that we would like to improve (i.e. reduce) the rate of students who drop out of school or have to repeat the year. We may then require all students to be in the lowest possible class, i.e. the one with the minimum percentage of students dropping out or repeating the year. If we do that, the model will automatically calculate how the other variables should change in order to achieve that result. Maybe it will tell you that you need to improve the school environment, to increase the quality of teachers, or to centralise students into bigger schools. As before, these will be quantitative predictions, but they should only be taken as qualitative suggestions, i.e. ideas for policies that could be enacted in order to reach a result. Some of them will be unfeasible, some will be easier than others. At this point, it is the responsibility of the expert to analyse and summarise these predictions and make the most of them. A sketch of both kinds of query is given after this list.
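Continuing the hypothetical sketch from the training section (variable names and states are purely illustrative), both ways of probing the network are conditional probability queries, with the roles of evidence and query variables swapped:

```python
from pgmpy.inference import VariableElimination

infer = VariableElimination(model)

# The straight way: fix a cause and look at the consequences,
# e.g. what happens to exam performance if the school environment is improved?
print(infer.query(variables=["saber11_score"],
                  evidence={"school_environment": "high"}))

# The reverse way: fix the desired consequence and look at the causes,
# e.g. which conditions are most compatible with a low dropout rate?
print(infer.query(variables=["school_environment", "teacher_quality"],
                  evidence={"dropout_rate": "low"}))
```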