We have performed this analysis for the network considered in this project, and it is reported in the Results section.
This concludes the summary of the methods used in this project. In the next section we explore what you can do with the model we are providing, how to build on it for the continuation of this project, and how to use it more generally in the future.
Discussion
In the previous sections we have illustrated all the tools that we used during this project. The aim of this section is to explain how you can start from there and take the project forward.
Insights from the data
What we have been trying to do in this stage of the project is something extremely interesting and unprecedented in the history of policymaking in Colombia, and it can be summarised as follows:
The value of the sum of the data sets is not the sum of the values of the data sets. In fact it is greater.
This aspect is very important and is at the core of this project. When we put together different data sets and are able to link them by means of a key, a common variable that uniquely identifies a student or, more likely in our case, a school, we can discover correlations not only within individual data sets but, most importantly, across them. This is the real value of this kind of project.
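As an illustration of what this linking looks like in practice, here is a minimal sketch using Python with pandas; the file and column names (including the `dane_code` key column) are hypothetical placeholders, not the actual names in our data sets.

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
# The key is the DANE code, which uniquely identifies a school across data sets.
schools = pd.read_csv("school_level_data.csv")    # one row per school
students = pd.read_csv("saber11_results.csv")     # one row per student

# Attach the school-level information to each student record via the key.
merged = students.merge(schools, on="dane_code", how="left")

# Correlations can now be computed across variables that originally lived
# in different data sets.
print(merged[["saber11_avg_score", "computers_per_student"]].corr())
```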
As we have seen, once we discover correlations we can take a further step and infer causal relationships, that is, what drives what. This is the kind of insight we need in order to advise the policymaking process and produce data-driven policies.
As we have discussed before, poor planning and poor data quality mean that this value, while potentially very high, is in practice not particularly high. Still, there is information we can draw from putting these data sets together that we could not have obtained otherwise.
In this study the data available to us was mainly at the school or student level. Data sets at the school level could be linked to data sets at the student level, but data sets at the student level could not be linked among themselves because there was no unique identifier. For this reason, we missed part of the information that we could have obtained by merging those data sets. Furthermore, data at a scale coarser than the school level (e.g. comunas) is too distant from the student level to be useful: students in different comunas might be very similar, and students within the same comuna might be very different, without belonging to a particular comuna explaining either. We think that for this kind of study it would be good to have data at the student level, but if this cannot be achieved consistently, then the school level should be used.
Also, the correlation matrix should be used to reduce the number of variables to a tractable one. Variables that are highly correlated with each other in blocks (for example the scores in the various subjects of the Saber 11 exam) do not add significant information: all but one should be excluded, or they can simply be replaced by their average. This is what we did to reduce the number of variables. Furthermore, when variables change over time (some are added, some are no longer recorded, some are merged), we cannot use them efficiently; once a period for the experiment is set, the variables should not change during that period. Finally, many variables had unintelligible names, so we simply could not tell what they represented.
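As a concrete sketch of this reduction, assuming a merged pandas data frame `df` and hypothetical column names for the Saber 11 subject scores, one could proceed as follows:

```python
# Hypothetical names for the Saber 11 subject-score columns.
subjects = ["score_math", "score_reading", "score_science",
            "score_social", "score_english"]

# Blocks of highly correlated variables carry largely redundant information.
print(df[subjects].corr().round(2))

# Collapse the block into a single variable by taking the average,
# then drop the individual subject scores.
df["saber11_avg_score"] = df[subjects].mean(axis=1)
df = df.drop(columns=subjects)
```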
For the future, then, we recommend that you seriously devote part of the project to planning the kind of actions you intend to take in the medium to long term, in order to carefully identify the variables that will best support this process. Doing that, though, is an entire project on its own, so we will not discuss it further here.
What we can highlight again is that even if these data are collected and managed by different institutions and departments, having common rules on how data gathering is performed and how data are recorded will help maximise both the amount of data that you can actually use within each data set and the extra value that putting them together will produce.
Insights from the model
The model, in this case, is what we have called a Bayesian Network or, alternatively, a Belief Network. The reason for the second name is that the network encodes not only information from the data, but also information that is not in the data and lives instead in the mind of the expert who uses it. This is a very important thing to keep in mind.
Traditional statistical methods, such as the correlation matrix described above, allow us to understand whether any two variables are correlated and how strong that correlation is, and regressions can describe a variable in terms of a set of other variables. None of these, though, is as human readable, nor can they support causal reasoning in the same way.
This is the power of Bayesian Networks, but in order to use them we should understand what kinds of questions we can ask and what kinds we cannot. Given the very complex nature of the system we are trying to analyse, and given the poor quality of our data, we cannot rely too much on the numbers being accurate. In principle we could, if our data were much better, but even then we would have an enormous number of hidden variables that would make the problem too complex for an accurate quantitative analysis. What we can do, instead, is a qualitative analysis, which nonetheless still involves performing calculations.
As discussed before, there are two types of questions that can be asked of the model: either the cause is set and the consequences are observed, or the consequence is set and the causes are observed. Every question should fall into one of these two categories. Remember that any answer the model gives you will be based on past data and on your expert input. The model will give you quantitative information, but we believe that, at this stage, it should be considered carefully and used to obtain qualitative insights.
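To make the two types of questions concrete, here is a minimal sketch using the pgmpy library, which is one possible tool for this kind of inference (the prototype we provide may be implemented differently); the variable names, states, and the fitted `model` object are hypothetical.

```python
from pgmpy.inference import VariableElimination

# `model` is assumed to be an already fitted Bayesian network;
# variable names and states below are hypothetical.
infer = VariableElimination(model)

# Type 1: set a (potential) cause and observe the consequences.
# "If family income is high, what do we expect for the Saber 11 score?"
print(infer.query(variables=["saber11_score"],
                  evidence={"family_income": "high"}))

# Type 2: set the consequence and observe the causes.
# "Among students with high Saber 11 scores, how is family income distributed?"
print(infer.query(variables=["family_income"],
                  evidence={"saber11_score": "high"}))
```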
Examples of these questions have been discussed in the methods part. One thing we would like to draw attention to is that the expert should always read these results critically. For example, the model might tell you that to increase the socio-economic status of a student you need to increase the family income. Does this mean that simply giving money to families will improve student performance? The expert, the policymaker, is the judge of whether this makes sense, whether it is feasible, and how to make it happen.
You should look at Bayesian Networks as a source of ideas rather than as a source of solutions. The model is not a generator of policies but a tool that gives you hints about potentially interesting paths to take.
We will analyse a sample question in detail in the Results section, trying to highlight all the pros and cons of this method and what you should be careful of.
How you can build on the model
The model that we provide is just a prototype. We have worked to produce a user interface that will help you interact with it, but at the moment it is not usable without help from someone able to handle data. In fact, there is no substitute for the data cleaning phase: putting the data into a format that the model can understand is the one step that cannot be skipped.
Having established that, and taking into account all the considerations about data quality discussed above, there is potentially little limit to how many variables can be included. The crucial limit, in fact, is human readability. Always remember that this software, at this stage, is not meant to build policies for you but to help you figure out which factors to take into account. To do so, it is extremely important that you can read the network, and for that you have to carefully select the variables you want to observe.
Depending on what you need to do, you might build the network on the whole data set and then only look at a certain number of variables, or you might train the model on a small number of variables from the beginning. In general, the first option is preferred, but if you need a clear and understandable structure, you might prefer the latter.
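As an illustration of the two options, the sketch below uses pgmpy's hill-climbing structure learning on a discretised data frame `df`; the library choice, the column names, and the exact class names are assumptions (they may differ from the prototype and between pgmpy versions).

```python
from pgmpy.estimators import (HillClimbSearch, BicScore,
                              MaximumLikelihoodEstimator)
from pgmpy.models import BayesianNetwork

# Option 1: learn the structure from the whole (discretised) data set,
# then focus on a few variables when reading the resulting graph.
dag_full = HillClimbSearch(df).estimate(scoring_method=BicScore(df))

# Option 2: restrict the data to a small set of variables before learning,
# which gives a simpler, more readable structure (hypothetical column names).
selected = ["saber11_avg_score", "family_income",
            "parents_education", "school_is_rural"]
dag_small = HillClimbSearch(df[selected]).estimate(
    scoring_method=BicScore(df[selected]))

# Fit the conditional probability tables on the chosen structure.
model = BayesianNetwork(dag_small.edges())
model.fit(df[selected], estimator=MaximumLikelihoodEstimator)
```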
The better the data you are able to collect in the future, the more accurate the quantitative results of the model will be. With good quality data, you might even consider trusting the numbers that the model gives you. Unfortunately, at the current stage we have no implementation of uncertainty quantification, which means you will get a single value rather than a confidence interval. This is something that should be explored in the future, but it would require more resources than those allocated to this stage of the project. Nevertheless, it would definitely increase the scientific profile of the model.
Results
In this section we illustrate the process of answering a specific question, looking at all the steps that lead us to the conclusions. Given the data available to us, we can only choose a very simple question. This is good, because its results will be intuitive, which means we can easily check whether they make sense.
The question we are interested in is the following:
How can we improve the performance of students at school? In other words, on which variables can we act in order to improve students' scores on the Saber 11 test?
The approach we follow is divided into six steps:
- Selection of the variables;
- Building of the network;
- Introducing expert knowledge;
- Training the network;
- Evaluating scenarios;
- Drawing conclusions.
Selection of the variables
Among the data sets available to us, we first chose to consider variables at the student level. The reason is that the biggest data set we had, the one with the results of the Saber 11 exam, was at the student level and also contained a great deal of other information about the students and their families. We then merged this data set with a few data sets at the school level, because that was the resolution available for them. This means that the school-level values are the same for all students at the same school.
The initial number of variables was already much smaller than the total number of variables available to us. The reason is that, to make the data set workable, we excluded all variables with more than 70% missing data, and also all variables whose meaning we could not determine. Having done so, we merged these data sets by the DANE code (i.e. by school) and then kept all the records (the rows) with no missing values. This significantly reduced the dimension of our data set, but we were still left with tens of variables.
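A minimal sketch of this cleaning pipeline, assuming pandas and hypothetical file and column names (the DANE code column is called `dane_code` here), could look as follows:

```python
import pandas as pd

students = pd.read_csv("saber11_students.csv")       # student-level data
school_level = pd.read_csv("school_level_data.csv")  # school-level data

# Drop variables with more than 70% missing data.
students = students.loc[:, students.isna().mean() <= 0.70]
school_level = school_level.loc[:, school_level.isna().mean() <= 0.70]

# Merge by the DANE code (i.e. by school), then keep only complete records.
merged = students.merge(school_level, on="dane_code", how="inner")
merged = merged.dropna()

print(merged.shape)  # rows kept and number of remaining variables
```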
Building a network with such a number of variables is not advisable, both because the network would not be stable and, most importantly, because it would not be easily interpretable by people. For this reason we used the correlation plot shown in Fig. \ref{636017} to identify groups of variables that were too strongly correlated with each other.
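The kind of grouping we read off the correlation plot can also be done programmatically; the sketch below flags pairs of variables whose absolute correlation exceeds a threshold (the 0.8 cut-off is an illustrative choice, not necessarily the one used for the figure).

```python
# Absolute pairwise correlations between the numeric variables.
corr = merged.corr(numeric_only=True).abs()

# Report pairs above an illustrative threshold; such blocks are candidates
# for being reduced to a single representative or to their average.
threshold = 0.8
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if corr.loc[a, b] > threshold:
            print(f"{a} ~ {b}: |r| = {corr.loc[a, b]:.2f}")
```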