Our task is now to identify groups of variables that are too strongly correlated. The variables in such a group add no further information to the model; they only make it more complex. We could retain the same information by keeping just one of them, or their average where appropriate.
For example, we can easily identify the square at the bottom-right corner of Fig. \ref{636017}, which is formed by the scores of the Saber 11 exam in the different subjects. In a network this would be a fully connected sub-network; we obtain the same amount of information by using only their average (i.e. the variable PUNTAJE), which makes the network much easier to read.
Following the diagonal upwards, the next box we find relates to the socio-economic status of the student's family. Two of its variables, coming from two different data sets, measure the same thing: one at the school level (SES) and one at the student level (ESTU_ESTRATO). We take the latter as representative of the socio-economic status and neglect the family income and the other variables in the box.
The other two evident boxes are related to the occupation and education of the family. Here our choice, which is completely arbitrary, has been to keep the variable ESTU_TRABAJA.
Finally, in the top-left corner, we decided to keep all three variables: EXTRAEDAD, the proportion of students in the school who are older than expected for their year; REPROBACION_MEDIA, the proportion of students in a school who are repeating a year; and DESERCION_MEDIA, the proportion of students dropping out of school. We did so because we considered them to be different aspects and did not want to replace one with another.
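The visual search for boxes along the diagonal can also be approximated programmatically. The following is only a minimal sketch, not the procedure used for the figure: it assumes Pearson correlation, an arbitrary threshold of 0.8, and treats two variables as belonging to the same group whenever a chain of strongly correlated pairs links them. The function names and the toy columns (`math`, `reading`, `science`, `school_size`) are hypothetical and do not come from the data set.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlated_groups(columns, threshold=0.8):
    """Group variables linked by |correlation| >= threshold.

    `columns` maps a variable name to its list of values. Groups are the
    connected components of the thresholded correlation graph, found with
    a small union-find structure. Singletons are not reported.
    """
    names = list(columns)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if abs(pearson(columns[a], columns[b])) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return [members for members in groups.values() if len(members) > 1]


# Toy data (invented for illustration): three subject scores that move
# together and one unrelated school-level quantity.
toy = {
    "math": [50, 60, 70, 80, 90],
    "reading": [52, 61, 69, 83, 88],
    "science": [48, 62, 71, 79, 91],
    "school_size": [300, 120, 450, 200, 310],
}
groups = correlated_groups(toy)
print(groups)
```

Using the absolute value of the correlation means that strongly anti-correlated variables end up in the same group as well. The threshold is a judgment call and would need tuning on the real correlation matrix.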
It is interesting to note that some variables are also strongly anti-correlated. In fact, as one would expect, the rate of students repeating the year and the rate of students passing the year are approximately complementary: each is roughly one minus the other.
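To see why such a pair appears as strongly anti-correlated, consider the idealised case in which the relation is exact. Writing $R$ for the repetition rate and $P = 1 - R$ for the pass rate (symbols introduced here only for illustration), we have $\operatorname{Var}(P) = \operatorname{Var}(R)$ and therefore

\[
\operatorname{Cov}(P, R) = \operatorname{Cov}(1 - R, R) = -\operatorname{Var}(R),
\qquad
\rho_{P,R} = \frac{\operatorname{Cov}(P, R)}{\sqrt{\operatorname{Var}(P)\,\operatorname{Var}(R)}} = -1 .
\]

Since the relation holds only approximately in the data, the observed correlation is close to, but not exactly, $-1$.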
In conclusion, the variables that we selected for our model are the following:
- PUNTAJE [student-based]: the score in the Saber 11 exam;
- ESTU_ESTRATO [student-based]: the socio-economic status;
- ESTU_TRABAJA [student-based]:
- TOTAL [school-based]: the number of students in the school;
- DESERCION_MEDIA [school-based]: rate of students dropping out of school;
- EXTRAEDAD [school-based]: rate of over-age students;
- REPROBACION_MEDIA [school-based]: rate of students repeating the year;
- AMBIENTE.ESCOLAR [school-based]: perception of the school environment, averaged over students, teachers and parents.
Building the network
There are many algorithms to build the network. Whichever algorithm we use to obtain the initial structure, we then apply a hill-climbing algorithm to maximise the score of the network, i.e. to reach a more stable solution. Following this procedure with the variables enumerated above, we obtain the network in Fig. \ref{927752}.
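The hill-climbing step can be illustrated with a small, self-contained sketch. It is not the implementation used to produce the figure: the text does not specify the software or the score, so the sketch assumes discrete variables, the BIC score, and moves that add, delete or reverse a single edge. The function names (`local_bic`, `is_acyclic`, `hill_climb`) and the toy data are hypothetical. The optional `start` argument stands in for a structure produced by whichever algorithm was run first.

```python
from collections import Counter
from itertools import permutations
from math import log


def local_bic(data, node, parents):
    """BIC contribution of `node` given its `parents` (discrete data)."""
    n = len(data[node])
    parents = sorted(parents)
    joint = Counter()  # counts of (parent configuration, node value)
    marg = Counter()   # counts of parent configuration
    for i in range(n):
        cfg = tuple(data[p][i] for p in parents)
        joint[(cfg, data[node][i])] += 1
        marg[cfg] += 1
    loglik = sum(c * log(c / marg[cfg]) for (cfg, _), c in joint.items())
    r = len(set(data[node]))          # number of states of the node
    q = 1                             # number of parent configurations
    for p in parents:
        q *= len(set(data[p]))
    return loglik - 0.5 * log(n) * (r - 1) * q


def is_acyclic(parents):
    """Depth-first check that the parent map contains no directed cycle."""
    state = {}  # 1 = being visited, 2 = finished

    def visit(v):
        if state.get(v) == 1:
            return False  # reached a node still on the stack: cycle
        if state.get(v) == 2:
            return True
        state[v] = 1
        ok = all(visit(p) for p in parents[v])
        state[v] = 2
        return ok

    return all(visit(v) for v in parents)


def hill_climb(data, start=None):
    """Greedy search: apply the best single-edge change until none improves BIC."""
    nodes = list(data)
    parents = {v: set(start.get(v, ())) if start else set() for v in nodes}
    score = {v: local_bic(data, v, parents[v]) for v in nodes}
    while True:
        best_gain, best_move = 1e-9, None
        for a, b in permutations(nodes, 2):
            if a in parents[b]:
                # Delete a -> b, or reverse it into b -> a.
                candidates = [{b: parents[b] - {a}},
                              {b: parents[b] - {a}, a: parents[a] | {b}}]
            elif b not in parents[a]:
                candidates = [{b: parents[b] | {a}}]  # add a -> b
            else:
                continue  # edge b -> a is handled when the pair is (b, a)
            for change in candidates:
                if not is_acyclic({**parents, **change}):
                    continue
                gain = sum(local_bic(data, v, ps) - score[v]
                           for v, ps in change.items())
                if gain > best_gain:
                    best_gain, best_move = gain, change
        if best_move is None:
            return parents
        parents.update(best_move)
        for v in best_move:
            score[v] = local_bic(data, v, parents[v])


# Toy data (invented): B copies A, while C is independent of both.
toy_net = {
    "A": [0, 1] * 10,
    "B": [0, 1] * 10,
    "C": [0, 0, 1, 1] * 5,
}
network = hill_climb(toy_net)
edges = {(p, v) for v, ps in network.items() for p in ps}
print(edges)
```

On the toy data the search links A and B and leaves C isolated. The direction of the A and B edge is not determined by the score, since both orientations have the same BIC. Hill climbing only finds a local maximum, which is one reason the result can depend on the starting structure.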