Describe the data we use
Cleaning Data
One of the first steps of any data-driven project is cleaning the data. Cleaning the data means putting them into a format that can be handled by the models and software that will be used. This is usually the most time-consuming part of a project: on average, a data scientist spends around 90% of their time collating and cleaning data, and this percentage depends strongly on how good the quality of the data is to begin with.
In this project, data cleaning has been a long process, given the wide variety of data formats and other problems. First, the Thot group cleaned the data by putting them into a simpler text format; we then had to do a further round of cleaning to put them into a format that is tractable by the model we have built.
This is a point we want to make very clear: data cleaning is an essential part of the workflow, and it is not done within the model. This means that, in order to use the model, the data must be provided as input in the right format.
A .csv file, comma-separated, with headers.
In practice, it should be a single large table in which the first line contains the names of the variables.
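As a minimal sketch of the expected input, assuming pandas is available (the file name cleaned_data.csv and the checks shown are hypothetical placeholders, not the real data set):

```python
import pandas as pd

# Hypothetical file: comma-separated, with the variable names in the first line
# and one record (row) per line after that.
df = pd.read_csv("cleaned_data.csv")

print(df.shape)          # (number of records, number of variables)
print(list(df.columns))  # variable names taken from the header row
print(df.isna().sum())   # missing values per variable, a quick quality check
```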
Issues of data quality
The main issues with the data we received are described here. We report them because data quality is a very important factor in any project involving data, and because knowing what was wrong with the data should inform the choices you will make in the future regarding data selection and preparation. As explained above, having good data significantly reduces the time needed to clean it. We also want to remind you that the model cannot be used unless the data are provided in the correct format.
Data integrity
Data integrity is very important, because it is crucial in deciding whether or not a particular data set can be used. Several problems can preclude us from using a data set:
- There might be missing data. In general, missing data is not a big problem, because there are techniques to deal with it, but only up to a certain point. Many of the data sets we received had variables with a significant proportion of missing values, sometimes in the order of 50-70%. This typically happens when no planning was done before data collection started: from a certain year onwards, new variables appear that were not recorded before. If we want to use these variables, we can only use the records (or rows) for which all the other variables are also available. Since this can happen for many different variables at once, the amount of data we can actually use shrinks quickly (a minimal sketch of this effect is given after this list).
- Names of variables. Sometimes variable names do not make sense or are unintelligible, and the variables cannot be used for that reason. We are trying to build a model that is human-readable and intelligible, so if variable names do not make sense we have to exclude those variables. Sometimes two or more variables have been merged into a single one with a new name, or even containing new, different data. Without information about how this was done and how these choices were made, we again have to exclude such variables.
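A minimal sketch, with hypothetical variable names and proportions, of how requiring complete records shrinks the usable data when several variables each have a large share of missing values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000  # hypothetical number of records

# Three hypothetical variables, each missing (independently) in about half of
# the records, similar to the 50-70% missingness seen in some data sets.
df = pd.DataFrame({
    f"var_{i}": np.where(rng.random(n) < 0.5, rng.random(n), np.nan)
    for i in range(3)
})

complete = df.dropna()  # keep only records where all three variables are present
print(len(df), "records in total")
print(len(complete), "complete records left (roughly 0.5 ** 3, i.e. about 12.5%)")
```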
Data coherence
These problems are much more common, but they are still significant and they can be reduced. The focus here is on how different data sets can be put together in order to discover relationships that the individual data sets could not possibly show.
Data sets are not coherent across different files for the following reasons:
- Data formats. Data may have been stored in different formats, such as Excel spreadsheets, CSV files and text files, and may use features specific to the particular software that was used to produce them;
- Time. Some data sets cover one period of time and some cover another. If the periods overlap this is good, but often the overlap is very small, just one or a few years. This generally means a very large reduction in the number of records (or rows) we can use, limiting the power of the model or making it impossible to train.
- Granularity. Some of the data sets we were given contain information at the student level, some at the school level, and some at the comuna level. Data at the school level can be merged with data at the student level, provided we know which school each student attended. This merge is feasible, even though all the students of a given school will then share the same values for the school-level variables, because of this difference in granularity. The same applies to comunas. Merging across levels can reveal relationships among variables that no single data set contained, but it also loses accuracy, and this trade-off is open to discussion. Our recommendation is to go from the student level to the school level, but not much further; otherwise correlations could be very strong for purely artificial reasons.
- Key variables. Following from the previous point, to merge two different data sets we need a key variable, i.e. a variable whose values allow us to link the two data sets. For example, to merge a student-level data set with a school-level one, the key variable is the school. The name of the school is not suitable because of the way the data are recorded: the same school is very likely to be written differently in two data sets, with typos or abbreviations. A better key variable is the DANE code. During the cleaning process we found that there are different kinds of DANE codes, that some data sets contain one kind and others another, and that merging these data sets took considerable time. In that case the process was merely difficult; in other cases it is not even possible. For example, with two student-level data sets, each will in general contain only a sample of the students, and even if they contain all of them, they usually lack an identifier, a key variable, that links the data about the same student across the two files. To make that merge possible, a unique identifier would have had to be used consistently when those data sets were collated (a minimal sketch of a key-variable merge is given after this list).
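A minimal sketch of a key-variable merge, assuming pandas is available; the DANE codes, column names and values below are hypothetical placeholders, not real records:

```python
import pandas as pd

# Hypothetical student-level records: one row per student, keyed by the
# DANE code of the school the student attended.
students = pd.DataFrame({
    "dane_code": ["105001000001", "105001000001", "205001000002"],
    "student_score": [250, 310, 280],
})

# Hypothetical school-level records: one row per school.
schools = pd.DataFrame({
    "dane_code": ["105001000001", "205001000002"],
    "school_budget": [120_000, 95_000],
})

# Left join on the key variable: every student keeps their own row, and all
# students of the same school inherit the same school-level values.
merged = students.merge(schools, on="dane_code", how="left")
print(merged)
```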
The need for planning
Methods
Provide a conceptual understanding of Bayesian networks. Don't go into a detailed explanation of the mathematics, since this isn't their area of expertise, and it's not required for us to explain the mathematics, which they could find in a good textbook. Rather, give a sufficient explanation that they can understand easily.
Provide an explanation of why a BN could be an effective method for analysing and formulating education and innovation policy, such as being able to identify interventions, etc.
Provide an explanation of how our model works, describing the variables and the relationships between them, etc.
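As a very small sketch of what a Bayesian network looks like in code, assuming a recent version of the pgmpy library; the variables, structure and probabilities below are hypothetical placeholders and are not taken from our model:

```python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Hypothetical two-node network: school resources -> exam score.
model = BayesianNetwork([("resources", "score")])

# Hypothetical conditional probability tables (states 0 = low, 1 = high).
cpd_resources = TabularCPD("resources", 2, [[0.6], [0.4]])
cpd_score = TabularCPD(
    "score", 2,
    [[0.7, 0.4],   # P(score = low  | resources = low / high)
     [0.3, 0.6]],  # P(score = high | resources = low / high)
    evidence=["resources"], evidence_card=[2],
)
model.add_cpds(cpd_resources, cpd_score)
model.check_model()

# Ask the network a question: how likely is a high score if resources are high?
inference = VariableElimination(model)
print(inference.query(["score"], evidence={"resources": 1}))
```

The point of the sketch is only that a BN is a set of variables (nodes), arrows encoding direct dependence, and conditional probability tables, and that once built it can be queried with evidence; this is the kind of question-asking discussed in the Discussion notes below.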
Discussion
Provide an explanation of what insights can and cannot be derived from the data we have been given, explaining that the insights are limited by the quality of the data, the number of variables, etc.
Explain what type of questions can be asked of a BN model and highlight the difference between this approach and more traditional methods of statistical analysis; a conceptual understanding, so that they understand how to apply a BN, highlighting that a BN is used because of the questions we have about a phenomenon, etc.
Explain how policymakers should formulate research questions when using a BN approach, etc.
Explain how they should collect data in the future.
Explain how they can build on the model we have built.
Results
Main question: "What should we do to improve the score on the saber11 test?" Pictures and discussion.
Conclusion
Strong focus on:
- Potential;
- The importance of collecting the right data, depending on the question.