<Jianghao Zhu, jz2575, jz2575>
This project continues a Hack Day project held at the Center for Urban Science & Progress (CUSP) on November 19, 2016. The available county-level election exit poll data were analyzed against several factors, for example education level, income, racial distribution, and population density. A multivariable linear model and a Random Forest model were built to assess each factor’s contribution to the election results. In the multivariable linear model, foreign percentage, bachelor degree percentage, and racial diversity are the influential factors. We define racial diversity as the percentage of nonwhite voters. In the Random Forest model, income, bachelor degree percentage, racial diversity, foreign percentage, and population density have very similar importance levels for the election result.
The idea for this project came from the 2016 pre-election predictions of most major newspapers and media companies, such as FOX and CNN, which turned out to be far off. Hence, starting with a team and continuing on my own, my goal is to dive into the 2016 election exit poll data and find out which factors play important roles in determining election results. Due to the popularity of the topic, numerous analyses were published before and after the election. In this paper, I explain what I discovered through data exploration and convey my understanding of these findings. More importantly, the goal is to gain real experience of the whole process of conducting scientific research and presenting its outcomes.
The data my team identified were prepared by Professor Stanislav Sobolevsky at CUSP and uploaded to Maisha Lopa’s GitHub repository. Lopa is a student in Sobolevsky’s Applied Data Science class, and permission to use the data was granted by Professor Sobolevsky. The data contain county-level election exit polls and demographic information, such as education, race, and income, for 37 states. When we started the project, only the data for these 37 states were available. It would be better to have data from all 50 states, but I think 37 states are sufficient for this project because, historically, election results most likely depend on the swing states. Therefore, 37 states should be a good representation of the whole country. Sometimes this is the trade-off we have to make between data and cost. As data scientists, we should learn to extract useful information from limited data instead of spending too many resources on collecting large amounts of data, which can be very expensive. And since we want the research to be reproducible, data can always be added afterwards when necessary.
Regarding data wrangling and processing, specific columns were renamed for clarity. Values were estimated and transformed into different forms for exploration, for example by summing columns, taking square roots, and normalizing. Because of the way exit polls are conducted, the dataset contains errors and bias, so an analysis of exit poll data cannot be a 100% accurate representation of actual voting patterns and results. With that in mind, let us take a look at the data to get a sense of what they contain. The first column, Fips, stands for Federal Information Processing Standards and holds the code assigned to each county. It is followed by the columns trump and clinton, which hold the corresponding numbers of votes. Area_name and state_abbreviation hold the county name and the state abbreviation. After that, the meanings of the columns become unclear. Therefore, depending on the factors to be assessed, the corresponding columns were renamed to descriptive names, as in the figure below, where “Population” is now clearly labeled.
Similarly, we converted the columns for gender, income, race, education level, and so on, since we want to find the importance of these factors to the election results. In addition, the proportions of electoral votes for different factors were estimated based on the population above age 18.
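The transformations described above can be sketched as follows. This is only a minimal illustration, not the project's code: the column names (population, under_18_pct, income) and all values are hypothetical, and min-max scaling stands in for whichever normalization was actually used.

```python
import math

# Hypothetical county records; names and values are illustrative only.
counties = [
    {"fips": 1, "population": 50_000, "under_18_pct": 25.0, "income": 40_000},
    {"fips": 2, "population": 200_000, "under_18_pct": 20.0, "income": 55_000},
    {"fips": 3, "population": 10_000, "under_18_pct": 30.0, "income": 35_000},
]


def voting_age_population(row):
    """Estimate the population above age 18 from the under-18 percentage."""
    return row["population"] * (1.0 - row["under_18_pct"] / 100.0)


def min_max_normalize(values):
    """Rescale values to the [0, 1] range (one possible normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Estimated adult population per county, and its sum over all counties.
adults = [voting_age_population(r) for r in counties]
total_adults = sum(adults)

# Square-root transform of a skewed count column.
sqrt_population = [math.sqrt(r["population"]) for r in counties]

# Normalized income, so factors on different scales become comparable.
income_scaled = min_max_normalize([r["income"] for r in counties])
```

Putting factors on a common scale in this way makes it easier to compare them directly when they are later fed into a model.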
A multivariable linear model and the Random Forest algorithm were built and applied throughout the analysis. The multivariable linear model is a simple and effective choice, since we want to explore the relationships between different factors and the exit poll results of the 2016 election. First, a single-factor model was built to assess each factor on its own. Next, a multivariable linear model including all chosen factors was built to see their overall contribution. Then the factors were added one by one to see the effect of each. In addition, normalized data were analyzed in the same manner. The residuals were plotted to check for patterns, since the residuals should be randomly distributed if the model fits well.
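The modeling procedure above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's code: the factor names, the coefficients, and the use of NumPy's least-squares solver are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
factors = ["bachelor", "foreign", "diversity"]  # hypothetical factor names

# Synthetic stand-in for county data: factors in [0, 1], target = vote share.
X = rng.uniform(0.0, 1.0, size=(n, len(factors)))
true_coef = np.array([-0.4, -0.2, -0.3])  # assumed, for illustration only
y = 0.8 + X @ true_coef + rng.normal(0.0, 0.05, size=n)


def fit_ols(X, y):
    """Fit y = b0 + X b by ordinary least squares.

    Returns the coefficients (intercept first), the residuals, and R^2.
    """
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    r2 = 1.0 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)
    return coef, residuals, r2


# Step 1: one model per single factor.
single_r2 = {name: fit_ols(X[:, [i]], y)[2] for i, name in enumerate(factors)}

# Step 2: add factors one by one and track how R^2 changes.
stepwise_r2 = [fit_ols(X[:, : k + 1], y)[2] for k in range(len(factors))]

# Step 3: full model; its residuals are what would be plotted for diagnostics.
coef, residuals, full_r2 = fit_ols(X, y)
```

Plotting the residuals against the fitted values (for example as a scatter plot) then shows whether any structure remains that the linear model has not captured.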
For the random forest model, the data for all factors were used to train a random forest, which calculates an importance level for each factor. Different forms of the data were assessed to explore differences and consistency in the feature importances generated by the random forest algorithm. The outcomes were plotted to assess the importance levels visually.
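As a rough sketch of how such importance levels can be obtained, the example below trains scikit-learn's RandomForestRegressor on synthetic data and reads its feature_importances_ attribute. The choice of scikit-learn, the factor names, and the data are assumptions for illustration; the original analysis may have used a different implementation or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 400
names = ["income", "bachelor", "diversity", "noise"]  # hypothetical factors

# Synthetic data: the target depends strongly on "bachelor", weakly on
# "diversity", and not at all on "income" or "noise".
X = rng.uniform(0.0, 1.0, size=(n, len(names)))
y = 0.6 - 0.5 * X[:, 1] - 0.1 * X[:, 2] + rng.normal(0.0, 0.02, size=n)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances: non-negative and summing to 1 across factors.
importances = dict(zip(names, forest.feature_importances_))
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)

for name, score in ranked:
    print(f"{name:10s} {score:.3f}")
```

Because these importances are normalized to sum to 1, they describe each factor's share relative to the others in the model, so the values change when factors are added or removed.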
Based on the exploratory analysis, it is not easy to conclude which factors determine the election results. The multivariable linear model and the random forest model do not give consistent results.
In Figure 1, in areas of high racial diversity, Trump’s vote share converges at around 0.5. In contrast, in areas of low racial diversity, Trump’s winning percentages are denser than his losing percentages.
I wondered why the importance levels of the five factors in the Random Forest model are almost even, so I checked the importance levels of all factors. The results are shown in Figure 3 below.
The four most important factors turned out to be private employment percent change, mean travel time to work, percentage living in the same house for one year, and retail sales per capita. A follow-up investigation of these factors could be interesting future work.