U.S. Election 2016 Exit Poll Data Analysis and Exploration

<Jianghao Zhu, jz2575, jz2575>


This project continues a Hack Day project held at the Center for Urban Science & Progress (CUSP) on November 19, 2016. Available county-level election exit poll data were analyzed against several factors, including education level, income, racial distribution, and density. A multivariable linear model and a Random Forest model were built to assess each factor’s contribution to the election results. In the multivariable linear model, foreign percentage, bachelor’s degree, and racial diversity have explanatory power. We define racial diversity as the percentage of nonwhite voters. In the Random Forest model, income, bachelor’s degree, racial diversity, foreign percentage, and population density have very similar importance for the election result.


The idea for this project came from the 2016 pre-election predictions of most major newspapers and media companies, such as FOX and CNN, which turned out to be far off. Hence, starting with a team and continuing on my own, my goal is to dive into the Election 2016 exit poll data and find out which factors play important roles in determining election results. Due to the popularity of the topic, numerous analyses were published before and after the election. In this paper, I explain what I discovered through data exploration and convey my understanding of these findings. More importantly, the goal is to gain real experience of the whole process of conducting scientific research and presenting its outcomes.


The data my team identified was prepared by Professor Stanislav Sobolevsky at CUSP and uploaded to Maisha Lopa’s GitHub repository. Lopa is a student in Sobolevsky’s Applied Data Science class, and Professor Sobolevsky granted permission to use the data. The data contains county-level election exit polls and demographic information, such as education, race, and income, for 37 states, which is what was available when we started the project. Data from all 50 states would be preferable, but I think 37 states are sufficient for this project because, historically, election results most likely depend on the swing states. Therefore, 37 states should be a good representation of the whole country. Sometimes this is the trade-off we have to make between data and cost. Data scientists should learn to extract useful information from limited data instead of spending too many resources on collecting large amounts of it, since data can be very expensive. And because we want the research to be reproducible, data can always be added later when necessary.

Regarding data wrangling and processing, specific columns were renamed for clarity. Values were estimated and transformed into different forms for exploration, for example by summing, taking square roots, and normalizing. Because of the way the exit poll was conducted, the dataset contains errors and bias, so an analysis of exit poll data is not a 100% accurate representation of actual voting patterns and results. With that in mind, let’s take a look at the data to get a sense of what it contains. In the first column, Fips stands for Federal Information Processing Standards, which defines the codes assigned to individual counties. It is followed by the columns trump and clinton, which hold the corresponding numbers of votes. Area_name and state_abbreviation give the county name and the state abbreviation. After that, the meanings of the column names become unclear. Therefore, depending on the factors to be assessed, the corresponding columns were renamed to clear expressions, as in the figure below. For example, “Population” is now clearly labeled.
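
The wrangling steps described above (renaming coded columns, summing, taking square roots, and normalizing) can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the project. The coded names POP_CODE and INCOME_CODE, the total_votes column, the choice of min-max scaling for normalization, and all sample values are assumptions made up for the example.

```python
import math

# Hypothetical mapping from coded column names to readable ones.
# POP_CODE and INCOME_CODE are placeholders, not the dataset's real codes.
RENAME = {"POP_CODE": "Population", "INCOME_CODE": "Income"}


def rename_columns(rows, mapping):
    """Return new rows whose keys are renamed according to mapping."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]


def add_total_votes(rows):
    """Sum the two candidates' votes into a new total_votes column."""
    for row in rows:
        row["total_votes"] = row["trump"] + row["clinton"]
    return rows


def add_sqrt(rows, column):
    """Add a square-root version of a column to compress large values."""
    for row in rows:
        row["sqrt_" + column] = math.sqrt(row[column])
    return rows


def add_min_max(rows, column):
    """Add a min-max normalized version of a column, scaled to [0, 1]."""
    values = [row[column] for row in rows]
    lo, hi = min(values), max(values)
    span = hi - lo
    for row in rows:
        # Guard against a constant column, where the span is zero.
        row["norm_" + column] = (row[column] - lo) / span if span else 0.0
    return rows


# Made-up sample rows; the fips codes and counts are not real data.
raw_rows = [
    {"fips": 1, "trump": 400, "clinton": 100, "POP_CODE": 10000, "INCOME_CODE": 40000},
    {"fips": 2, "trump": 300, "clinton": 300, "POP_CODE": 40000, "INCOME_CODE": 55000},
    {"fips": 3, "trump": 100, "clinton": 500, "POP_CODE": 90000, "INCOME_CODE": 70000},
]

rows = rename_columns(raw_rows, RENAME)
rows = add_total_votes(rows)
rows = add_sqrt(rows, "Population")
rows = add_min_max(rows, "Population")

for row in rows:
    print(row["fips"], row["total_votes"], row["sqrt_Population"], row["norm_Population"])
```

Each transformation is stored in a new column rather than overwriting the original, so the raw values remain available for comparison during exploration.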