ABSTRACT (SHORT SUMMARY OF THE IDEA, THE ANALYSIS, THE RESULT)
We asked the question, "Is there a relationship between the age of Citi Bike riders and the duration of their trips?" Our alternative hypothesis is that the two are negatively correlated; in other words, as age goes up, trip duration goes down. The null hypothesis is that age has either no correlation or a positive correlation with trip duration. We used a significance level of 0.05.
H0: as age increases, trip duration does not change or increases.

DATA (DESCRIBE THE DATA YOU USED AND VERY BRIEFLY WHAT DATA WRANGLING WAS NEEDED)
We pulled Citi Bike data from two months: January 2016 and July 2015. We removed all columns except for year of birth and trip duration. We created a variable for age by subtracting the year of birth from the year of the data (2016 and 2015, respectively). At the suggestion of a reviewer (and in order to run the Pearson test), we dropped NaN data from both variables and reduced the larger one to match the size of the smaller (by random elimination). The same reviewer suggested we remove outliers. We did not do this, but it would be a good idea, particularly for the age variable (potentially removing values above 80 or 100).

ANALYSIS (DESCRIBE THE ANALYSIS YOU DID, WHICH TESTS YOU USED, WHY?)
First we created a scatter plot of the data. For correlation analysis, we chose a Pearson test, since we have just two variables. One of our reviewers suggested we run a multiple regression with other variables, but we chose to stick with just the question of age and trip duration. Another reviewer suggested we use an OLS test, but we felt correlation was appropriate for two variables. OLS might have been a good choice if we had additional variables.

RESULT (DESCRIPTION OF THE RESULT AND CONCLUSION FROM THE ANALYSIS)
The January 2016 data showed a correlation coefficient of 0.20 and a p-value of 0.0.
This tells us that age and trip duration are slightly positively correlated, and a p-value below 0.05 indicates that the correlation is statistically significant. Because the correlation is positive rather than negative, the null hypothesis stands so far. We ran the test again on the second data set, July 2015, and found similar results: a correlation coefficient of 0.144 and a p-value of 0.0. In this sample the relationship between the two variables is slightly weaker, but the conclusion is the same: age and trip duration are positively correlated. The null hypothesis cannot be rejected.
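The reasoning above can be sketched in a few lines of Python. This is only an illustration, not the code we ran: the birth years and durations are made-up toy values, the helper names `pearson_r` and `one_sided_p_negative` are hypothetical, and the p-value uses a large-sample normal approximation rather than the exact test a statistics library would apply. It shows why a positive coefficient cannot reject a null hypothesis whose alternative is a negative correlation.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def one_sided_p_negative(r, n):
    """Approximate p-value for the alternative 'correlation < 0'.

    Assumption: n is large enough that the test statistic
    t = r * sqrt((n - 2) / (1 - r^2)) is roughly standard normal.
    """
    if abs(r) >= 1:
        return 0.0 if r < 0 else 1.0
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return NormalDist().cdf(t)


# Toy values for illustration only; these are NOT Citi Bike data.
birth_years = [1990, 1985, 1970, 1960, 1995, 1980, 1975, 1965]
durations = [600, 720, 900, 1100, 540, 800, 850, 1000]  # seconds

data_year = 2016
ages = [data_year - by for by in birth_years]  # age = data year - birth year

r = pearson_r(ages, durations)
p = one_sided_p_negative(r, len(ages))

alpha = 0.05
reject_null = p < alpha
print(f"r = {r:.3f}, one-sided p = {p:.3f}, reject H0: {reject_null}")
```

With a positive coefficient the one-sided p-value is above 0.5, so the null hypothesis stands no matter how large the sample is.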
GitHub: adriandahlin
NetID: akd361
Problem Description: Which individual votes cast in US presidential elections carry the most influence over the outcome? Much has been made of the disparity between the number of voters per electoral vote in small, rural states versus highly populated states. I will take this topic one step further and weight the value of a vote based on each state's likelihood of determining the electoral outcome.
Data: I will use data from the US Census Bureau to measure the number of eligible voters and actual voters per state. Each state's number of electoral votes is easily calculated (number of House representatives plus two senators). These two data sets allow for the creation of a simple variable: voters per electoral vote per state. Then I will look at election results going back at least as far as 2000 and see which states have varied the most between parties. The states that varied more will be given more weight in the final calculation of value per vote per state.
Analysis: The tools used will be primarily Python, plus perhaps ArcGIS for an additional visualization.
References: Work of the electoral college team from CUSP Hackathon day here. Many news stories exploring this topic right now. My state senator friend in Massachusetts, who has proposed that Massachusetts reject the electoral college. He could be an outlet for this analysis once it's done.
Deliverable: A rating of each state based on the power of a vote in that state (plus the Nebraska and Maine split). A choropleth map of the US showing the variance in voter power.
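A rough sketch of how the rating might be computed in Python follows. Everything specific here is an assumption made for illustration: the state names and figures are placeholders rather than Census or election data, "variation between parties" is measured as the number of times the winning party flipped between consecutive elections, and the final score (weight divided by voters per electoral vote) is just one possible way to combine the two quantities.

```python
def electoral_votes(house_seats):
    """Electoral votes for a state: House seats plus two senators."""
    return house_seats + 2


def party_flips(winners):
    """Count changes of winning party between consecutive elections."""
    return sum(1 for prev, cur in zip(winners, winners[1:]) if prev != cur)


def vote_power(voters, house_seats, winners):
    """Hypothetical score: swing weight per voter-per-electoral-vote.

    Assumption: weight = 1 + number of party flips, so a state that
    never flips still gets a nonzero score.
    """
    voters_per_ev = voters / electoral_votes(house_seats)
    weight = 1 + party_flips(winners)
    return weight / voters_per_ev


# Placeholder inputs for illustration only; not real states or data.
states = {
    "State A": {"voters": 300_000, "house_seats": 1,
                "winners": ["R", "R", "R", "R"]},
    "State B": {"voters": 9_000_000, "house_seats": 27,
                "winners": ["R", "R", "D", "D"]},
    "State C": {"voters": 12_000_000, "house_seats": 38,
                "winners": ["D", "D", "D", "D"]},
}

ratings = {name: vote_power(**info) for name, info in states.items()}
ranking = sorted(ratings, key=ratings.get, reverse=True)

for name in ranking:
    print(f"{name}: {ratings[name]:.2e}")
```

The resulting `ratings` dictionary is the kind of per-state value that could then be joined to state boundaries to draw the map.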