This study analyzes the volume of Citi Bike trips in New York City during March 2016 and aims to determine whether the average number of trips on weekdays is greater than the average number of trips on weekend days. A higher weekday average would be consistent with Citi Bike subscribers primarily using the service to commute to and from work. Conducting a difference-in-means t-test at a .05 significance level, I am able to reject the null hypothesis that Citi Bike ridership on weekdays is less than or equal to ridership on weekends.

To conduct this study, I acquired and analyzed Citi Bike data. The original dataset included several columns that were irrelevant to this project, so I removed everything except the columns containing trip start time and trip duration. (Initially, I did not remove all of the unneeded columns, but decided to do so given Ian Wright's feedback on my notebook.) I created a new 'date' column in the Pandas dataframe in a datetime format, which allows the data to be read and grouped by date and time. With that, I generated a final 'weekday' column, which assigns each date a categorical code for its day of the week: 0 = Monday through 6 = Sunday. I then summed the number of rides by weekday and plotted the seven counts. Finally, I reset the dataframe index to the date and time in order to count all rides that occurred on each date, and I reapplied the weekday assignment to associate a weekday with each date. These daily counts were then separated into two lists: one holds the daily count for each weekday date and the other holds the daily count for each weekend date.
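A minimal sketch of the processing steps described above, not the notebook's actual code. The column names 'starttime' and 'tripduration' and the handful of sample rows are assumptions made for illustration; the real analysis used the full March 2016 file.

```python
import pandas as pd

# Hypothetical stand-in for the trip data. The column names and the rows
# below are assumptions for illustration, not values from the real dataset.
trips = pd.DataFrame({
    "starttime": [
        "2016-03-04 08:15:00",  # Friday
        "2016-03-04 17:40:00",  # Friday
        "2016-03-05 11:05:00",  # Saturday
        "2016-03-06 14:30:00",  # Sunday
        "2016-03-07 09:00:00",  # Monday
        "2016-03-07 09:10:00",  # Monday
        "2016-03-07 18:20:00",  # Monday
    ],
    "tripduration": [600, 720, 1500, 1800, 540, 480, 660],
})

# Parse the start time into a datetime 'date' column.
trips["date"] = pd.to_datetime(trips["starttime"])

# Weekday code for every ride: 0 = Monday through 6 = Sunday.
trips["weekday"] = trips["date"].dt.dayofweek

# Total rides per day of the week (the seven counts that get plotted).
rides_by_weekday = trips.groupby("weekday").size()

# Rides per calendar date; normalize() drops the time of day.
daily = trips.groupby(trips["date"].dt.normalize()).size()

# Split the daily counts into weekday dates and weekend dates.
day_code = daily.index.dayofweek
weekday_counts = daily[day_code < 5].tolist()
weekend_counts = daily[day_code >= 5].tolist()
```

With these sample rows, `weekday_counts` holds one count per weekday date and `weekend_counts` holds one count per weekend date, which is the shape the t-test needs.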

With the daily weekday and daily weekend trip counts in lists, I was able to run a t-test, which determines whether the difference in the means of two samples is statistically significant. Following recommendations from both Ian and Kevin Han, I chose the t-test because I do not know the population parameters (mean and standard deviation).

Testing my data at a significance level of .05 allowed me to reject the null hypothesis, which states that Citi Bike ridership on weekdays is less than or equal to that on weekends. I can reject the null hypothesis because the p-value returned by the t-test is .01, which is below .05.
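A one-sided, two-sample t-test of this kind could be run roughly as sketched below. The daily counts are made-up placeholders rather than the study's data, and the choice of Welch's variant (`equal_var=False`) is an assumption, since the write-up does not say which form of the test was used. The `alternative` argument of `scipy.stats.ttest_ind` requires SciPy 1.6 or later.

```python
from scipy import stats

# Placeholder daily trip counts, invented for illustration only.
weekday_counts = [30500, 28900, 31200, 27600, 29800]
weekend_counts = [21000, 24500, 19800, 23300]

alpha = 0.05  # significance level used in the study

# Null hypothesis: mean weekday ridership <= mean weekend ridership.
# alternative="greater" tests whether the first sample's mean is larger.
# equal_var=False selects Welch's t-test (an assumption, see above).
t_stat, p_value = stats.ttest_ind(
    weekday_counts,
    weekend_counts,
    equal_var=False,
    alternative="greater",
)

# Reject the null hypothesis when the p-value falls below alpha.
reject_null = p_value < alpha
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject null: {reject_null}")
```

Using the one-sided alternative matches the direction of the stated null hypothesis; a two-sided test would instead ask only whether the two means differ.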

While I can reject the null hypothesis, this experiment leaves room for further analysis. For example, I test only one month of data here, and while one month contains many rides, a more robust analysis, and perhaps a predictive one, could be built with more data. Because Citi Bike is an outdoor mode of transportation, I anticipate that ridership volume in March may not be representative of annual ridership patterns.

Link to GitHub repository: https://github.com/kristikorsberg/PUI2016_kk3374/tree/master/HW6_kk3374