PUI2016 Citibike Project Summary

In this project we examined whether, on average, older individuals (over 40 years old) use Citibikes for shorter trips than younger individuals (under 40 years old). Using trip duration and rider age data for February 2015, we ran a Z-test for proportions grouped by trip duration, yielding a test statistic of 26.09. We therefore reject the null hypothesis and conclude that older individuals are more likely than younger individuals to take short trips.

We used the zip file for February 2015 from the Citibike website. The data can be downloaded here: https://s3.amazonaws.com/tripdata/201502-citibike-tripdata.zip

The corresponding .csv file contains, for each trip during the month, the start and stop station locations, trip duration, customer type, and the rider's birth year and gender. We computed age by subtracting each subscriber's birth year from 2015, the year the data was collected, and dropped all columns except trip duration and age. We split the pandas dataframe into riders over and under 40 to create two samples. We then divided trip duration into two categories: short trips (less than 10 minutes) and long trips (more than 10 minutes) (see Figure 1). Finally, we normalized the distributions (see Figure 2).
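The preparation steps described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names `tripduration` and `birth year`, the assumption that duration is recorded in seconds, the helper names, and the handling of the boundary cases (age exactly 40 counted as younger, exactly 10 minutes counted as short) are all assumptions. The tiny sample is made up for illustration only.

```python
import pandas as pd


def prepare_trips(df, year=2015, age_cut=40, short_cut_s=600):
    """Keep only age and trip duration, then label each trip.

    Assumes `birth year` holds the rider's birth year and `tripduration`
    is in seconds (so 600 s = 10 minutes). Both names are assumptions.
    """
    out = pd.DataFrame({
        "age": year - df["birth year"],
        "tripduration": df["tripduration"],
    }).dropna()  # rows without a birth year cannot be assigned an age
    # Assumption: age exactly 40 goes to the younger group.
    out["group"] = (out["age"] > age_cut).map({True: "older", False: "younger"})
    # Assumption: a trip of exactly 10 minutes counts as short.
    out["trip"] = (out["tripduration"] > short_cut_s).map(
        {True: "long", False: "short"}
    )
    return out


def trip_counts(trips):
    """Count trips per (age group, trip length) cell."""
    return pd.crosstab(trips["group"], trips["trip"])


# Made-up rows, only to show the shape of the result.
sample = pd.DataFrame({
    "tripduration": [300, 900, 450, 1200, 700],
    "birth year": [1960, 1990, 1985, 1970, None],
})
trips = prepare_trips(sample)
counts = trip_counts(trips)
# Row-normalize to get the share of long and short trips in each group.
shares = counts.div(counts.sum(axis=1), axis=0)
```

With the real file, `sample` would be replaced by the dataframe read from the downloaded .csv, and `shares` would correspond to the normalized distribution mentioned above.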

NULL HYPOTHESIS:

The proportion of older riders who take long trips is the same as or higher than the proportion of younger riders who take long trips.

ALTERNATIVE HYPOTHESIS:

The proportion of older riders who take long trips is lower than the proportion of younger riders who take long trips.

Equation of null hypothesis and alternative hypothesis:

H0 : Senior_LongTrip / Senior_All >= Young_LongTrip / Young_All

HA : Senior_LongTrip / Senior_All < Young_LongTrip / Young_All

As we were testing whether the proportion of seniors who take long trips is lower than the corresponding proportion in the younger group, this was a test of proportions, for which either a t-test or a Z-test could be used. With over 200,000 entries across both groups, the t and Z statistics are approximately equal.
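For reference, a common form of the two-proportion Z statistic is written out below. The symbols are introduced here for illustration: x_S and n_S are the number of long trips and the total number of trips for seniors, and x_Y and n_Y are the same counts for the younger group. The pooled standard error and the sign convention (positive z favors the alternative hypothesis) are assumptions, since the summary does not state which form was used.

```latex
\[
\hat{p}_S = \frac{x_S}{n_S}, \qquad
\hat{p}_Y = \frac{x_Y}{n_Y}, \qquad
\hat{p} = \frac{x_S + x_Y}{n_S + n_Y}
\]
\[
z = \frac{\hat{p}_Y - \hat{p}_S}
         {\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_S} + \frac{1}{n_Y}\right)}}
\]
```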

Our Z-test yielded a test statistic of approximately 26.09. Since we ran a one-tailed test, the p-value is approximately 0, leading us to reject the null hypothesis at the 0.05 significance level. Because our surveyed month of February 2015 fell in the middle of winter, our samples may suffer from self-selection bias, favoring the more experienced and "hardcore" bikers. Thus our sample may not be representative of the Citibike rider population at large.
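The way a one-tailed p-value follows from a Z statistic can be sketched as below. This is an illustrative sketch under stated assumptions, not the notebook's code: the function name is hypothetical, a pooled standard error is assumed, the statistic is signed so that a positive value favors the alternative hypothesis, and the counts in the example call are made up rather than taken from the Citibike data.

```python
import math


def two_proportion_ztest(long_young, n_young, long_old, n_old):
    """One-tailed two-proportion Z-test (pooled standard error assumed).

    H0: share of long trips among older riders >= share among younger riders.
    HA: share of long trips among older riders <  share among younger riders.
    Returns (z, p_value), where p_value = P(Z >= z) under H0.
    """
    p_young = long_young / n_young
    p_old = long_old / n_old
    pooled = (long_young + long_old) / (n_young + n_old)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_young + 1 / n_old))
    z = (p_young - p_old) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Made-up counts, for illustration only.
z, p = two_proportion_ztest(long_young=600, n_young=1000,
                            long_old=500, n_old=1000)
reject_null = p < 0.05
```

Under this convention a large positive statistic gives an upper-tail probability close to 0, which is what leads to rejecting the null hypothesis.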

https://github.com/xy1002/PUI2016_xy1002/blob/master/HW6_xy1002/HW6_xy1002_2.ipynb
