Citi Bike Riders Exploratory Analysis
The Citi Bike project in New York City was launched in 2013, and usage has grown throughout the city since then. In this experiment, we explore the age distribution of male and female riders to examine how the Citi Bike system can be better designed to serve user needs and attract more customers. Data cleaning and manipulation were implemented in Python. A null hypothesis significance test was conducted using a one-tailed z-test. The result of this reproducible analysis indicates that middle-aged men are less likely to ride a bike than middle-aged women.
Trip histories of Citi Bike riders were obtained through the NYC Citi Bike System Data portal. Data from March 2015 was used because its size was appropriate for this exploratory analysis. Processing the data required identifying the relevant variables, filtering for the appropriate records, and then calculating the correct gender and age groups. Python was used to run the analysis, trimming the dataset to just the following fields: tripduration, usertype, birth year and gender. The data set was then filtered for the “Subscriber” user type to remove data from one-time users, identified as “Customer”. This step ensures that the analysis focuses on more frequent users. To calculate the ratio of riders by age and gender, the male and female groups were each further grouped by birth year. The age of 45 was selected for this analysis, which placed the birth year cutoff at 1971. Those born before 1971 were counted and labeled as above 45 for both genders. Below are the Python scripts:
Remove fields not required.
df.drop(['starttime', 'stoptime', 'start station id', 'start station name', 'start station latitude', 'start station longitude', 'end station id', 'end station name', 'end station latitude', 'end station longitude', 'bikeid'], axis=1, inplace=True)
Filter data set to remove one-time users.
df1 = df[df.usertype != 'Customer']
Identify and count number of male riders above 45
df_m_above45 = (df1['birth year'][df1['gender'] == 1]).groupby(df1['birth year'] < 1971.0).count()
Identify and count number of female riders above 45
df_w_above45 = (df1['birth year'][df1['gender'] == 2]).groupby(df1['birth year'] < 1971.0).count()
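The filtering and counting steps above can be sketched end to end on a small table. This is a minimal illustration, not the authors' script: the rows are invented, the column names follow the text, and the `age_group` labels and the `counts` and `share_above_45` names are assumptions introduced here for readability.

```python
import pandas as pd

# Toy records using the column names from the text; all values are invented
df = pd.DataFrame({
    "usertype": ["Subscriber", "Customer", "Subscriber", "Subscriber", "Subscriber"],
    "birth year": [1965.0, None, 1980.0, 1990.0, 1960.0],
    "gender": [1, 0, 1, 2, 2],  # 1 = male, 2 = female, as in the scripts above
})

# Remove one-time users, as in the text
subscribers = df[df["usertype"] != "Customer"].copy()

# Label each rider by the 1971 birth-year cutoff used in the text
subscribers["age_group"] = (
    subscribers["birth year"].lt(1971.0)
    .map({True: "above 45", False: "45 or below"})
)

# Rider counts per gender and age group, one row per gender
counts = (
    subscribers.groupby(["gender", "age_group"])
    .size()
    .unstack(fill_value=0)
)

# Share of riders above 45 within each gender
share_above_45 = counts["above 45"] / counts.sum(axis=1)
print(counts)
print(share_above_45)
```

Building one table for both genders keeps the two counts side by side, which makes the later comparison of proportions easier to check by eye.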
The null hypothesis was set as “the ratio of men above age 45 to men aged 45 or below riding a bike is the same as or greater than the ratio of women above age 45 to women aged 45 or below riding a bike.” The alternative hypothesis is that “the ratio of men above age 45 to men aged 45 or below riding a bike is smaller than the ratio of women above age 45 to women aged 45 or below riding a bike.” The significance level was set at alpha = 0.05.
Because the samples are large (more than 30 observations each), we use a z-test. For a one-tailed test at alpha = 0.05, the critical value of z is approximately +1.65, so H0 is rejected if z > +1.65. In our z-test, z = 18.61, so we reject the null hypothesis.
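The text does not give the formula behind the z statistic, so the following is only a sketch of one common choice: a pooled two-proportion z-test. It assumes the comparison is between the share of riders above 45 in each gender group. The function name `two_proportion_z` and the counts passed to it are hypothetical and are not the March 2015 results.

```python
import math

def two_proportion_z(x_a, n_a, x_b, n_b):
    """z statistic for the difference p_a - p_b under H0: p_a == p_b (pooled)."""
    p_a = x_a / n_a
    p_b = x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical counts for illustration only:
# group a has 300 of 1000 riders above 45, group b has 200 of 1000
z = two_proportion_z(x_a=300, n_a=1000, x_b=200, n_b=1000)

# One-tailed decision at alpha = 0.05 (critical value about 1.645)
reject_h0 = z > 1.645
print(round(z, 3), reject_h0)
```

Which group plays the role of `a` determines the sign of z, so the order should match the direction of the alternative hypothesis being tested.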
Citibike Analysis - Study on customers/subscribers ratio change during weekends
Citi Bike Project #By Laura Gladson, Santiago Carrillo, Alexey Kalinin, Nonie Mathur, Maisha Lopa.
PUI2016 _citibike_ Summary
PUI2016 Citibike Project Summary
In this project we looked at whether, on average, older individuals (over 40 years old) used Citi Bikes for shorter trips than younger individuals (under 40 years old). Using information on trip duration and rider age for the month of February 2015, we ran a z-test for the proportions grouped by trip duration, yielding a statistic of 26.09. In this case we reject the null hypothesis and conclude that older individuals are more willing to take shorter trips.
We used the zip file from the Citi Bike website corresponding to the month of February 2015. The data can be downloaded here: https://s3.amazonaws.com/tripdata/201502-citibike-tripdata.zip
The corresponding .csv file contained entries for the start and stop station locations, trip duration, customer type, birth year and gender of each rider during the month. We extracted age by subtracting the birth year of subscribers from the then-current year, 2015, and dropped all fields except trip duration and age. We split the pandas dataframe into riders over and under 40 to create two samples. Then we divided trip duration into two categories: short trips (less than 10 minutes) and long trips (more than 10 minutes) (see Figure 1). Finally, we normalized the distribution (see Figure 2).
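The preparation steps described above can be sketched on a toy table. This is an illustration under stated assumptions, not the project's code: the rows are invented, `tripduration` is assumed to be in seconds, riders aged exactly 40 and trips of exactly 10 minutes are placed in the "40 or under" and "long" groups by assumption, and "normalized" is read as the share of short trips within each age group.

```python
import pandas as pd

# Invented rows; column names follow the Citi Bike .csv fields named in the text
trips = pd.DataFrame({
    "tripduration": [300, 900, 450, 500],  # assumed to be seconds
    "birth year": [1960, 1990, 1985, 1970],
})

# Age relative to 2015, the year of the data
trips["age"] = 2015 - trips["birth year"]

# Two samples: over 40 versus 40 or under (boundary handling is an assumption)
trips["age_group"] = (trips["age"] > 40).map(
    {True: "over 40", False: "40 or under"}
)

# Short trips are under 10 minutes, i.e. under 600 seconds
trips["short_trip"] = trips["tripduration"] < 600

# Share of short trips within each age group (one reading of "normalized")
short_share = trips.groupby("age_group")["short_trip"].mean()
print(short_share)
```

Comparing shares rather than raw counts removes the effect of the two age groups having different sizes.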