Citi Bike Riders Exploratory Analysis

Abstract

The Citi Bike project in New York City was launched in 2013 and has since seen growth in usage throughout the city. In this analysis, we explore the age distribution of male and female riders to examine how the Citi Bike system can be better designed to serve user needs and attract more customers. Data cleaning and manipulation were implemented in Python so that the workflow is reproducible. A null hypothesis significance test was conducted using a one-tailed z-test. The null hypothesis was rejected at the 0.05 significance level: the result indicates that, among subscribers, the ratio of riders above age 45 to riders aged 45 or below is smaller for men than for women.

Data

Trip histories of Citi Bike riders were obtained through the NYC Citi Bike System Data portal. Data from March 2015 was used because its size was appropriate for this exploratory analysis. Processing the data required identifying relevant variables, filtering for the appropriate records, and then calculating the correct gender and age groups. Python was used to run the analysis, trimming the dataset to just the following fields: tripduration, usertype, birth year and gender. The data set was then filtered for the “Subscriber” user type to remove data from one-time users identified as “Customer”. This step ensures that the analysis focuses on more frequent users. In order to calculate the ratio of riders by age and gender, the male and female groups were each further split by birth year. The age of 45 was selected as the threshold for this analysis, which placed the birth year cutoff at 1971. Those born before 1971 were counted and labeled as above 45 for both genders. Below are the Python scripts:

# Remove fields not required for the analysis.
df.drop(['starttime', 'stoptime', 'start station id', 'start station name', 'start station latitude', 'start station longitude', 'end station id', 'end station name', 'end station latitude', 'end station longitude', 'bikeid'], axis=1, inplace=True)
# Filter the data set to remove one-time users ("Customer"), keeping subscribers.
df1 = df[df.usertype != 'Customer']
# Count male riders (gender == 1), grouped by whether they were born before 1971 (True = above 45).
df_m_above45 = (df1['birth year'][df1['gender'] == 1]).groupby(df1['birth year'] < 1971.0).count()
# Count female riders (gender == 2), grouped by whether they were born before 1971 (True = above 45).
df_w_above45 = (df1['birth year'][df1['gender'] == 2]).groupby(df1['birth year'] < 1971.0).count()
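The counting step can also be packaged into a small helper. The following is a minimal sketch on a made-up five-row table, not the script used for the analysis. The function name `count_by_age_group`, the `cutoff_year` parameter and the toy rows are assumptions for illustration; the column names and the gender codes (1 = male, 2 = female) follow the scripts above. The sketch keeps rows where usertype equals "Subscriber", which is assumed to be equivalent to excluding "Customer".

```python
import pandas as pd


def count_by_age_group(df, gender_code, cutoff_year=1971):
    """Count subscribers of one gender born before the cutoff year
    ("older") and in or after it ("younger")."""
    subset = df[(df["usertype"] == "Subscriber") & (df["gender"] == gender_code)]
    # Rows without a birth year cannot be assigned to an age group.
    subset = subset.dropna(subset=["birth year"])
    older = int((subset["birth year"] < cutoff_year).sum())
    younger = int((subset["birth year"] >= cutoff_year).sum())
    return {"older": older, "younger": younger}


# Toy data (hypothetical rows, not taken from the Citi Bike dataset).
toy = pd.DataFrame({
    "usertype": ["Subscriber", "Subscriber", "Customer", "Subscriber", "Subscriber"],
    "birth year": [1960.0, 1985.0, 1950.0, 1990.0, 1965.0],
    "gender": [1, 1, 1, 2, 2],
})

male_counts = count_by_age_group(toy, gender_code=1)
female_counts = count_by_age_group(toy, gender_code=2)
print(male_counts, female_counts)
```

Returning both group sizes under explicit keys avoids having to remember which boolean label of a groupby result corresponds to the older group.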

Analysis

The null hypothesis was set as “the ratio of men above age 45 to men aged 45 or below riding a bike is the same as or greater than the ratio of women above age 45 to women aged 45 or below riding a bike.” The alternative hypothesis is that “the ratio of men above age 45 to men aged 45 or below riding a bike is smaller than the ratio of women above age 45 to women aged 45 or below riding a bike.” The significance level was set at alpha = 0.05.

Because both samples are much larger than 30, we used a z-test. For a one-tailed test at alpha = 0.05, the critical value of z is +1.65, so H0 is rejected if z > +1.65. Our test gave z = 18.61, so we reject the null hypothesis.
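The text does not state which z formula was used. One common choice for this kind of comparison is the pooled two-proportion z-test, sketched below as an assumption rather than as the calculation behind the reported z value. Comparing the ratio of older to younger riders is equivalent to comparing the proportion p of older riders, because the ratio p / (1 - p) increases with p. The function name `two_proportion_z` and all counts are hypothetical.

```python
import math


def two_proportion_z(older_w, total_w, older_m, total_m):
    """Pooled two-proportion z statistic for the one-tailed alternative
    'the proportion of older riders is smaller for men than for women'.
    A positive z supports that alternative."""
    p_w = older_w / total_w
    p_m = older_m / total_m
    # Pooled proportion under the boundary case of H0 (equal proportions).
    p_pool = (older_w + older_m) / (total_w + total_m)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_w + 1 / total_m))
    return (p_w - p_m) / se


# Hypothetical counts for illustration only (not the March 2015 data).
z = two_proportion_z(older_w=400, total_w=1000, older_m=300, total_m=1000)

# One-tailed p-value from the standard normal upper tail.
p_value = 0.5 * math.erfc(z / math.sqrt(2))

reject_h0 = z > 1.645  # critical value for a one-tailed test at alpha = 0.05
print(round(z, 2), p_value, reject_h0)
```

With real data, the four arguments would be the subscriber counts for each gender and age group produced in the Data section.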

Figure 1: Distribution of Citi Bike riders by age in March 2015, absolute counts

Figure 2: Distribution of Citi Bike riders by age in March 2015, normalized