The Citi Bike project in New York City was launched
in 2013 and has since seen growth in usage throughout the city. In this experiment, we explore the
age distribution of male and female riders to examine how the Citi Bike delivery
system can be better designed to serve user needs and attract more
customers. Data cleaning and
manipulation were implemented in Python.
A null hypothesis significance test was conducted using a one-tailed
z-test. With the experiment designed to be rigorous and
reproducible, the results indicate that middle-aged men are less likely to
ride a bike than middle-aged women.
Trip histories of Citi Bike riders were obtained through the NYC Citi Bike
System Data portal. Data from March
was used because its size was appropriate for this
exploratory analysis. The task of processing the data required
identifying relevant variables, filtering for the appropriate records and then
calculating the correct gender and age groups. Python was used to run the
analysis, trimming the dataset to just the following fields: tripduration,
usertype, birth year and gender. The data set was then filtered for the “Subscriber”
user type to remove data from one-time users identified as “Customer”.
This step ensures that the analysis focuses on more frequent
users. In order to calculate the
ratio of riders by age and gender, male and female groupings were each further
grouped by birth year. The age of 45 was
selected as the threshold for this analysis, which placed the birth year cutoff at 1971. Those born before 1971 were counted and labeled
as above 45 for both genders. Below
are the Python scripts:
# Drop fields not required.
df.drop(['starttime', 'stoptime', 'start station id', 'start station name', 'start station latitude', 'start station longitude', 'end station id', 'end station name', 'end station latitude', 'end station longitude', 'bikeid'], axis=1, inplace=True)
# Filter data set to remove one-time users.
df1 = df[df.usertype != 'Customer']
# Group and count number of male riders above 45
df_m_above45 = (df1['birth year'][df1['gender'] == 1]).groupby(df1['birth year'] < 1971.0).count()
# Group and count number of female riders above 45
df_w_above45 = (df1['birth year'][df1['gender'] == 2]).groupby(df1['birth year'] < 1971.0).count()
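The counts can then be turned into the ratio of riders above 45 to riders aged 45 or below for each gender. The following is a minimal sketch on a few hypothetical sample rows, not the actual Citi Bike data; the helper name age_ratio and the sample values are assumptions made for illustration, while the gender codes (1 for male, 2 for female) and the 1971 cutoff follow the scripts in this section.

```python
import pandas as pd

# Hypothetical sample rows for illustration only; the real analysis
# reads the Citi Bike trip history file instead.
sample = pd.DataFrame({
    'usertype': ['Subscriber', 'Subscriber', 'Subscriber', 'Customer',
                 'Subscriber', 'Subscriber', 'Subscriber'],
    'birth year': [1960.0, 1985.0, 1990.0, 1980.0, 1965.0, 1968.0, 1992.0],
    'gender': [1, 1, 1, 1, 2, 2, 2],
})


def age_ratio(frame, gender_code, cutoff=1971.0):
    """Return (n_above, n_below, ratio) for one gender.

    n_above counts riders born before the cutoff (above 45),
    n_below counts riders born in or after the cutoff year.
    """
    keep = (frame['usertype'] != 'Customer') & (frame['gender'] == gender_code)
    years = frame.loc[keep, 'birth year'].dropna()
    n_above = int((years < cutoff).sum())
    n_below = int((years >= cutoff).sum())
    ratio = n_above / n_below if n_below else float('nan')
    return n_above, n_below, ratio


men = age_ratio(sample, 1)    # (1, 2, 0.5)
women = age_ratio(sample, 2)  # (2, 1, 2.0)
```

Rows with a missing birth year are dropped before counting, which matches the behavior of count() in the scripts.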
The null hypothesis was set as “the ratio of men above age 45 to men aged 45 or below
riding a bike is the same as or greater than the ratio of women above age 45 to
women aged 45 or below riding a bike.” The alternative hypothesis is that “the
ratio of men above age 45 to men aged 45 or below riding a bike is
smaller than the ratio of women above age 45 to women aged 45 or below riding a bike.”
Furthermore, the significance level was set at alpha=0.05.
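A one-tailed test of this kind can be sketched as a left-tailed two-proportion z-test. The code below is an illustration under stated assumptions, not necessarily the exact computation used in this analysis: it compares the share of riders above 45 within each gender (the ratio above-to-below grows whenever this share grows, so the direction of the comparison is the same), it uses a pooled standard error, and the function name and the counts in the usage lines are hypothetical.

```python
import math


def one_tailed_z_test(above_m, total_m, above_w, total_w, alpha=0.05):
    """Left-tailed two-proportion z-test.

    H0: p_men >= p_women, Ha: p_men < p_women, where p is the share
    of riders above 45 within each gender.
    Returns (z, p_value, reject_null).
    """
    p_m = above_m / total_m
    p_w = above_w / total_w
    # Pooled proportion, used for the standard error under H0.
    p_pool = (above_m + above_w) / (total_m + total_w)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_m + 1 / total_w))
    z = (p_m - p_w) / se
    # Left-tail p-value: standard normal CDF evaluated at z.
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))
    return z, p_value, p_value < alpha


# Hypothetical counts, chosen only to exercise the function.
z, p_value, reject = one_tailed_z_test(200, 1000, 250, 1000)
print(f"z = {z:.3f}, p = {p_value:.4f}, reject H0: {reject}")
```

The null hypothesis is rejected only when the p-value falls below alpha, that is, when the z statistic lies far enough in the left tail.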