Citi Bike Riders Exploratory Analysis
The Citi Bike project in New York City was launched in 2013 and has since seen growth in usage throughout the city. In this experiment, we explore the age distribution of male and female bikers to examine how the Citi Bike system can be better designed to serve user needs and attract more customers. Data cleaning and manipulation were implemented in Python. A null hypothesis significance test was conducted with a one-tailed z-test. The experiment was designed to be statistically rigorous and reproducible, and the result indicates that the ratio of riders above 45 to riders aged 45 or below is smaller for men than for women; that is, middle-aged men are relatively less likely to ride a bike than middle-aged women.
Trip histories of Citi Bike riders were obtained through the NYC Citi Bike System Data portal. Data from March 2015 were used because that month offered an appropriate size for this exploratory analysis. Processing the data required identifying the relevant variables, filtering for the appropriate records, and then calculating the gender and age groups. Python was used to run the analysis, trimming the dataset to the following fields: tripduration, usertype, birth year and gender. The data set was then filtered for the “Subscriber” user type to remove data from one-time users identified as “Customer”. This step ensured that the analysis focused on more frequent users. To calculate the ratio of riders by age and gender, the male and female groups were each further grouped by birth year. The age of 45 was selected for this analysis, which placed the birth year cutoff at 1971. Those born before 1971 were counted and labeled as above 45 for both genders. Below are the Python scripts:
Remove fields not required.
df.drop(['starttime', 'stoptime', 'start station id', 'start station name', 'start station latitude', 'start station longitude', 'end station id', 'end station name', 'end station latitude', 'end station longitude', 'bikeid'], axis=1, inplace=True)
Filter data set to remove one-time users.
df1 = df[df.usertype != 'Customer']
Identify and count number of male riders above 45
df_m_above45 = (df1['birth year'][df1['gender'] == 1]).groupby(df1['birth year'] < 1971.0).count()
Identify and count number of female riders above 45
df_w_above45 = (df1['birth year'][df1['gender'] == 2]).groupby(df1['birth year'] < 1971.0).count()
The null hypothesis was set as “the ratio of men above age 45 to men aged 45 or below riding a bike is the same as or greater than the ratio of women above age 45 to women aged 45 or below riding a bike.” The alternative hypothesis is that “the ratio of men above age 45 to men aged 45 or below riding a bike is smaller than the ratio of women above age 45 to women aged 45 or below riding a bike”. The significance level was set at alpha = 0.05.
Because each sample is much larger than 30, we used a z-test. At alpha = 0.05, the one-tailed critical value of z is +1.65, so H0 is rejected if z > +1.65. Our z-test gave z = 18.61, so we reject the null hypothesis.
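The decision rule above can be sketched as a pooled two-proportion z-test. The source gives neither the exact formula nor the underlying counts, so the function name two_proportion_z, the pooled-variance form, the use of the share of riders above 45 within each gender as the proportion, and the counts below are all assumptions made for illustration.

```python
import math

def two_proportion_z(above_a, total_a, above_b, total_b):
    """Pooled z statistic for H1: proportion in group A > proportion in group B.

    Hypothetical helper; the source does not state which formula was used.
    """
    p_a = above_a / total_a
    p_b = above_b / total_b
    # Pooled proportion under H0 (both groups share one proportion)
    pooled = (above_a + above_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / se

# Hypothetical counts, NOT the March 2015 figures
women_above45, women_total = 300, 1000
men_above45, men_total = 250, 1000

# Women are group A so that a positive z supports the alternative hypothesis
z = two_proportion_z(women_above45, women_total, men_above45, men_total)
reject_h0 = z > 1.645  # one-tailed critical value at alpha = 0.05
print(round(z, 2), reject_h0)
```

Placing women as the first group is a design choice that keeps the rejection region on the positive side, matching the rule "reject H0 if z > +1.65".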
Citibike Analysis - Study on customers/subscribers ratio change during weekends
Citi Bike Project, by Laura Gladson, Santiago Carrillo, Alexey Kalinin, Nonie Mathur, Maisha Lopa.
New York City keeps records of Citi Bike services, including demographics of users and statistics on bike use. Here, we performed a statistical analysis to determine the relationship between biker age and trip duration, testing the alternative hypothesis that Citi Bike users under age 35 are more likely to bike for longer durations than the average user. Through a simple Z-test, we were able to reject our null hypothesis, concluding that the trip duration of bikers under 35 is significantly greater than that of the average user.
For this project, our research question was:
Are Citi Bike users under 35 years of age significantly more likely to bike for longer durations compared to the average user?
For this analysis, we formed the following hypotheses:
Null Hypothesis: The mean trip duration of Citi Bike users under the age of 35 is the same or less than the mean trip duration of an average user, significance level = 0.05.
Alternative Hypothesis: The mean trip duration of Citi Bike users under the age of 35 is more than the mean trip duration of an average user, significance level = 0.05
To test these hypotheses, we chose Citi Bike data from December 2015. The information downloaded from the data facility contained more variables than needed to compare age and trip duration. Additionally, it was not organized in columns, which could lead to errors, such as interpreting variable names as observations. As such, we first organized our data into columns, then dropped 13 of the 15 categories. We were left with “birth year” as our independent variable, and “trip duration” in seconds as our dependent variable. After plotting both variables, we identified several outliers of impossibly old users, i.e., those born before 1910.
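A minimal pandas sketch of this preparation step follows. The small synthetic table stands in for the December 2015 file, and two choices are assumptions rather than the authors' stated method: dropping rows with a missing birth year, and excluding (not merely flagging) riders born before 1910.

```python
import pandas as pd

# Tiny synthetic stand-in for the December 2015 trip file (values are made up)
raw = pd.DataFrame({
    "tripduration": [300, 1200, 650, 90, 2400],  # seconds
    "birth year": [1985.0, 1899.0, 1990.0, None, 1962.0],
    "usertype": ["Subscriber"] * 5,
    "gender": [1, 2, 1, 0, 2],
})

# Keep only the independent and dependent variables
trips = raw[["birth year", "tripduration"]]

# Assumption: rows without a birth year cannot be assigned an age, so drop them
trips = trips.dropna(subset=["birth year"])

# Assumption: exclude the impossibly old riders (born before 1910)
trips = trips[trips["birth year"] >= 1910]

print(trips)
```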
Plot 1 shows a scatter plot of the raw data, plotting birth year against trip duration. Histogram 1 shows the raw distribution of age across the data set. In Histogram 3, the distributions of trip duration for the entire data set (in blue) and for the group of those 35 and under (in green) are compared.
Our peer reviewers suggested we perform a Z-test to compare the users under 35 with the total population. This test is possible because we know the population parameters (since the dataset itself represents the entire population of Citi Bike users). Given the size of our sample, and the fact that we know the mean and standard deviation for both groups, we chose to test our hypothesis with a Z-test. As such, we first had to calculate the mean and standard deviation of trip duration for the two groups. These values were plugged into the Z-test formula.
From our Z-test, we obtained a Z-statistic of 17.79. From the Z-table, this gave an area of over 0.9998. Thus, our p-value is below (1 - 0.9998), or 0.0002, meaning that if the null hypothesis were true, a difference at least this large would be observed with a probability of less than 0.02%. This p-value is much smaller than our alpha level of 0.05, so we can reject our null hypothesis and conclude that trip durations of Citi Bike users are longer for those under age 35 compared with the average user.
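The calculation described above can be sketched as follows. The source does not spell out its formula, so this assumes the standard one-sample form z = (sample mean - population mean) / (population standard deviation / sqrt(n)); the helper names and the summary statistics are hypothetical and are not the December 2015 values.

```python
import math

def one_sample_z(sample_mean, pop_mean, pop_std, n):
    """Z statistic for a subgroup mean against known population parameters."""
    return (sample_mean - pop_mean) / (pop_std / math.sqrt(n))

def upper_tail_p(z):
    """One-tailed p-value P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical summary statistics in seconds, NOT the reported data
z = one_sample_z(sample_mean=900.0, pop_mean=850.0, pop_std=500.0, n=400)
p = upper_tail_p(z)
print(round(z, 2), round(p, 4))

# The reported statistic of 17.79 lies far beyond the range of a printed
# Z-table, so its upper-tail probability is well below the 0.0002 floor.
print(upper_tail_p(17.79) < 0.0002)
```

Computing the tail probability directly avoids the resolution limit of a printed table, which stops near an area of 0.9998.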
PUI2016 _citibike_ Summary
PUI2016 Citibike Project Summary
In this project we looked at whether, on average, older individuals (over 40 years old) used Citibikes for shorter trips than younger individuals (less than 40 years old). Using information on trip duration and rider age for the month of February 2015, we ran a Z-test for the proportions grouped by trip duration, yielding a statistic of 26.09. We therefore reject the null hypothesis and conclude that older individuals are more likely to take shorter trips.
We used the zip file from the Citibike website corresponding to the month of February 2015. The data can be downloaded here: https://s3.amazonaws.com/tripdata/201502-citibike-tripdata.zip
The corresponding .csv file contained entries for the start and stop station location, trip duration, customer type, birth year and gender of each rider during the month. We extracted age by subtracting the birth year of subscribers from the then current year, 2015, and dropped all columns except trip duration and age. We split the pandas dataframe into those over and under 40 to create 2 samples. Then we divided the trip duration into two categories: short trips (less than 10 mins) and long trips (more than 10 mins) (see Figure 1). Finally, we normalized the distribution (see Figure 2).
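The steps above can be sketched in pandas as shown below. The data are synthetic, and two boundary choices are assumptions because the text leaves them open: riders aged exactly 40 are placed in the younger group, and trips of exactly 10 minutes (600 seconds) are counted as long. Normalization is taken here to mean converting counts to within-group shares.

```python
import pandas as pd

# Synthetic stand-in for the February 2015 file (values are made up)
trips = pd.DataFrame({
    "tripduration": [300, 1500, 450, 700, 200, 900],  # seconds
    "birth year": [1990, 1985, 1960, 1955, 1970, 1995],
})

# Age relative to the year of the data
trips["age"] = 2015 - trips["birth year"]

# Assumption: exactly 40 falls in the younger group
trips["group"] = trips["age"].apply(
    lambda a: "over 40" if a > 40 else "40 or under"
)

# Assumption: exactly 600 seconds counts as a long trip
trips["trip"] = trips["tripduration"].apply(
    lambda s: "short" if s < 600 else "long"
)

# Share of short and long trips within each age group (each row sums to 1)
shares = pd.crosstab(trips["group"], trips["trip"], normalize="index")
print(shares)
```

Normalizing within each age group makes the two samples comparable even when one group has far more riders than the other.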
minority propensity to coauthor scientific publications
When we discuss under-represented minorities (URMs) in the academic sciences, we often mention the importance of mentors and of providing role models to minorities. However, other than anecdotal evidence, there is no measure of whether having a minority role model actually facilitates the academic path. This work tries to answer the simple question: are minorities more likely to co-author papers within their minority circle?