Data-Driven Inference of CitiBike Data: An Analysis of Trip Distance by Rider Age

Abstract

The objective of this analysis was to draw data-driven inferences from CitiBike trip data in New York City using statistical testing in Python. Using CitiBike data from June 2016, the relationship between rider age and trip distance was explored. Specifically, the ratio of long-distance trips to all trips among young riders was compared to that of all riders. The working assumption was that younger riders typically have more energy and strength, which would translate into the ability to ride farther than riders overall. The analysis did not find a significant difference between young riders and all riders, and therefore the null hypothesis could not be rejected.

Data

The data for this analysis was obtained from the CUSP Data Facility at New York University. The data was subset to only the fields needed to calculate rider trip distance and age: Start Station Latitude, Start Station Longitude, End Station Latitude, End Station Longitude, and Birth Year. Next, geopy was used to calculate the distance in miles between the start and end stations of each trip. The data wrangling process is detailed in the linked IPython notebook.
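The station-to-station distance step can be illustrated with a minimal sketch. The notebook used geopy; the version below swaps in a plain haversine (great-circle) formula so that it runs with only the standard library, and it will differ slightly from geopy's ellipsoidal distance. The function name, the Earth-radius constant, and the station coordinates are assumptions chosen for illustration, not values from the data set.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Hypothetical start and end station coordinates (not taken from the data set).
start = (40.7527, -73.9772)
end = (40.7128, -74.0060)

distance = haversine_miles(*start, *end)
is_long_trip = distance > 3.0  # the three-mile threshold used in this analysis
```

Note that this measures the straight-line distance between stations, not the route actually ridden, so it is a lower bound on the distance covered.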

Analysis

Through preliminary data inspection, the team took an interest in long-distance trips among CitiBike riders. The team first defined the null and alternative hypotheses:

Null Hypothesis:

Long distance trip ratio in millennial riders is less than or equal to long distance trip ratio in all bikers.

$H_0: L_y/A_y - L/A \leq 0$

Alternative Hypothesis:

Long distance trip ratio in millennial riders is greater than long distance trip ratio in all bikers.

$H_a: L_y/A_y - L/A > 0$

Significance level: $\alpha = 0.05$

Young riders were defined as millennials born after 1980, and long distance trips referred to trips greater than three miles from start to end station. In the exploratory phase of the analysis, the team reviewed the distribution of CitiBike ridership by birth year (Figure 1). The team then looked at the data pertaining to only long trips greater than three miles in distance for both age groups (Figure 2). The ratio of long trips for all riders was also explored, as shown in Figures 3 and 4. After the exploratory phase, the team began statistical testing. A Z-test was decided upon to test the hypothesis after peer review, and all riders were defined to be the population while millennial riders were defined as the sample to be tested. The ratios of each subgroup defined in the hypothesis were calculated and then tested.
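The ratio calculation and test described above can be sketched as follows. This is one common form of the test, a one-sided one-sample z-test for a proportion that treats the all-rider ratio as the population value; the notebook's exact formulation may differ. The function names and the example counts are hypothetical, and the symbols L_y, A_y, L, and A are read here as long trips and all trips for young riders and for all riders, respectively.

```python
import math


def long_trip_counts(trips, born_after=1980, long_miles=3.0):
    """Return (L_y, A_y, L, A) from (birth_year, distance_miles) pairs."""
    A = len(trips)
    L = sum(1 for _, dist in trips if dist > long_miles)
    young = [(year, dist) for year, dist in trips if year > born_after]
    A_y = len(young)
    L_y = sum(1 for _, dist in young if dist > long_miles)
    return L_y, A_y, L, A


def one_sided_z_test(L_y, A_y, L, A):
    """z statistic and upper-tail p-value for H_a: L_y/A_y - L/A > 0."""
    p_young = L_y / A_y
    p_all = L / A
    # Standard error of the sample proportion under the population ratio.
    se = math.sqrt(p_all * (1 - p_all) / A_y)
    z = (p_young - p_all) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)
    return z, p_value


# Hypothetical counts for illustration only (not the June 2016 results).
z, p_value = one_sided_z_test(L_y=40, A_y=1000, L=500, A=10000)
reject_null = p_value < 0.05
```

With counts like these, where the young-rider ratio falls below the all-rider ratio, z is negative and the upper-tail p-value is large, so the null hypothesis is not rejected.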

Results

Looking back at the data, although young riders account for a larger percentage of total trips (Figure 1), they have a relatively small long-distance trip ratio (Figure 2). Therefore, when the Z-test was performed, the p-value indicated that the null hypothesis could not be rejected at the 0.05 significance level. The results of this analysis indicate that millennial CitiBike riders do not have a significantly higher ratio of long-distance trips than riders overall, and therefore offer no evidence that trip distance is dictated by rider age.

Link to IPython notebook on GitHub: \url{https://github.com/nmonarizqa/PUI2016_nm2773/tree/master/HW6_nm2773/HW6_Assignment2.ipynb}