Exploring the CItibike trip duration differences between single-time customers and subscribers

Abstract

This report is based on my PUI Assignment 2 on Homework3 about a Citibike analysis with the python tool. The goal is to explore the Citibike trip duration difference between one-time customers and subscribers in terms of the MTP (Mean Trip Duration). The idea is that to prove that the average trip duration of single time customers is more than that of the subscribers, and further concludes that a single time customer would make better maximize the utilization than a subscriber.
 
A hypothesis is established below. As the samples are not equal,  a further Two-sided T-test is implemented, the results support the hypothesis.
Keywords: 
CitiBike Data, Data Wrangling, Null Hypothesis, Alternative Hypothesis, Statistical Significance Level, Two-sided t-test
Hypothesis
 
Null Hypothesis
H0: T(customer) <= T(subscriber)
The mean trip duration of single time customers over a week is less than or equal to the mean trip duration of the subscribers over a week
 
Alternative Hypothesis
H1: T(customer) > T(subscriber)
The mean trip duration of single time customers over a week is more than the mean trip duration of the subscribers over a week.
 
Statistical Significance Level
Significance level: α = 0.05
A significance level alpha(α) is chosen here  to reflect how significant the hypothesis testing will be at the end of the  test.

Data Analysis

The data was collected from the CitiBike_Data_Website for the trip duration from both one-time customer and subscriber. Later, this data was used to clean, organize, select, analyze, plot and visualize. First, the Null and Alternative hypothesis were established with a statistical significance level at 0.05, and then the data was collected, tabulated, cleaned, and reshaped.

The analysis is conducted by applying Pandas and DataFrames to the Python to get the mean trip duration for Customers(one-time user) and Subscribers respectively. The figures are plotted by using Matplotlib accordingly. Meanwhile, as t-test applies for testing the difference between the samples when the variances of two normal distributions are unknown, which fit in the situation,  the distribution of data is subjected to a two-sided t-test.


Figure 1: Average Trip Duration for Different User Type
Figure 2: Average Trip Duration for different user types over a week period