# Abstract

This report is based on my PUI Assignment 2 for Homework 3, a Citibike analysis carried out in Python. The goal is to compare the Citibike trip durations of one-time customers and subscribers in terms of the mean trip duration (MTD). The aim is to test whether the average trip duration of one-time customers is longer than that of subscribers, and from that to argue that a one-time customer makes greater use of a bike per trip than a subscriber does.

The hypotheses are stated below. Because the two samples are not equal, a two-sided t-test is applied, and the results support the alternative hypothesis.

Keywords: CitiBike Data, Data Wrangling, Null Hypothesis, Alternative Hypothesis, Statistical Significance Level, Two-sided t-test

# Hypothesis

## Null Hypothesis

H0: T(customer) <= T(subscriber)

The mean trip duration of one-time customers over a week is less than or equal to the mean trip duration of subscribers over a week.

## Alternative Hypothesis

H1: T(customer) > T(subscriber)

The mean trip duration of one-time customers over a week is greater than the mean trip duration of subscribers over a week.

## Statistical Significance Level

Significance level: α = 0.05

The significance level α is the accepted probability of rejecting H0 when it is in fact true. H0 is rejected at the end of the test only if the p-value falls below α.
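
The hypotheses and the decision rule can be written compactly as follows. The report does not say which form of the t statistic was used, so the unequal-variance (Welch) form is shown here as an assumption. The symbols $\mu$ (population mean), $\bar{T}$ (sample mean trip duration), $s^2$ (sample variance), and $n$ (sample size), with subscripts $c$ for customers and $s$ for subscribers, are notation introduced for this sketch.

$$
H_0:\ \mu_{c} \le \mu_{s}, \qquad H_1:\ \mu_{c} > \mu_{s}
$$

$$
t = \frac{\bar{T}_{c} - \bar{T}_{s}}{\sqrt{\dfrac{s_c^2}{n_c} + \dfrac{s_s^2}{n_s}}}, \qquad \text{reject } H_0 \text{ if } p < \alpha = 0.05
$$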

# Data Analysis

Trip duration data for both one-time customers and subscribers was collected from the CitiBike_Data_Website. First, the null and alternative hypotheses were established with a statistical significance level of 0.05. The data was then collected, tabulated, cleaned, and reshaped before being analyzed, plotted, and visualized.
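
The cleaning and reshaping steps might look roughly like the sketch below. It is not the code used for the assignment: the column names `tripduration`, `starttime`, and `usertype` are assumed from the public Citibike trip files, the helper `load_trips` is hypothetical, and the inline sample rows are made up so that the example runs on its own.

```python
import io

import pandas as pd

# Made-up rows standing in for a Citibike trip CSV (column names are assumed).
SAMPLE_CSV = """tripduration,starttime,usertype
600,2015-01-05 08:00:00,Subscriber
900,2015-01-05 09:30:00,Subscriber
0,2015-01-06 10:00:00,Subscriber
1800,2015-01-10 14:00:00,Customer
2400,2015-01-11 15:00:00,Customer
,2015-01-11 16:00:00,Customer
"""


def load_trips(source):
    """Hypothetical helper: read trips, drop unusable rows, add a weekday column."""
    df = pd.read_csv(source, parse_dates=["starttime"])
    # Drop rows with a missing duration or user type.
    df = df.dropna(subset=["tripduration", "usertype"])
    # Keep only trips with a positive duration (in seconds).
    df = df[df["tripduration"] > 0].copy()
    # Monday = 0 ... Sunday = 6, so weekday >= 5 marks the weekend.
    df["weekday"] = df["starttime"].dt.dayofweek
    return df


trips = load_trips(io.StringIO(SAMPLE_CSV))

# Mean trip duration per user type, and per user type for each day of the week.
mean_by_type = trips.groupby("usertype")["tripduration"].mean()
mean_by_type_and_day = (
    trips.groupby(["weekday", "usertype"])["tripduration"].mean().unstack("usertype")
)
print(mean_by_type)
```

With a real file, `load_trips` would be called with the path to the downloaded CSV instead of the in-memory sample.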

The analysis uses pandas DataFrames in Python to compute the mean trip duration for Customers (one-time users) and Subscribers respectively. The figures are plotted with Matplotlib. A t-test is suited to testing the difference between two sample means when the variances of the two normal distributions are unknown, which is the situation here, so the data is subjected to a two-sided t-test.
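
As an illustration of the test step, the sketch below runs a two-sided t-test with `scipy.stats.ttest_ind` on two made-up samples. Several choices here are assumptions rather than the report's method: the unequal-variance (Welch) option, the function name `compare_mean_durations`, and the sample values. Since H1 is directional, the sketch also derives the one-sided p-value from the two-sided one.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # significance level stated in the report


def compare_mean_durations(customer, subscriber, alpha=ALPHA):
    """Hypothetical helper: test H1 'customer mean > subscriber mean'."""
    # Two-sided Welch t-test (equal_var=False is an assumption of this sketch).
    t_stat, p_two_sided = stats.ttest_ind(customer, subscriber, equal_var=False)
    # Convert to a one-sided p-value for the directional alternative.
    if t_stat > 0:
        p_one_sided = p_two_sided / 2
    else:
        p_one_sided = 1 - p_two_sided / 2
    return {
        "t": float(t_stat),
        "p_two_sided": float(p_two_sided),
        "p_one_sided": float(p_one_sided),
        "reject_h0": bool(p_one_sided < alpha),
    }


# Made-up trip durations in seconds, not taken from the Citibike data.
customer = np.array([1500, 1800, 2100, 2400, 1950, 2250])
subscriber = np.array([600, 750, 900, 820, 700, 680, 790, 860])

result = compare_mean_durations(customer, subscriber)
print(result)
```

Halving the two-sided p-value is valid only when the t statistic has the sign predicted by H1, which is why the helper checks the sign first.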

Figure 1: Average Trip Duration for Different User Type
Figure 2: Average Trip Duration for different user types over a week period
Figure 3: The distribution of the Subscribers' to Customers' Mean Trip Duration during the weekend

# Results/Conclusion

The data analysis and hypothesis testing above support the alternative hypothesis. In the t-test, H0 was rejected at the chosen significance level, so it can be concluded that Customers, who are mostly one-time users, have longer Citibike trip durations than Subscribers, who are mostly repeat users. The visualizations of the trip duration differences between the two groups, on both weekdays and weekends, are also consistent with rejecting H0.
The data therefore supports the claim that the average trip duration of one-time customers is longer than that of subscribers, which in turn suggests that a one-time customer makes greater use of a bike per trip than a subscriber does.
Further research could extract and compare data from different months to explore the reasons behind this difference in trip duration between the two groups.

# References

1. GitHub - Federica Bianco
2. Statistics in a Nutshell, O'Reilly.