Abstract
Exploratory data analysis was performed to determine whether there is a statistically significant difference between the average trip duration of riders older than 30 years and that of riders 30 years old or younger. Using a t-test, we rejected, at the 95% confidence level, the null hypothesis that riders over the age of 30 have the same or shorter average trip durations than riders 30 years or younger.
Data
The data used for this analysis was obtained from CitiBikeNYC.com and includes historical trip information for CitiBike riders in the New York area. This analysis focused on June 2016. The data came as a CSV file and was loaded into a pandas DataFrame for further analysis.
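As a rough illustration of the loading step, the sketch below reads a tiny in-memory stand-in for the monthly CSV. The column names (`tripduration`, `starttime`, `birth year`, `usertype`) and the helper `load_trips` are assumptions made for this example, as is the choice to drop rows with no birth year; the original analysis does not state how such rows were handled.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the monthly trip file.
# Column names are assumed for illustration, not taken from the analysis.
SAMPLE_CSV = io.StringIO(
    "tripduration,starttime,birth year,usertype\n"
    "600,6/1/2016 00:00:18,1985,Subscriber\n"
    "1500,6/1/2016 00:00:20,1990,Subscriber\n"
    "900,6/1/2016 00:00:21,,Customer\n"
)


def load_trips(source):
    """Read trip records and keep only rows with a known birth year.

    `source` may be a file path or any file-like object.
    Dropping rows without a birth year is an assumption of this sketch:
    age cannot be computed for those riders.
    """
    trips = pd.read_csv(source)
    return trips.dropna(subset=["birth year"]).reset_index(drop=True)


trips = load_trips(SAMPLE_CSV)
print(trips.shape)
```

With a real download, a file path would be passed to `load_trips` in place of the in-memory sample.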
Analysis
The first step was to calculate riders' ages from their birth year. Next, the data was split into riders 30 years or younger and riders over 30 years old. Outliers were then removed; these were defined as trips longer than 8 hours and riders over the age of 100. Once the data was split and cleaned, histograms and box plots were created to visually assess the differences between the distributions of the two datasets. Finally, a t-test was used to determine whether the two datasets were statistically different from each other. The t-test was selected because we are comparing the means of two independent samples and the population standard deviation is unknown.
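The steps above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the original code: the column names, the helper names `split_by_age` and `one_sided_ttest`, and the use of Welch's unequal-variance form of the t-test are all assumptions. Outliers are filtered before the split here, which gives the same two groups as splitting first. The `alternative` argument requires SciPy 1.6 or later.

```python
import pandas as pd
from scipy import stats

ANALYSIS_YEAR = 2016          # trips are from June 2016
MAX_TRIP_SECONDS = 8 * 3600   # trips longer than 8 hours count as outliers
MAX_AGE = 100                 # riders older than 100 count as outliers


def split_by_age(trips):
    """Return (older, younger) trip durations after removing outliers.

    Assumes columns named "tripduration" (seconds) and "birth year".
    """
    df = trips.copy()
    df["age"] = ANALYSIS_YEAR - df["birth year"]
    df = df[(df["tripduration"] <= MAX_TRIP_SECONDS) & (df["age"] <= MAX_AGE)]
    older = df.loc[df["age"] > 30, "tripduration"]
    younger = df.loc[df["age"] <= 30, "tripduration"]
    return older, younger


def one_sided_ttest(older, younger):
    """Test H0: mean(older) <= mean(younger) against H1: mean(older) > mean(younger).

    equal_var=False (Welch's t-test) is an assumption of this sketch.
    """
    return stats.ttest_ind(older, younger, equal_var=False, alternative="greater")


# Made-up data: the last two rows are outliers (a 25-hour trip, a 116-year-old rider).
demo_trips = pd.DataFrame({
    "tripduration": [700, 800, 900, 1000, 400, 500, 600, 650, 90000, 750],
    "birth year": [1970, 1975, 1980, 1965, 1990, 1995, 1992, 1998, 1980, 1900],
})

older, younger = split_by_age(demo_trips)
result = one_sided_ttest(older, younger)
print(len(older), len(younger), result.statistic, result.pvalue)
```

The null hypothesis is rejected at the 95% confidence level when `result.pvalue` is below 0.05.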
Result
The t-test returned a p-value below 0.000001, so we reject the null hypothesis that the average trip duration for riders over 30 is the same as or less than that of riders 30 or younger. Although this indicates that the difference between the means is statistically significant, the effect size is only 0.017, which means the actual difference in means is very small.
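The analysis does not say which effect size measure was used. One common choice for comparing two means is Cohen's d, the difference in means divided by the pooled standard deviation. The sketch below assumes that measure; the function name `cohens_d` and the sample values are illustrative only.

```python
import math
import statistics


def cohens_d(sample_a, sample_b):
    """Cohen's d: (mean_a - mean_b) divided by the pooled standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled_sd


# Made-up values: means 14 and 12, both sample variances equal to 4.
d = cohens_d([12, 14, 16], [10, 12, 14])
print(d)
```

By the usual convention, a d of about 0.2 is already considered small, so a value of 0.017 is negligible in practical terms. With a very large number of trips, even a difference this small can produce a very low p-value.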