Table of Contents
1. Introduction
2. Literature Review
3. Data
4. Methodologies
  4.1 Prediction
    4.1.1 Random Forest
      4.1.1.1 Random Forest Regression
      4.1.1.2 Random Forest Feature Importance
    4.1.2 K-nearest Neighbors Regression
    4.1.3 Support Vector Machine
      4.1.3.1 Support Vector Machine Regression
      4.1.3.2 Support Vector Machine Feature Selection
  4.2 Clustering
    4.2.1 K-means Clustering
5. Results
  5.1 Random Forest
    5.1.1 Random Forest Regression
    5.1.2 Random Forest Feature Importance
  5.2 K-nearest Neighbors Regression
  5.3 Support Vector Machine
    5.3.1 Support Vector Machine Regression
    5.3.2 Support Vector Machine Feature Selection
  5.4 K-means Clustering
    5.4.1 Energy Consumption
    5.4.2 Yearly Trends in Energy Consumption (2012 - 2015)
6. Discussion
  6.1 Conclusions
  6.2 Limitations and Future Work
7. Reference
PUI2016 Citibike Project Summary

ABSTRACT: In this project we examined whether, on average, older individuals (over 40 years old) used Citibikes for shorter trips than younger individuals (under 40 years old). Using trip duration and rider age for February 2015, we ran a Z-test for proportions grouped by trip duration, yielding a test statistic of 26.09. We therefore reject the null hypothesis and conclude that older individuals are more likely to take shorter trips.

DATA: We used the zip file on the Citibike website for February 2015. The data can be downloaded here: https://s3.amazonaws.com/tripdata/201502-citibike-tripdata.zip

The corresponding .csv file contains, for each ride during the month, the start and stop station locations, trip duration, customer type, and the rider's birth year and gender. We computed each subscriber's age by subtracting the birth year from 2015, the year of the data, and dropped all fields except trip duration and age. We split the pandas dataframe into riders over and under 40 to create two samples. We then divided trip duration into two categories: short trips (less than 10 mins) and long trips (more than 10 mins) (see Figure 1). Finally, we normalized the distributions (see Figure 2).
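
The preparation steps and the test described above can be sketched roughly as follows. This is a minimal illustration and not necessarily the exact code used in the project. It assumes the column names `tripduration` (in seconds) and `birth year`, a pooled-variance two-proportion z statistic, and that riders aged exactly 40 and trips of exactly 10 minutes fall in the "under 40" and "long" groups respectively. The function names `prepare` and `two_proportion_z` are hypothetical.

```python
import math

import pandas as pd


def prepare(df, year=2015, age_cut=40, short_secs=600):
    """Reduce a trips dataframe to age and short/long trip flags.

    Assumes columns 'tripduration' (seconds) and 'birth year';
    rows with a missing birth year are dropped.
    """
    out = df[["tripduration", "birth year"]].dropna().copy()
    out["age"] = year - out["birth year"]
    out["older"] = out["age"] > age_cut             # assumption: exactly 40 counts as younger
    out["short"] = out["tripduration"] < short_secs  # assumption: exactly 10 min counts as long
    return out[["age", "older", "short"]]


def two_proportion_z(short_a, n_a, short_b, n_b):
    """Pooled two-proportion z statistic for H0: p_a == p_b."""
    p_a = short_a / n_a
    p_b = short_b / n_b
    p_pool = (short_a + short_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se


def z_from_trips(df):
    """Compute the z statistic comparing short-trip shares of older vs younger riders."""
    prepared = prepare(df)
    older = prepared[prepared["older"]]
    younger = prepared[~prepared["older"]]
    return two_proportion_z(
        int(older["short"].sum()), len(older),
        int(younger["short"].sum()), len(younger),
    )
```

With the February 2015 file loaded via `pd.read_csv`, `z_from_trips(df)` would return a statistic comparable to the one reported above, although the exact value depends on the boundary and missing-data choices assumed here. A positive value means the older group has the larger share of short trips.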