Abstract
The goal of this work is to use machine learning techniques to build a model that predicts house prices from a data set of property sales. The data set consists of sales prices in the five boroughs of New York City in the year 2016. The models trained are Random Forest and Gradient Boosting. The Random Forest model gives an out-of-sample \(R^2\) of 0.6701 and the Gradient Boosting model gives an out-of-sample \(R^2\) of 0.5540.

Introduction
Data science is used to extract knowledge from data. Data analytics and machine learning can be applied to historical sales data to understand how the value of a house is determined. What features of a house determine its price? This is one of the questions asked by a buyer or a property assessor. The price of a house depends on the number of rooms, the number of garages, the presence of a swimming pool, the land use area and so on. But the price also depends on the neighborhood and on the sales prices of similar houses. For example, a house in Manhattan near Central Park typically costs more than a comparable house in Brooklyn. Hence the location and the demographic features of a neighborhood also affect the price of a house. Many machine learning techniques have previously been used to predict house prices, including multiple regression by Ordinary Least Squares, CART models and deep learning models. In this work, Random Forest and Gradient Boosting are used to build models that take all of these factors into consideration as features.

Data
House prices for the year 2016 were taken from the NYC Department of Finance. This data set contains information about sales price, land square area, gross area, year built, building category, tax class, zip code and so on. The zip code shapefile was taken from NYC Open Data and contains geometric information about all the zip codes in NYC. Demographic information was taken from American Fact Finder. Tables of the American Community Survey 2016 estimates for the state of New York were used for the analysis. These tables contain mean income level, school enrollment, the number of people with a bachelor's degree or higher and the number of employed people, tabulated at the zip code level.
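Because the demographic tables are tabulated at the zip code level, each sale can be given neighborhood features by looking up its zip code. The following is a minimal sketch of that join, not the authors' code. The function name, the field names (`zip_code`, `sale_price`, `mean_income`, `bachelors_or_higher`) and the sample values are hypothetical and chosen only for illustration.

```python
def attach_demographics(sales, demographics_by_zip):
    """Return sale records extended with zip-level demographic features.

    Sales whose zip code has no demographic record are skipped,
    which corresponds to an inner join on the zip code.
    """
    merged = []
    for sale in sales:
        demo = demographics_by_zip.get(sale["zip_code"])
        if demo is None:
            continue  # no demographic row for this zip code
        merged.append({**sale, **demo})
    return merged


# Made-up records, for illustration only.
sales = [
    {"zip_code": "10023", "sale_price": 1_500_000, "year_built": 1925},
    {"zip_code": "11201", "sale_price": 900_000, "year_built": 1960},
    {"zip_code": "99999", "sale_price": 100_000, "year_built": 2000},
]
demographics_by_zip = {
    "10023": {"mean_income": 180_000, "bachelors_or_higher": 40_000},
    "11201": {"mean_income": 150_000, "bachelors_or_higher": 35_000},
}

features = attach_demographics(sales, demographics_by_zip)
```

The result is one record per sale with both property and neighborhood fields, which is the form a Random Forest or Gradient Boosting model would take as input. With a data-frame library the same step would usually be a single merge on the zip code column.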
Abstract
This study is a statistical analysis of CitiBike open data. Our hypothesis was that young people are more likely to be CitiBike subscribers. To test this hypothesis we applied the Mann-Whitney rank test to subsets of the customer and subscriber data. The p-value for the test was \(1.5788174779838282 \times 10^{-231}\), which is less than the chosen significance level \(\left(\alpha = 0.5\right)\). Hence the null hypothesis was rejected.

Introduction
CitiBike is a bike sharing service which offers its users two service models: customers and subscribers. Subscribers pay a standard monthly fee for unlimited access to CitiBike. Customers pay per ride. Our idea was based on the premise that riders need a minimum fitness level to use CitiBike as a primary means of transportation and hence to get good value from a subscription. This analysis is also of potential marketing significance to CitiBike as a company.

Data
CitiBike trip data for June 2018, taken from the CitiBike website, was used for the statistical analysis. The data set contains trip duration, start time and end time of the journey, start station and end station information, bike id, user type, birth year and gender for all the trips recorded in June. The data was read into a data-frame, which was then reduced to the columns of interest: user type and birth year. Age was calculated from the birth year and null values were dropped. Two separate data-frames were created, one for customers and one for subscribers.
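The preparation steps and the rank test described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code used in the study. The column names `usertype` and `birth year`, the reference year 2018 used to compute age, and both function names are assumptions. The test uses the normal approximation with a tie correction and no continuity correction, so its p-value may differ slightly from that of a library routine such as `scipy.stats.mannwhitneyu`.

```python
import math

REFERENCE_YEAR = 2018  # assumption: age is measured in the year of the trips


def ages_by_user_type(trips):
    """Group rider ages by user type, dropping rows with no birth year."""
    groups = {}
    for trip in trips:
        birth_year = trip.get("birth year")
        if birth_year is None:
            continue  # null birth years are dropped
        groups.setdefault(trip["usertype"], []).append(REFERENCE_YEAR - birth_year)
    return groups


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns the U statistic of sample x and the approximate p-value.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold equal values and share the average of ranks i+1..j.
        avg_rank = (i + 1 + j) / 2
        t = j - i
        tie_term += t**3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all values identical: no evidence of a difference
    z = (u_x - mean_u) / math.sqrt(var_u)
    # erfc keeps precision in the far tail, where 1 - cdf would round to zero.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_value


# Made-up trips, for illustration only.
trips = [
    {"usertype": "Subscriber", "birth year": 1990},
    {"usertype": "Customer", "birth year": None},
    {"usertype": "Customer", "birth year": 1969},
]
groups = ages_by_user_type(trips)
u_stat, p_value = mann_whitney_u([1, 2, 3], [4, 5, 6])
```

On the real data, the two samples passed to `mann_whitney_u` would be the customer ages and the subscriber ages, and the null hypothesis would be rejected when the returned p-value falls below the chosen significance level.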