Abstract: In this paper, I try to identify outlier neighborhoods in terms of rent price, safety, and commuting time to city centers. I assume that the rental market does not always price neighborhoods perfectly, so some neighborhoods may be undervalued. Using NYC rent data, NYC crime data, and Google Maps, I apply the Local Outlier Factor approach introduced by Breunig, Kriegel, Ng, & Sander (2000). I found seven outliers that might be undervalued or overvalued, and concluded that Little Italy, Battery Park, and Murray Hill (Manhattan) are good deals for renters. On the other hand, East New York, Bedford Stuyvesant, and Tribeca might be overvalued. One limitation should be noted: because I set two specific destinations to calculate commuting time from each neighborhood, these conclusions (overvalued or undervalued) do not apply to people who commute to different destinations.

Introduction: Looking for affordable rooms, many people in NYC live outside the city centers and commute to them. Generally speaking, rent is correlated with distance to the city centers and with the quality of the neighborhood. In this project, I try to find outlier areas where rent is cheap even though the area is safe and close to the city centers. This information is useful for people looking for real estate investment opportunities and for city agencies concerned with areas where gentrification may happen in the near future through Adam Smith’s invisible hand.

I use the Local Outlier Factor to detect the outliers. First, I collect and clean the data. Then, I explain what the Local Outlier Factor is and analyze the results. Finally, I discuss potential improvements to this research as future work.

The relation among rent price, crime, and location has already been well researched. For example, Zhang & Hite (2015) performed regressions and found correlations between crime counts and housing prices when people choose accommodation.
However, research that applies high-dimensional clustering to investigate the relation among rent, crime, and location is still undeveloped. The contribution of my research is that I take a statistical approach and apply a 3-dimensional clustering method to detect rent outliers, which to my knowledge is unique.
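To make the idea of a 3-dimensional Local Outlier Factor concrete, here is a minimal pure-Python sketch. It is not the code used in this research. The neighborhood values are made up for illustration, the z-score standardization step is my assumption (the three features are on very different scales), and the neighborhood of each point is simplified to exactly k neighbors, ignoring distance ties.

```python
import math
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column so rent, crime, and commute time are comparable."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]

def k_nearest(points, i, k):
    """Return (distance, index) pairs for the k nearest neighbors of point i."""
    dists = sorted((math.dist(points[i], p), j)
                   for j, p in enumerate(points) if j != i)
    return dists[:k]

def local_outlier_factor(points, k):
    """Simplified LOF: scores near 1 are inliers, scores well above 1 are outliers.

    Duplicate points (zero distances) are not handled in this sketch.
    """
    n = len(points)
    neighbors = [k_nearest(points, i, k) for i in range(n)]
    k_dist = [nb[-1][0] for nb in neighbors]  # distance to the k-th neighbor
    lrd = []  # local reachability density of each point
    for i in range(n):
        # reachability distance of i from neighbor j: max(k-distance(j), d(i, j))
        reach = [max(k_dist[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF = average density of the neighbors divided by the point's own density
    return [sum(lrd[j] for _, j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical data: (monthly rent in USD, crime count, commute minutes).
toy = [
    (3000, 20, 15), (3100, 22, 16), (2900, 21, 14), (3050, 19, 15), (2950, 23, 17),
    (1500, 60, 45), (1550, 62, 44), (1450, 58, 46), (1600, 61, 47),
    (1500, 20, 15),  # cheap, yet safe and close: the kind of point I look for
]
scores = local_outlier_factor(standardize(toy), k=3)
most_unusual = scores.index(max(scores))
```

In this toy data the last row combines low rent with low crime and a short commute, so it sits away from both groups and receives the highest score. LOF only says that a point is unusual relative to its neighbors; whether it is undervalued or overvalued still has to be judged from the feature values.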
Picture for assignment2
Intuitively, we can assume that young people have more stamina and are more active than older people. In this experiment, we examine whether this assumption holds, using Citibike data to compare the average trip distance of younger and older riders. A t-test showed that the difference is significant, but in the opposite direction from what we expected: the older group has a longer average trip distance than the younger group. As this is not intuitive, we repeated the same approach on a different dataset, and there we did not see a significant difference between the two sample means. We did not change the initial alternative hypothesis (the younger, the more active), so that the analysis stays hypothesis-driven.

Introduction: Generally speaking, young people are more active than older people. However, older people are becoming more active than ever because of rising health consciousness and a high-quality healthcare system. To contribute useful evidence to this discussion, we examine whether younger Citibike riders have a longer average trip distance than older riders.

Citibike is a docked bike-sharing system with stations dotted around NYC, and its usage data is open to the public. Based on the calculated ages of the riders, we set the following hypotheses.
Null hypothesis H0: Older riders (age 31 and over) have the same or a longer average trip distance than younger riders (age 0 to 30).
Alternative hypothesis H1: Younger riders have a longer average trip distance than older riders.
Significance level: 0.05

Data: Citibike data include each user's birth year, start and end station, date, time, etc. We use the January 2015 data. We split the data into two groups: Young (age 30 or under) and Old (age over 30). We dropped records that do not contain age information.
Trip distance is calculated with the Pythagorean theorem from the start and end station coordinates, because the data contain only the latitude and longitude of the stations rather than the distance traveled. The resulting average trip distance is 0.01488 for Over30 and 0.01439 for Under30.
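The distance calculation and the one-sided comparison of the two groups can be sketched as follows. This is an illustrative sketch, not the code used in the experiment. The function names and the toy distances are hypothetical, Welch's unequal-variance form of the t statistic is my assumption, and the p-value uses a normal approximation that is reasonable only for large samples. Note that a straight-line distance on raw latitude and longitude is expressed in degrees, not in a length unit.

```python
import math
from statistics import NormalDist, mean, variance

def euclidean_degrees(lat1, lon1, lat2, lon2):
    """Pythagorean distance between two stations, in degrees of lat/lon."""
    return math.hypot(lat2 - lat1, lon2 - lon1)

def welch_one_sided(young, old):
    """Test H1: mean(young) > mean(old).

    Returns (t, p), where p is a one-sided p-value from a normal
    approximation to the t distribution (an assumption for large samples).
    """
    se = math.sqrt(variance(young) / len(young) + variance(old) / len(old))
    t = (mean(young) - mean(old)) / se
    p = 1.0 - NormalDist().cdf(t)
    return t, p

# Made-up trip distances in degrees, for illustration only.
young_toy = [0.010, 0.012, 0.014, 0.016]
old_toy = [0.013, 0.015, 0.017, 0.019]

d = euclidean_degrees(40.0, -74.0, 40.003, -73.996)
t_stat, p_value = welch_one_sided(young_toy, old_toy)
reject_h0 = p_value < 0.05  # significance level from the text
```

When the older group has the larger sample mean, as in the toy data, the t statistic is negative and the one-sided p-value exceeds 0.5, so H0 is not rejected in favor of H1. A two-sided p-value, which only asks whether the means differ, would be `2 * min(p_value, 1 - p_value)`.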