Spatial Homogeneity of NYC Neighborhoods

Shalmali Kulkarni, skulkarni2, sck408

This paper presents a neighborhood clustering of Manhattan based on characteristics of the physical environment. The research project was carried out as part of the curriculum for the Principles of Urban Informatics class, 2016.


This paper studies the spatial and social homogeneity of neighborhoods to better understand the socio-spatial character of different neighborhoods in Manhattan, New York. The analysis uses two clustering techniques, K-means and DBSCAN, to assess cluster homogeneity. The results show that DBSCAN performs better than K-means and reveal that Manhattan shows a distinct pattern based on the physical characteristics (diversity) of its buildings.


K-means clustering, DBSCAN clustering, Manhattan.

INTRODUCTION (Relevance of the Study)

The relation between urban form and social heterogeneity has a long history in research on cities. Socio-spatial inequality in NYC has long been discussed in city planning, economic development and social justice. This research clusters the zip codes in Manhattan, New York, based on the physical characteristics of the built environment and population distribution. The study tries to understand the clusters that define the famous Manhattan skyline.


The datasets used for this research are the PLUTO data and a zip code geojson file for mapping. This study covers all the zip codes in Manhattan and could later be extended to all the boroughs of New York City. The PLUTO data 2016 for Manhattan was downloaded from the NYC Open Data website. This data has extensive land and geographic information in about 70 attributes such as zoning, land use category, lot area, building area, building frontage, building depth, etc. The final zip code file was also downloaded from NYC Open Data.


The data wrangling process follows the idea of reproducibility and includes the following stages:

PLUTO dataset

  • PLUTO data for 2016 version 2.0 was downloaded in both ‘csv’ and ‘shapefile’ formats. The shapefile was used to create maps.
  • The data was reduced by selecting the attributes required for the research from the list of 70 attributes. Many attributes could be used for this research, but only four were selected for the project.
  • The selected attributes are: number of floors, building frontage, building depth, irregular lot code.
  • All the attributes were grouped by zip code to perform the analysis at the zip code level.
  • The standard deviation of each attribute within a zip code was used instead of the mean, to capture the level of variation around the zip code's mean. For example, the standard deviation of the number of floors for zip code 10001 is 7.798, which means that buildings in this zip code typically differ from the mean number of floors by about 7 to 8 floors. A low standard deviation means that all the buildings are close to the zip code's mean.
  • All these attributes were scaled using ‘MinMaxScaler’ from the sklearn preprocessing package, which rescales every attribute to a common, comparable range (0 to 1 by default). Min-max scaling does not whiten the data: it leaves the correlations between attributes unchanged, so the covariance matrix is not reduced to an identity matrix.
  • A sum of all the standard deviations was calculated as a ‘diversity score’ for each zip code.
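
The wrangling steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the toy records and the column names (zipcode, num_floors, bldg_front, bldg_depth) are hypothetical stand-ins for the PLUTO fields, and the irregular lot code is left out because the text does not say how it was encoded.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for PLUTO lot records; values and column names are hypothetical.
lots = pd.DataFrame({
    "zipcode":    [10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003],
    "num_floors": [4, 12, 30, 5, 6, 7, 3, 10, 15],
    "bldg_front": [20.0, 50.0, 100.0, 25.0, 25.0, 30.0, 20.0, 40.0, 60.0],
    "bldg_depth": [60.0, 90.0, 200.0, 70.0, 75.0, 80.0, 50.0, 100.0, 120.0],
})
features = ["num_floors", "bldg_front", "bldg_depth"]

# Standard deviation of each attribute within each zip code.
zip_std = lots.groupby("zipcode")[features].std()

# Rescale every column to the range [0, 1] so the attributes are comparable.
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(zip_std),
    index=zip_std.index,
    columns=features,
)

# Diversity score: sum of the scaled standard deviations per zip code.
scaled["diversity_score"] = scaled[features].sum(axis=1)
print(scaled)
```

With three scaled attributes, the score in this sketch ranges from 0 (least varied zip code on every attribute) to 3 (most varied on every attribute).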

Zip code shapefile

  • The geojson file was downloaded to help create maps, as it contains all the necessary geographic information. This file was merged with the PLUTO data.

A final dataset including all the required fields was exported as a CSV file for later use.
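
One possible way to perform the merge and export is sketched below. It is only an illustration: the tiny in-memory GeoJSON, the "postalCode" property name and the diversity_score values are assumptions made for the example, and the study may well have used a different tool (for instance a dedicated geospatial library) for this step.

```python
import pandas as pd

# Minimal GeoJSON-like structure; the "postalCode" property name is an assumption.
triangle = {"type": "Polygon", "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]}
zip_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"postalCode": "10001"}, "geometry": triangle},
        {"type": "Feature", "properties": {"postalCode": "10002"}, "geometry": triangle},
        {"type": "Feature", "properties": {"postalCode": "10003"}, "geometry": triangle},
    ],
}

# One row per zip code polygon, keyed by an integer zip code.
zip_shapes = pd.DataFrame([
    {"zipcode": int(f["properties"]["postalCode"]), "geometry": f["geometry"]}
    for f in zip_geojson["features"]
])

# Hypothetical per-zip-code scores derived from the PLUTO attributes.
zip_scores = pd.DataFrame({"zipcode": [10001, 10002], "diversity_score": [2.4, 0.3]})

# Inner join keeps only zip codes present in both sources.
merged = zip_shapes.merge(zip_scores, on="zipcode", how="inner")

# Tabular export (geometry dropped); returned as text here instead of written to disk.
csv_text = merged.drop(columns="geometry").to_csv(index=False)
print(csv_text)
```

An inner join is used here so that zip codes without PLUTO records drop out; a left join would keep their polygons with missing scores instead.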


Exploratory Analysis:

A colormap for each of the attributes was made to understand the emerging patterns. The number-of-floors map (ref: Figure 1) shows the highest building heights in the Financial District of lower Manhattan and in Midtown. Because the map is plotted using the standard deviations, it shows the variations clearly. The 'diversity score' (ref: Figure 2), defined as the sum of the normalized standard deviations of all the attributes considered, visually shows some cluster formation.

This research uses two clustering techniques from the sklearn package for further analysis.

  • K-means Clustering:

The K-means algorithm clusters the data into groups of equal variance. The algorithm requires the number of clusters to be specified in advance. Silhouette analysis was performed to find the optimal number of clusters. This analysis suggested 3 or 4 clusters, based on the maximum score. The study explores four clusters.
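
The silhouette analysis described above could look roughly like the following sketch. It runs on synthetic data from make_blobs rather than on the zip-code features, and the range of candidate cluster counts (2 to 7) is an assumption chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled zip-code features (not the PLUTO data).
X, _ = make_blobs(n_samples=60, centers=4, cluster_std=0.5, random_state=0)

# Fit K-means for each candidate number of clusters and record the mean silhouette score.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest mean silhouette score is the suggested cluster count.
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("suggested number of clusters:", best_k)
```

When two candidates score almost equally, as reported here for 3 and 4 clusters, the final choice is a judgment call informed by how interpretable the resulting groups are.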

Silhouette Score for K-Means Clustering