1. Introduction
In this study, we seek a better understanding of dynamic building energy consumption in New York City through regression models and clustering techniques. To predict energy consumption at the building level, we used Support Vector Machine, Random Forest, and K-Nearest Neighbors. For clustering, we group buildings based on their energy performance and detect trends in annual energy consumption. From there, we identify characteristics shared by the buildings within each cluster, which can inform regulations and policy decisions on energy management for the whole city. For this purpose, K-means is applied to four years of general energy consumption data.
2. Literature Review
Energy consumption varies greatly across both residential and commercial buildings, depending on the buildings’ construction characteristics, locations, technical design parameters, and occupant behaviors. The U.S. Energy Information Administration reports that the energy consumed in residential and commercial buildings accounts for 40% of total U.S. energy consumption. Hence, any reduction or control in building energy consumption can have a great impact on total energy consumption.
A study that did similar work (Kontokosta, 2015) used PLUTO data and LL84 data, along with other datasets such as the COSTAR dataset. That study obtained better regression results, since its authors had access to more data than we do. The best R² score from their study is 30%, whereas our best result is up to 20%, using the random forest regression algorithm. Furthermore, another project from a computer science class (“CS109 Project - Building Energy Consumption Prediction,” n.d.) predicted energy consumption from weather features using similar methods. Since we aim to predict annual weather-normalized data, we considered their methods when looking for ways to improve our prediction results with our own dataset.
Beyond prediction, we clustered the time series data and analyzed the building characteristics in each cluster; we found some interesting results, which are discussed in the results section. We also used both the SVM and Random Forest methods to generate lists of important features. Interestingly, the lists differ across methods as well as across building types. Using only the selected features, our regression score rose to as much as 20%.
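As an illustration of how a Random Forest can produce a ranked feature list, the following is a minimal sketch using scikit-learn. It is not the code used in this study: the function name `rank_features`, the generic feature names, and the synthetic data are all assumptions made only so the example runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def rank_features(X, y, names, random_state=0):
    """Fit a random forest and return (name, importance) pairs, highest first.

    Hypothetical helper for illustration; not the study's actual code.
    """
    model = RandomForestRegressor(n_estimators=100, random_state=random_state)
    model.fit(X, y)
    # feature_importances_ holds the impurity-based importance of each column
    pairs = zip(names, model.feature_importances_)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Synthetic placeholder data (NOT the LL84/PLUTO data): the target depends
# almost entirely on the first column, so that column should rank first.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 5.0 * X_demo[:, 0] + 0.1 * rng.normal(size=300)
ranking = rank_features(X_demo, y_demo, ["x0", "x1", "x2"])
```

Keeping only the top-ranked columns and refitting is one simple way to use such a list for feature selection.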
3. Data
We used New York City’s Local Law 84 (LL84) data (Department of Citywide Administrative Services, 2016), provided by NYC’s Department of Finance as a result of the Energy Benchmarking requirement that commercial, residential, or mixed-use buildings whose gross floor area is 50,000 square feet or more report their annual energy and water consumption to the city. We also used the PLUTO dataset (Department of City Planning (DCP), 2016), which provides insight into building characteristics, including but not limited to the year built, number of floors, lot area, and the percentage of each fuel type.
For the regression part, we used the 2015 LL84 and PLUTO datasets. After merging the two datasets, we chose “Weather Normalized Site EUI (kBtu/ft²)” as our dependent variable. To determine the independent variables, we referred to Professor Kontokosta’s related work (Kontokosta, 2015) and consulted him personally. The selected features are shown below, with the data source given in parentheses:
- BuiltFAR: The Built Floor Area Ratio (FAR) is the total building floor area divided by the area of the tax lot. (PLUTO database)
- YearBuilt (PLUTO database)
- DOF Property Floor Area (ft²) (PLUTO database)
- LotType: A code indicating the location of the tax lot to another tax lot and/or the water. (0: Mixed or Unknown, 1: Block Assemblage, 2: Waterfront, 3: Corner, 4: Through, 5: Inside, 6: Interior Lot, 7: Island Lot, 8: Alley Lot, 9: Submerged Land Lot) (PLUTO database)
- ProxCode: The physical relationship of the building to neighboring buildings. (0: Not Available, 1: Detached, 2: Semi-Attached, 3: Attached) (PLUTO database)
- Zip Code (LL84)
- LotArea: Total area of the tax lot, expressed in square feet rounded to the nearest integer. (PLUTO database)
- NumFloors: Number of floors (PLUTO database)
- Oil/Diesel/Water/Gas/Electricity: A binary variable equal to 1 for the dominant fuel type/energy source in the building, and equal to 0 otherwise. A fuel type/energy source is considered dominant if it accounts for more than 50% of the building’s total site energy consumption. (PLUTO database)
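The dominant-fuel indicator in the last item of the list above can be sketched as follows. This is a minimal example under stated assumptions: the column names in `FUEL_COLS` and the function `dominant_fuel_flags` are hypothetical, and each column is assumed to hold that source's site energy consumption in a common unit.

```python
import pandas as pd

# Hypothetical column names; the real dataset may label these differently.
FUEL_COLS = ["Oil", "Diesel", "Water", "Gas", "Electricity"]


def dominant_fuel_flags(usage):
    """Return 0/1 flags marking the source with more than 50% of total use."""
    total = usage[FUEL_COLS].sum(axis=1)
    # Share of each source in the building's total consumption
    shares = usage[FUEL_COLS].div(total, axis=0)
    # Strictly more than 50%: an exact 50/50 split yields no dominant source.
    # Rows with zero total give NaN shares, which compare as False (flag 0).
    return (shares > 0.5).astype(int)
```

Because the threshold is strict, at most one flag per building can equal 1, and a building with no majority source has all flags equal to 0.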
With the independent and dependent variables ready, we handled outliers by removing records whose natural log of the dependent variable lies more than two standard deviations from the mean. To improve the results of our regression models, we also divided the dataset by building type into two groups: commercial and residential. After removing the outliers, the commercial dataset has 667 records, whereas the residential dataset has 5745 records.
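The outlier rule described above can be sketched as follows. This is a minimal example, not the study's actual code: the function name `remove_log_outliers` is hypothetical, and dropping non-positive values before taking the log is an assumption added here because the logarithm is undefined for them.

```python
import numpy as np
import pandas as pd

EUI_COL = "Weather Normalized Site EUI (kBtu/ft²)"


def remove_log_outliers(df, column=EUI_COL, n_std=2.0):
    """Keep rows whose log value lies within n_std standard deviations of the mean."""
    # Assumption: non-positive values are dropped, since log is undefined there
    positive = df[df[column] > 0]
    log_values = np.log(positive[column])
    mean, std = log_values.mean(), log_values.std()
    # Distance from the mean, measured on the log scale
    keep = (log_values - mean).abs() <= n_std * std
    return positive[keep]
```

Working on the log scale keeps a few very large values from inflating the standard deviation as much as they would on the raw scale.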