EC Project Report

Abstract


House buyers often do not know whether the house they want is selling at a premium. Foreign buyers in particular are unfamiliar with the history of local house prices. This project asks how the features of a home add up to its price tag. In addition, it uses the correlations between price and home features to build a prediction model.

Data


The data set describes residential homes in Ames, Iowa. It is stored in CSV format and contains 1460 records with 79 explanatory variables.

Methodology


The project used LASSO (least absolute shrinkage and selection operator) to train the prediction model. LASSO is a regression method that performs both variable selection and regularization by penalizing the absolute size of the regression coefficients. Under this penalty some coefficients shrink to exactly zero, which makes LASSO a convenient tool for selecting features.
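
To illustrate why the penalty can set coefficients to exactly zero, here is a minimal sketch of LASSO solved by cyclic coordinate descent on synthetic data. It is not the project's code: the function names, the 1/(2n) scaling of the squared error, the value of alpha, and the synthetic data are all assumptions chosen for illustration, and the sketch omits an intercept.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; any value with |z| <= t becomes exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + alpha * ||b||_1 (no intercept)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n  # per-feature scale x_j'x_j / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back
            r_j = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r_j / n
            # The soft threshold is what produces exact zeros
            b[j] = soft_threshold(rho, alpha) / col_sq[j]
    return b

# Synthetic data (hypothetical): only the first three features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.zeros(10)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(scale=0.5, size=200)

coef = lasso_coordinate_descent(X, y, alpha=0.1)
n_zero = int(np.sum(coef == 0.0))
print(np.round(coef, 2))
print("coefficients set exactly to zero:", n_zero)
```

Features whose correlation with the residual stays below alpha are dropped entirely, which is the variable-selection behaviour described above.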

Implementation


There are three main steps in implementing the prediction model. First, we preprocess the data before training. We used an existing Python method to check whether each numeric feature is skewed, and applied a log transform to the skewed features to bring them closer to a normal distribution. We used one-hot encoding to deal with categorical features; it transforms each categorical feature with n values into n binary features. After that, we replaced missing values with the mean of the respective feature.
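
The preprocessing step could look roughly like the sketch below, written with pandas and NumPy. The function name preprocess and the skewness cutoff of 0.75 are hypothetical, since the report does not state which skewness check or threshold was used. The sketch also uses log(1 + x) instead of a plain log so that zero-valued entries stay defined, and it assumes the skewed features are non-negative.

```python
import numpy as np
import pandas as pd

def preprocess(df, skew_threshold=0.75):
    """Log-transform skewed numeric columns, one-hot encode, then mean-impute.

    skew_threshold is an assumed cutoff, not a value taken from the report.
    """
    df = df.copy()
    numeric_cols = df.select_dtypes(include="number").columns
    # Sample skewness of each numeric column (missing values are skipped)
    skewness = df[numeric_cols].skew()
    skewed_cols = list(skewness[skewness > skew_threshold].index)
    if skewed_cols:
        # log1p(x) = log(1 + x); assumes these columns are non-negative
        df[skewed_cols] = np.log1p(df[skewed_cols])
    # A categorical column with n distinct values becomes n binary columns
    df = pd.get_dummies(df, dtype=float)
    # Replace remaining missing values with the mean of each column
    return df.fillna(df.mean())
```

Calling preprocess on the raw feature table returns an all-numeric table with no missing values, which is the form a LASSO model needs.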

Second, we choose the optimal alpha for our LASSO model. Alpha controls the trade-off between minimizing the RSS (residual sum of squares) and limiting the magnitude of the coefficients. We used cross-validation to find the optimal alpha; a model with a well-chosen alpha should score well on both in-sample and out-of-sample R-squared.

Finally, we fit the model on our training data and test its performance.
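
The second and third steps can be sketched together with scikit-learn's LassoCV, which picks alpha by cross-validation and then refits on the training data. The function name fit_lasso_with_cv, the 80/20 train/test split, the five folds, and the candidate alphas passed by the caller are assumptions, because the report does not give these settings.

```python
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

def fit_lasso_with_cv(X, y, alphas, seed=0):
    """Choose alpha by 5-fold CV, then report R-squared on train and test data.

    The split ratio, fold count, and alpha grid are illustrative choices.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # LassoCV tries each candidate alpha and keeps the best CV performer
    model = LassoCV(alphas=alphas, cv=5, max_iter=50000)
    model.fit(X_train, y_train)
    r2_train = model.score(X_train, y_train)  # R-squared on training data
    r2_test = model.score(X_test, y_test)     # R-squared on held-out data
    return model, r2_train, r2_test
```

After fitting, model.alpha_ holds the selected alpha and model.coef_ holds the coefficients. A large gap between the two R-squared values would suggest that alpha is too small and the model is overfitting.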

Analysis


We first want to find out which coefficients matter most. A positive coefficient means that the feature increases the predicted house price; a negative coefficient means that the feature decreases it. We list the 10 most positive and the 10 most negative coefficients.
Figure 1: Coefficients in LASSO model
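
A ranking like the one in Figure 1 can be produced from a fitted model's coefficients with a few lines of pandas. This is a sketch under assumptions: the function name top_coefficients is hypothetical, and coef and feature_names stand for the fitted coefficient vector and the matching column names after preprocessing.

```python
import pandas as pd

def top_coefficients(coef, feature_names, k=10):
    """Return the k most positive and the k most negative coefficients."""
    coefs = pd.Series(coef, index=feature_names)
    # Coefficients that LASSO set to zero are excluded from both lists
    most_positive = coefs[coefs > 0].nlargest(k)
    most_negative = coefs[coefs < 0].nsmallest(k)
    return most_positive, most_negative
```

Because skewed features are log-transformed and categorical features are one-hot encoded, each coefficient is best read as the effect on the predicted price with all other features held fixed.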