Determining Factors that Affect a Restaurant’s Yelp Rating

Abstract

This project aims to identify factors that affect a restaurant’s Yelp rating. We restrict our analysis to restaurants in Pittsburgh, Pennsylvania. We use Lasso regression to select the best predictors among a variety of business features. Because restaurants differ in location, we also perform spatial analysis to locate clusters of restaurants with high Yelp ratings. We then verify the selected features by comparing the hot-spot analysis of each feature with that of our target variable, the Yelp rating. If the two overlap, we consider the feature a valid factor influencing a restaurant’s rating.

Keywords: Logistic regression, principal component analysis, lasso regression, hot spot analysis, kernel density

Introduction

Yelp is well-known for its ability to provide unparalleled word-of-mouth advertising to small businesses. A 2012 study by two UC Berkeley economists, Michael Anderson and Jeremy Magruder, showed that an increase in a restaurant’s Yelp rating from \(3.5\) to \(4\) stars can result in as much as a \(19\) percent increase in bookings during peak hours.

The purpose of our project is to find the top features that can affect a restaurant’s Yelp rating. This will allow us to provide essential insights for restaurant owners on how to improve their Yelp ratings and hence, improve their businesses as a whole. Yelp can also use this information to evaluate new restaurant listings.

Hypothesis

We define a feature as having a significant impact on a restaurant’s Yelp rating if its coefficient in a Lasso regression has a p-value below a significance level of \(0.05\). We also hypothesize that the hot-spot analysis and kernel density maps for significant features overlap with those of the average rating across Pittsburgh census tracts.

Data Manipulation

Data Collection

The primary data used for our analysis is provided by the Yelp Dataset Challenge and comes in five json files: \(business\), \(reviews\), \(check\)-\(in\), \(user\), and \(tips\). We use only the business, reviews, and check-in files. These contain, respectively, descriptive data about each restaurant (location, type of cuisine, parking availability, etc.), user reviews of the restaurant, and the number of check-ins by time of day and day of week.

The Pittsburgh census tracts shapefile, along with other shapefiles such as the Pittsburgh river shapefile and the Allegheny County census tracts, was downloaded from the Department of City Planning, City of Pittsburgh.

Data Wrangling

We begin by converting the \(business\) json file to a pandas DataFrame. This serves as our main DataFrame. We select only restaurants from Pittsburgh, Pennsylvania. We flatten the nested dictionaries from the json file into a one-line string format and convert missing data to the appropriate representation in Python, which is \(numpy.nan\).
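As an illustration, the wrangling step above can be sketched as follows. The sample records, the field names, and the use of pandas.json_normalize (which spreads nested dictionaries across dotted column names instead of serializing them to a single string) are assumptions made for this sketch, not a description of the code we actually ran.

```python
import json

import numpy as np
import pandas as pd

# Hypothetical sample lines standing in for the business json file.
raw_lines = [
    '{"business_id": "a1", "city": "Pittsburgh", "state": "PA", '
    '"categories": ["Restaurants", "Italian"], "stars": 4.0, '
    '"attributes": {"Parking": {"lot": true, "street": false}, "Attire": "casual"}}',
    '{"business_id": "b2", "city": "Pittsburgh", "state": "PA", '
    '"categories": ["Restaurants"], "stars": 3.5, "attributes": {"Attire": null}}',
    '{"business_id": "c3", "city": "Phoenix", "state": "AZ", '
    '"categories": ["Restaurants"], "stars": 4.5, "attributes": {}}',
    '{"business_id": "d4", "city": "Pittsburgh", "state": "PA", '
    '"categories": ["Hair Salons"], "stars": 5.0, "attributes": {}}',
]
records = [json.loads(line) for line in raw_lines]

# Keep only Pittsburgh, PA businesses that are tagged as restaurants.
pittsburgh = [
    r for r in records
    if r["city"] == "Pittsburgh"
    and r["state"] == "PA"
    and "Restaurants" in r["categories"]
]

# Flatten nested dictionaries into dotted column names,
# for example "attributes.Parking.lot".
business_df = pd.json_normalize(pittsburgh)

# Represent missing values (None or absent keys) as numpy.nan.
business_df = business_df.fillna(np.nan)
```

In this toy input, only the first two records survive the filter, and the second one has missing values for the attributes it does not report.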

We extract the occurrence of keywords such as good, bad, or excellent from each review for each restaurant in the \(reviews\) json file. We also extract the corresponding \(business\) \(id\) and combine them to form a two-column DataFrame. We do the same thing to extract the number of check-ins for each restaurant from the \(check\)-\(in\) json file. We merge the two resulting DataFrames with our main DataFrame by their common \(business\) \(id\) column using the merge function from pandas.
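A minimal sketch of the keyword extraction and merge, assuming hypothetical review records with business_id and text fields and a hypothetical checkins column. It counts several keywords at once, whereas the DataFrame described above holds one count per restaurant.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical review records; the field names are assumptions.
reviews = [
    {"business_id": "a1", "text": "Excellent pasta, good service. Excellent!"},
    {"business_id": "a1", "text": "Not bad, but not good either."},
    {"business_id": "b2", "text": "Bad coffee."},
]
keywords = ["good", "bad", "excellent"]

# Tally whole-word, case-insensitive keyword occurrences per restaurant.
counts = {}
for review in reviews:
    words = re.findall(r"[a-z']+", review["text"].lower())
    tally = counts.setdefault(review["business_id"], Counter())
    tally.update(word for word in words if word in keywords)

keyword_df = pd.DataFrame(
    [
        {"business_id": bid, **{k: tally[k] for k in keywords}}
        for bid, tally in counts.items()
    ]
)

# Hypothetical main DataFrame and per-restaurant check-in totals.
main_df = pd.DataFrame({"business_id": ["a1", "b2", "c3"], "stars": [4.0, 3.5, 4.5]})
checkin_df = pd.DataFrame({"business_id": ["a1", "b2"], "checkins": [120, 35]})

# Left merges keep every restaurant, even one with no reviews or check-ins.
merged = main_df.merge(keyword_df, on="business_id", how="left").merge(
    checkin_df, on="business_id", how="left"
)
```

A left merge is used here so that restaurants without reviews or check-ins stay in the main DataFrame with missing values rather than being dropped.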

Our data contains many categorical features stored as strings, which most machine learning algorithms cannot use directly. Therefore, using the LabelEncoder class from sklearn’s preprocessing module, we transform the string values in our categorical features into integers. The complete definition for each feature can be found in the codebook.
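The encoding step can be sketched as below. The column names and category values are hypothetical examples, not entries taken from the codebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical columns stored as strings.
df = pd.DataFrame(
    {
        "Attire": ["casual", "dressy", "casual", "formal"],
        "Alcohol": ["none", "full_bar", "beer_and_wine", "none"],
    }
)

# Fit one encoder per column and keep it so the integer codes
# can be mapped back to the original strings later.
encoders = {}
for column in ["Attire", "Alcohol"]:
    encoder = LabelEncoder()
    df[column] = encoder.fit_transform(df[column])
    encoders[column] = encoder
```

LabelEncoder assigns integers in sorted order of the category names, so the codes carry no meaningful ranking. A linear model will nonetheless treat them as ordered numbers, which is worth keeping in mind when reading the coefficients.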

Finally, we add a \(geometry\) column to our data by combining the \(longitude\) and \(latitude\) columns into a shapely.geometry.Point object. We convert our DataFrame to a GeoPandas GeoDataFrame, which allows us to output our cleaned data as a shapefile that can be used for spatial analysis.

Regression Models

Logistic Regression

We begin our analysis by using logistic regression to assess how well our features predict our target variable. We first reduce the dimensionality of our data using principal component analysis with \(n=8\) components. Then, we use the train_test_split function from sklearn’s model_selection module to divide our data into training and test sets.
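A minimal sketch of this preprocessing on synthetic data. The feature matrix, the labels, the 25 percent test share, and the random seed are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))      # synthetic stand-in for the feature matrix
y = rng.integers(2, 11, size=200)   # synthetic labels: star rating times two

# Project the features onto the first eight principal components.
pca = PCA(n_components=8)
X_reduced = pca.fit_transform(X)

# Hold out a quarter of the rows as the test set (the share is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.25, random_state=0
)
```

The sketch follows the order described above, with PCA applied before the split. Fitting PCA on the training rows only is a common alternative that keeps test-set information out of the projection.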

We train our logistic regression model on the training set with the regularization parameter set to \(10,000\). We evaluate our model using the test set. The accuracy of our model came out to be \(30.46\%\). The precision of our model is heavily influenced by the number of cases for each rating. Since most restaurants have a Yelp rating of \(3.5\) or \(4.0\), our model tends to be significantly more precise for these cases, with precision of up to \(67.13\%\), than for more extreme cases such as a Yelp rating of \(1.0\) or \(5.0\).
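The training and evaluation step might look like the sketch below, which uses synthetic data. Two assumptions are made here: the regularization parameter is taken to be sklearn's C argument (the inverse of the regularization strength), and the half-star ratings are encoded as integers because sklearn classifiers expect discrete labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))  # synthetic stand-in for the eight components

# Synthetic half-star ratings, skewed toward 3.5 and 4.0 as in our data.
stars = rng.choice(
    [2.5, 3.0, 3.5, 4.0, 4.5], size=300, p=[0.10, 0.15, 0.30, 0.30, 0.15]
)
# Encode ratings as integer class labels (3.5 stars becomes 7, and so on).
y = (stars * 2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# C is the inverse regularization strength, so C=10,000 means
# very weak regularization.
model = LogisticRegression(C=10_000, max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# One precision value per rating class; classes that are never
# predicted get a precision of 0.
per_class_precision = precision_score(
    y_test, y_pred, average=None, labels=model.classes_, zero_division=0
)
```

Reporting precision per class, rather than a single average, makes the imbalance between common and rare ratings visible.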

Lasso Regression

We use the train_test_split function to further divide our data into training, test, and validation sets. We train our lasso regression model on the training set. We set the hyperparameter \(\lambda\) of the regression to the value that maximizes the average out-of-sample \(R^{2}\) after performing cross-validation on our data set a thousand times. The resulting average out-of-sample \(R^{2}\) is \(0.7822\), which is extremely high.
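A minimal sketch of choosing \(\lambda\) by cross-validated \(R^{2}\), on synthetic data. In sklearn the lasso penalty is called alpha. The candidate grid and the single 5-fold run are assumptions made to keep the example short; they stand in for the thousand repetitions described above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Synthetic target that depends on only three of the ten features.
true_coef = np.array([1.5, -2.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

# Hypothetical grid of candidate penalties.
candidate_lambdas = [0.001, 0.01, 0.1, 1.0, 10.0]

# Average out-of-sample R^2 over the five folds for each candidate.
mean_r2 = {}
for lam in candidate_lambdas:
    scores = cross_val_score(Lasso(alpha=lam), X, y, cv=5, scoring="r2")
    mean_r2[lam] = scores.mean()

# Keep the penalty with the highest average out-of-sample R^2.
best_lambda = max(mean_r2, key=mean_r2.get)
```

Large penalties shrink every coefficient toward zero and lower the out-of-sample \(R^{2}\), so the search settles on one of the smaller candidates for this synthetic target.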

We check the regression coefficients for both the highest absolute values and the most significant p-values, and find that the following eight features from our data have a significant impact on a restaurant’s Yelp rating: \(Attire\), \(Price\) \(Range\), \(Delivery\), \(Coat\) \(Check\), \(Corkage\), \(Takes\) \(Reservations\), \(Waiter\) \(Service\), and \(Excellent\). These features correspond to a restaurant’s dress code, price range, existence of delivery, coat racks, corkage, reservations, waiter service, and finally, the occurrence of the word “excellent” in the restaurant’s reviews, respectively.
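The magnitude part of this check can be sketched as follows. The data are synthetic, the penalty value is arbitrary, and the last two feature names are hypothetical. sklearn's Lasso does not report p-values, so this sketch covers only the ranking by absolute coefficient.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# The first three names appear in our data; the last two are hypothetical.
feature_names = ["Attire", "Price Range", "Delivery", "Noise Level", "Wi-Fi"]

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
# Synthetic target driven by the first three features only.
y = (
    0.9 * X[:, 0]
    - 0.6 * X[:, 1]
    + 0.4 * X[:, 2]
    + rng.normal(scale=0.3, size=300)
)

# Standardize so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
model = Lasso(alpha=0.05).fit(X_scaled, y)

# Rank features by absolute coefficient and keep those the lasso
# did not shrink to exactly zero.
ranked = sorted(
    zip(feature_names, model.coef_), key=lambda pair: abs(pair[1]), reverse=True
)
selected = [name for name, coef in ranked if coef != 0.0]
```

Because the lasso sets weak coefficients to exactly zero, the list of nonzero coefficients doubles as a feature selection.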

Spatial Analysis

Hot-spot Analysis

We spatially join the resulting shapefile from the data wrangling section with the Pittsburgh census tracts. We dissolve the average value of each feature into the census tracts. Then, we perform a hot-spot analysis using ArcGIS on the average Yelp rating. The results of this analysis can be found in figure 1. The GiZScore in the figure represents the \(z\)-scores for the feature tested, in this case, the average Yelp rating. For statistically significant positive \(z\)-scores, the larger the \(z\)-score, the more intense the clustering of high values (hot spot). Conversely, for statistically significant negative \(z\)-scores, the smaller the \(z\)-score, the more intense the clustering of low values (cold spot).
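To make the \(z\)-scores concrete, the sketch below computes the Getis-Ord \(G_i^*\) statistic, on which the ArcGIS hot-spot tool is based. The six tracts, their ratings, and the simple adjacency weights are hypothetical; ArcGIS offers several ways of defining spatial relationships, and this toy matrix is not the one used in our analysis.

```python
import numpy as np


def getis_ord_gi_star(values, weights):
    """Return the Getis-Ord Gi* z-score for every location.

    weights[i, j] is the spatial weight between locations i and j,
    with each location counted as its own neighbour.
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = x.size
    mean = x.mean()
    s = np.sqrt((x ** 2).mean() - mean ** 2)
    w_sum = w.sum(axis=1)
    w_sq_sum = (w ** 2).sum(axis=1)
    numerator = w @ x - mean * w_sum
    denominator = s * np.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
    return numerator / denominator


# Six hypothetical tracts in a row: high ratings on the left, low on the right.
ratings = np.array([4.5, 4.4, 4.3, 3.0, 2.9, 2.8])
n_tracts = ratings.size

# Binary weights: each tract neighbours itself and the adjacent tracts.
weights = np.zeros((n_tracts, n_tracts))
for i in range(n_tracts):
    for j in range(n_tracts):
        if abs(i - j) <= 1:
            weights[i, j] = 1.0

z_scores = getis_ord_gi_star(ratings, weights)
```

In this toy layout the second tract, surrounded by high ratings, receives a \(z\)-score above \(1.96\) (a hot spot at the \(0.05\) level), while the fifth tract receives the mirror-image negative score (a cold spot).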

The hot-spot analysis serves as a way to verify the features selected from the regression. By performing hot-spot analysis on each of the selected features, we can check whether the locations of its hot and cold spots overlap with those of figure 1. A match would indicate that the feature has the same spatial distribution as the Yelp ratings and is thus a good predictor. However, this does not seem to be the case, as the results from our hot-spot analyses do not overlap.

Kernel Density Analysis

We also create kernel density maps of the distribution of restaurants by star level. The procedure locates the star-weighted mean center of the restaurants, measures the distance from each point to that center, and derives the search radius from the standard distance and the median of those distances. It then calculates the density of restaurants around each raster cell covering the Pittsburgh census tracts. We produce two density maps: one of all restaurants and one of restaurants rated \(4\) stars or above. The results can be seen in figure 3 and figure 4. Comparing the two maps shows where restaurants of different star levels are located.
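The search-radius calculation can be sketched as below. The formula is the default bandwidth rule as we understand it from the ArcGIS Kernel Density documentation, and it is an assumption that our maps used this default. The sketch is unweighted and the coordinates are hypothetical.

```python
import numpy as np


def default_search_radius(points):
    """Default kernel search radius from the mean center of the points.

    Assumed rule: 0.9 * min(SD, sqrt(1 / ln 2) * Dm) * n ** -0.2, where SD is
    the standard distance and Dm the median distance from the mean center.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    center = pts.mean(axis=0)
    distances = np.linalg.norm(pts - center, axis=1)
    median_distance = np.median(distances)
    standard_distance = np.sqrt((distances ** 2).mean())
    scale = min(standard_distance, np.sqrt(1 / np.log(2)) * median_distance)
    return 0.9 * scale * n ** -0.2


# Four hypothetical restaurant locations on the corners of a square.
points = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
radius = default_search_radius(points)
```

Taking the smaller of the standard distance and the scaled median distance keeps a few distant outliers from inflating the radius.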

Conclusion

Based on our analysis, we conclude that the eight features we found have a significant impact on a restaurant’s Yelp rating: their coefficients are large in absolute value and significant at the \(0.05\) level.

Based on our spatial analysis, we find that location may be reflected in a restaurant’s star level. The hot-spot clusters and the restaurants rated \(4\) stars or above are mainly concentrated in the central business district (CBD) and in blocks alongside the rivers. This suggests that restaurants in these areas are more likely to be rated highly on Yelp.

However, our proposed method of using hot-spot analysis to verify our results failed, as none of the feature maps matched that of our target variable.

Contributions

Kevin Han: Clean data and transform categorical data from string format to integers. Extract Yelp reviews and check-ins. Interpret hot-spot analysis results. Write the final paper.

Cheng Hou: Propose the project goals and the data analysis methods needed. Merge and convert the original data to a usable format. Run the regression models. Write the final paper.

Xiaomeng Dong: Propose project theme, find data and perform spatial analysis of hot-spot and kernel density. Write the final paper.

Yue Cai: Get involved with sorting and analyzing data, contribute ideas to the final paper and finish the codebook.

Appendix

Figures

See attached.

Figure 1: Hot-Spot Analysis for Average Yelp Rating in each Census Tract

Figure 2: Hot-Spot Analysis for Occurrence of the Word “Excellent” in Yelp Reviews

Figure 3: Kernel Density Map for All Restaurants

Figure 4: Kernel Density Map for Restaurants Rated 4 Stars or Above

Codebook

The codebook can be found via the following link: http://github.com/yc2839/onemore/blob/master/codebook.txt.
