• Home
  • About
    • Angie Shen photo

      Angie Shen

      Duke Senior, Statistics Researcher

    • Learn More
    • Email
    • LinkedIn
    • Github
  • Posts
    • All Posts
    • All Tags
  • Research
  • CV

Yelp Data Challenge

13 Aug 2017

Our analysis of the popular Yelp dataset consists of two parts: ratings prediction and visualizing food trends around the world.

1. Ratings Prediction

We aim to predict the ratings of a review based on the text of the review. As the data is very big, we restrict our analysis to restaurants in Charlotte, which has a total of 122,322 reviews.

We choose our features to be words that are the most predictive of the rating, i.e. words that convey a strong negative or positive sentiment such as “great” and “awful”. First, we extract salient words from all reviews by calculating the TF/IDF score for all words.

Second, we matched our list of words with a list of words which convey strong positive and negative sentiments provided by the Multi-Perspective Question Answering (MPQA) Subjectivity Lexicon at University of Pittsburgh (http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/). There are a total of 2,874 words in our reviews that convey strong positive and negative sentiments. To reduce the dimension of our feature space,we choose 1,000 words with the with the highest TF/IDF score as some of the words with low TF/IDF score, such as “ignominiously” and “sanguine” do not appear in many reviews would not be good features for prediction. Figure 1 shows word clouds of 200 positive words and 200 negative words with the highest TF/IDF. Some of the highest score negative words were: little, bad, disappointed, long, extremely, bland, rude, terrible, cheap. Some of the highest score positive words were: great, amazing, awesome, love, best, friendly, nice, delicious.

Third, we use this list of 1,000 words to create a matrix of counts, i.e. we count the number of times each word appears in each review. We end up with a 122,322 by 1,000 sparse matrix. We filter out the all-zero rows. Figure 2 shows a histogram of the numbers of feature words each review contains. We can see that most review contains fewer than 10 of the feature words. The feature matrix is sparse.

We divide the data into 90% training set and 10% test set. We end up with 34,638 observations in the training set and 3,848 observations in the test set. We used the following regression algorithms:

2. Visualize food trends around the world

We visualize what food categories were most popular for each state, as well we for different countries in the Yelp dataset. We visualize the food preference in US cities (by state) and international cities (by country) through pie charts and geographical maps. We define most popular as having the most reviews and also the highest average stars ratings. We also visualize how many people review cheap and expensive restaurants by state and country.

Here is a summary of interesting findings:

US: ● Americans Yelp a lot about “breakfast & brunch” ● Food preference is relatively similar across the US (compared to food preference across different countries) - favorite categories are “bar”, “nightlife”, “American (new and traditional)”, “breakfast & brunch”, “pizza” ● Urbana-Champaign (Chicago area) and Pittsburgh like Pizza more than other cities ● “Nightlife” is more well-liked than “Bar” in Las Vegas, Charlotte, and Pittsburgh

International: ● Nightlife is popular in all the cities except Montreal ● Europeans don’t like breakfast at all ● Canadians also like breakfast and brunch a lot (similar to the US) ● Waterloo, ON likes Japanese food even more than breakfast ● Germans like Italian food (but not as much as German food) ● Unsurprisingly: Edinburgh has a separate “pub” category besides “bars”, Montreal favors French food, Edinburgh likes tea, Germans and Brits drink a lot…



Like Tweet +1