Data Mining Capstone Task 1
January 6, 2018

1 Abstract

In this task, I used Python toolkits to analyse the reviews and to get an overview of their topics and of the review star distribution. In particular, I used packages such as gensim and sklearn for the topic extraction. For data visualisation, I used D3 for all the drawings.

2 Implementations

2.1 Topic mining of all restaurant reviews (Task 1.1)

To understand what people are talking about in the review data, I used an LDA topic model to extract 10 topics from all restaurant reviews. After filtering the reviews to keep only those for restaurants, I applied TfidfVectorizer to vectorize the raw review text. The TF transformation was kept linear, and IDF reweighting was enabled. I set the n-gram range to 1 to 2, i.e. terms of one or two words were collected.

Using D3, I drew word clouds to visualise the extracted topics.
Essentially, each cloud represents a topic. The font size shows the significance of the corresponding term in the topic. Ten colours are used to distinguish the topics.

Figure 1: Ten topics mined from all restaurant reviews. Panels (a) to (j) show topics 0 to 9.

Here are several observations:

1. Cuisines and foods are very popular topics: topics 4, 5, 7 and 9 each more or less emphasize a particular cuisine, e.g. Mexican tacos, pizza, Japanese sushi, and Thai / Chinese food, while topics 0, 6 and 8 emphasize certain foods or drinks.

2. General comments (usually good) about the restaurants: topics 1 and 3 clearly show a good impression, while topic 2 possibly suggests a poor impression in terms of waiting (and then asking for managers).
In particular, topic 2 also suggests that time is a very important factor when customers review a restaurant.

2.2 Topic mining of positive and negative reviews (Task 1.2)

In this task, I explored the topic distribution for subsets of the reviews. In particular, this section presents the observations made on the subset of negative reviews (star rating <= 2) and the subset of positive reviews (star rating >= 4). The topic model is still LDA, with the same configuration as in Task 1.1. Figures 2 and 3 show the results.
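The split into negative and positive subsets can be sketched as below. The record layout (dicts with "stars" and "text" keys) and the function name split_by_stars are assumptions for illustration; only the two thresholds come from the text. Note that three-star reviews fall into neither subset.

```python
def split_by_stars(reviews, negative_max=2, positive_min=4):
    """Split review records into (negative_texts, positive_texts).

    Each record is assumed to be a dict with the hypothetical keys
    "stars" (numeric rating) and "text" (review body).
    """
    negative = [r["text"] for r in reviews if r["stars"] <= negative_max]
    positive = [r["text"] for r in reviews if r["stars"] >= positive_min]
    return negative, positive


# Toy records, made up purely to show the call.
sample = [
    {"stars": 1, "text": "waited an hour"},
    {"stars": 3, "text": "just okay"},
    {"stars": 5, "text": "great place"},
]
negative_texts, positive_texts = split_by_stars(sample)
```

Each returned list can then go through the same topic-mining step as in Task 1.1.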
From Figures 2 and 3, it can be seen that:

1. Both positive and negative reviews still frequently talk about the food or cuisine itself. For instance, in the negative review topics, tacos, crab legs, sushi, carne asada, pizza and hot dogs are among the top terms, while in the positive review topics pizza, sushi, Indian food, chicken, breakfast and oysters are mentioned.

2. Compared with the topics from all reviews, the two subsets show different patterns:
• In the positive review topics, no negative impression appears among the top words of any topic's word cloud. Even so, the phrasing is quite general (“great place”, “amazing”, “definitely coming”) and does not touch on the specific factors that influence customers’ ratings.

• In the negative review topics, however, many specific terms indicate what customers are looking for: “portion size”, “short ribs”, “limited menu”, “service”, “time”, i.e. the amount of food, the menu, the service, and time. At the same time, general comments (e.g. “just okay”) can still be seen.

Figure 2: Ten topics mined from negative restaurant reviews. Panels (a) to (j) show topics 0 to 9.

Figure 3: Ten topics mined from positive restaurant reviews. Panels (a) to (j) show topics 0 to 9.