Extract User Reviews using Python Pandas


This exercise works with TripAdvisor user reviews data about a particular hotel. The goal is to help the hotel understand the feedback the reviews provide, and what it suggests they should focus on to improve customer experience. In Part I, each reviewer’s ratings of the hotel are extracted along with a summary. The ratings and reviews are saved in separate files that others can use. The raw data is in json files, but other Python data structures like dicts, json objects, and pandas DataFrames and Series will all be used. In Part II, hotel information is extracted from a set of json files and saved into a DataFrame.

The Data

The user reviews data used in Part I of this exercise is in the json file 100506.json. Once loaded into Python, this data takes the form of “nested Python dicts.” That is, the user reviews data values are in Python dictionaries, some of which are contained in other dictionaries. Python dictionaries (“dicts”) are objects that store key/value pairs. The “values” can be various kinds of things, including lists, arrays, and other dicts. The keys need to be unique: there can’t be duplicates.
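As a rough illustration of the nesting described above, the sketch below parses a tiny made-up json string rather than the real 100506.json. The rating names ("Overall", "Service"), the "Name" field, and all values are assumptions for illustration only; the keys "HotelInfo", "Reviews", "Author", "Date", and "Ratings" are the ones named in this exercise.

```python
import json

# Hypothetical miniature of the file's structure; the real fields and values may differ.
raw = """
{
  "HotelInfo": {"Name": "Example Hotel"},
  "Reviews": [
    {"Author": "reviewer_a",
     "Date": "January 1, 2012",
     "Ratings": {"Overall": "4.0", "Service": "5"}}
  ]
}
"""

jsondat = json.loads(raw)            # json text becomes nested dicts and lists

print(type(jsondat))                 # the top level is a dict
print(list(jsondat.keys()))          # its (unique) keys

first_review = jsondat["Reviews"][0]         # a dict stored inside a list
print(first_review["Ratings"]["Overall"])    # a value inside a dict inside that dict
```

For the real file, json.load() on an open file handle would take the place of json.loads() on a string.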

The user reviews data includes some information about the hotel, and a number of reviews of it by people who (we assume) stayed at it.

These user reviews data have been made available by the researchers of the LARA project, http://www.cs.virginia.edu/~hw5x/dataset.html

Objectives

Part I

  1. Extract the reviewer data from the json file into a Pandas DataFrame with reviewers in the rows, and the numerical ratings, review date, and review author name in columns.
  2. Calculate the mean, and the minimum and maximum for each rating.
  3. Save the numeric ratings data as a DataFrame in a pickle file in a shelve DB.
  4. Save the reviewers’ comments as text data indexed by reviewer name. Include with each written review its review date.
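Objectives 2 and 3 could be approached along the following lines. This is only a sketch under assumptions: the ratings table is made up, the rating names are hypothetical, and the shelve key "ratings" and the file name "hotel_reviews" are arbitrary choices. The shelve module pickles each value it stores, so a DataFrame can be saved and read back as-is.

```python
import os
import shelve
import tempfile

import pandas as pd

# Hypothetical numeric ratings, one row per reviewer (not the real data).
ratings = pd.DataFrame(
    {"Overall": [4.0, 2.0, 5.0], "Service": [5.0, 3.0, 4.0]},
    index=["reviewer_a", "reviewer_b", "reviewer_c"],
)

# Objective 2: mean, minimum, and maximum of each rating column.
summary = ratings.agg(["mean", "min", "max"])
print(summary)

# Objective 3: store the DataFrame in a shelve DB (shelve pickles the value).
db_path = os.path.join(tempfile.mkdtemp(), "hotel_reviews")
with shelve.open(db_path) as db:
    db["ratings"] = ratings

# Read it back to confirm the round trip.
with shelve.open(db_path) as db:
    restored = db["ratings"]

print(restored.equals(ratings))
```

A temporary directory is used here only to keep the sketch self-contained; a real run would write to a path that others can find.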

Part II

Process additional json user reviews data files, parsing the “HotelInfo” data in all the json files into a single Pandas DataFrame that is suitable for subsequent analyses.

Working with the 100506.json file (Part I) and the other json files (Part II)

jsondat, the object loaded from the json file, is (or should be) a dictionary of dictionaries. It includes information about the hotel, and also user reviews written by customers that include text and ratings. The user reviews data is in jsondat["Reviews"]. It consists of some number of reviews by people who have visited the hotel. Each user review is a dictionary that includes the reviewer’s name (“Author”), some ratings that are in their own dict (“Ratings”), a review date (“Date”), and possibly some descriptive text. Extract the authors’ names, their ratings, and their review dates from jsondat, and put them into a Pandas DataFrame. The ratings need to be numeric so that they can be used to calculate some descriptive statistics. Save the results so they can be retrieved or shared with others, and include the reviewers’ text comments along with their author names and their review dates.
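One possible way to do the extraction is sketched below. It uses a small made-up stand-in for jsondat, and it assumes that jsondat["Reviews"] can be iterated over as a sequence of review dicts. The rating names and the "Content" key for the review text are assumptions. The ratings are still strings at this point; converting them to numbers is a separate step.

```python
import pandas as pd

# Hypothetical stand-in for the object loaded from the json file.
jsondat = {
    "HotelInfo": {"Name": "Example Hotel"},
    "Reviews": [
        {"Author": "reviewer_a", "Date": "January 1, 2012",
         "Ratings": {"Overall": "4.0", "Service": "5"},
         "Content": "Nice stay."},
        {"Author": "reviewer_b", "Date": "March 15, 2012",
         "Ratings": {"Overall": "2.0"},
         "Content": "Noisy room."},
    ],
}

rows = []
for review in jsondat["Reviews"]:
    # Start each row with the author and date, then add one column per rating.
    row = {"Author": review.get("Author"), "Date": review.get("Date")}
    row.update(review.get("Ratings", {}))
    rows.append(row)

# Ratings missing from a review show up as NaN in that reviewer's row.
reviews_df = pd.DataFrame(rows)
print(reviews_df)
```

Using .get() rather than direct indexing keeps the loop from failing when a review lacks one of the expected keys.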

Part I

Create two data “objects” and save them. One is a pandas DataFrame holding the reviewers’ hotel ratings. The other (type not specified; a DataFrame might suffice) includes the reviewers’ comments about the hotel as text data.
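For the second object, a DataFrame indexed by author name is one option among several. The sketch below assumes the review text lives under a key called "Content", which is a guess, and it works on made-up reviews. The file name "comments.pkl" is also just a placeholder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical reviews; the "Content" key for the text is an assumption.
reviews = [
    {"Author": "reviewer_a", "Date": "January 1, 2012", "Content": "Nice stay."},
    {"Author": "reviewer_b", "Date": "March 15, 2012", "Content": "Noisy room."},
]

# Text comments indexed by reviewer name, each with its review date.
comments = pd.DataFrame(
    {
        "Date": [r.get("Date") for r in reviews],
        "Comment": [r.get("Content", "") for r in reviews],
    },
    index=pd.Index([r.get("Author") for r in reviews], name="Author"),
)

# Save as a pickle file and read it back to check the round trip.
path = os.path.join(tempfile.mkdtemp(), "comments.pkl")
comments.to_pickle(path)
restored = pd.read_pickle(path)

print(restored.loc["reviewer_a", "Comment"])
```

Author names are not guaranteed to be unique, so a lookup by name may return more than one row in the real data.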

Note that the data (the “values”) in the json data is Unicode character data. (That’s what those “u’s” indicate before the character strings.) Therefore, the ratings data will need to be converted to numeric values.
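A conversion along these lines might work. The sketch uses made-up string ratings with hypothetical column names. Note that the “u” prefix is how Python 2 displays Unicode strings; in Python 3 all strings are Unicode and no prefix is shown, but the ratings still arrive as strings and need converting. Passing errors="coerce" to pd.to_numeric turns anything that can't be parsed into NaN, which is a design choice, not a requirement.

```python
import pandas as pd

# Hypothetical ratings as they come out of the json data: strings, some missing.
ratings = pd.DataFrame(
    {"Overall": ["4.0", "2.0", "5.0"], "Service": ["5", "3", None]}
)
print(ratings.dtypes)      # string-typed columns, not numbers

# Convert every column; values that can't be parsed become NaN.
numeric = ratings.apply(pd.to_numeric, errors="coerce")
print(numeric.dtypes)      # numeric columns

# Descriptive statistics now work and skip the missing values.
print(numeric.mean())
```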

Part II

Expand the number of files to include different hotels. Loop over the files in a directory and parse the “HotelInfo” in all the json files into a single DataFrame for use in subsequent analyses.
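The directory loop might look something like the sketch below. The function name load_hotel_info, the added "SourceFile" column, and the sample field names ("Name", "Price") are all hypothetical. Two tiny made-up files are written to a temporary directory so the example can run on its own.

```python
import glob
import json
import os
import tempfile

import pandas as pd


def load_hotel_info(directory):
    """Collect the "HotelInfo" dict from every .json file in a directory."""
    records = []
    for path in sorted(glob.glob(os.path.join(directory, "*.json"))):
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
        info = dict(data.get("HotelInfo", {}))      # copy, so the source is untouched
        info["SourceFile"] = os.path.basename(path)  # remember where the row came from
        records.append(info)
    return pd.DataFrame(records)


# Build a small demo directory with made-up hotel files.
demo_dir = tempfile.mkdtemp()
samples = {
    "1.json": {"HotelInfo": {"Name": "Hotel One", "Price": "$100"}, "Reviews": []},
    "2.json": {"HotelInfo": {"Name": "Hotel Two"}, "Reviews": []},
}
for filename, content in samples.items():
    with open(os.path.join(demo_dir, filename), "w", encoding="utf-8") as fh:
        json.dump(content, fh)

hotels = load_hotel_info(demo_dir)
print(hotels)
```

Because the loop picks up whatever matches *.json, the same code handles two files or two thousand. Fields missing from a given hotel simply appear as NaN in its row.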

IMPORTANT: Save only the hotel info data for each hotel and not any formatting marks or html tags. The tags are residuals from web scraping. They should not be included in the user review DataFrame.
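One simple way to drop leftover tags is a regular expression, sketched below. The helper name strip_tags and the sample string are made up, and a regex like this is only a rough tool: it handles simple scraped fragments, but a proper HTML parser would be safer for messy markup.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")   # anything that looks like <...>


def strip_tags(value):
    """Remove html tags and entities from a string; pass other values through."""
    if not isinstance(value, str):
        return value
    text = TAG_PATTERN.sub(" ", value)          # replace each tag with a space
    text = html.unescape(text)                  # turn &amp; and friends into characters
    text = " ".join(text.split())               # collapse runs of whitespace
    return re.sub(r"\s+([,.;:])", r"\1", text)  # no space before punctuation


# Hypothetical scraped value, not taken from the real files.
sample = '<span class="street-address">1 Example St</span>, <span>Example City</span> &amp; Spa'
print(strip_tags(sample))
```

On a DataFrame column, the same helper can be applied with the Series.map method, for example hotels["Address"].map(strip_tags), where the column name is again only an assumption.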

Deliverables

Provide the following:

Part I

For each step, provide example output that demonstrates results. The commented, syntactically correct code and examples of the results of applying it should provide a full explanation of each step.

Ways to demonstrate the results of creating or transforming data include outputting example records from DataFrames (using the .head() method, for example), and describing the types of a DataFrame’s columns.
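For instance, output like the following could accompany a step. The DataFrame here is made up and its column names are assumptions; the point is only the calls used to show the result.

```python
import pandas as pd

# Hypothetical result of an extraction step.
reviews_df = pd.DataFrame(
    {
        "Author": ["reviewer_a", "reviewer_b", "reviewer_c"],
        "Date": ["January 1, 2012", "March 15, 2012", "April 2, 2012"],
        "Overall": [4.0, 2.0, 5.0],
    }
)

print(reviews_df.head(2))    # the first two example records
print(reviews_df.dtypes)     # the type of each column
reviews_df.info()            # row count, non-null counts, and types in one view
```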

Part II

This should be done like part I, above. Take note of the requirement that the code should work for an arbitrarily large number of json files.

Customers of a hotel speak-winter-2017 CC by Lynd Bacon & Associates, Ltd. DBA Lorma Buena Associates is licensed under CC BY 4.0.



