I am new to Python and having trouble with something that seems conceptually very simple. I've read a number of SO posts but still can't solve my problem(s).

I have a function to convert Amazon reviews to JSON format. Each review becomes a single JSON object. I would like to compile all reviews in a single dataframe, with the JSON keys as columns and each review in a row.

There are a large number of reviews, each formatted like so:

{
"product/productId": "B00006HAXW",
"product/title": "Winnie the Pooh",
"product/price": "unknown",
"review/userId": "A1RSDE90N6RSZF",
"review/profileName": "piglet",
"review/helpfulness": "9/9",
"review/score": "5.0",
"review/time": "1042502400",
"review/summary": "Love this book", 
"review/text" : "Exciting stories about highly intelligent creatures, very inspiring!"
}

How can I compile all reviews into a pandas dataframe? I'm having two separate problems:

  1. How do I compile all reviews in one object? Currently, the output is generated like so:

    for e in parse("reviews.txt.gz"):
        print(json.dumps(e))
    

I tried creating an empty list and using append:

    for e in parse("reviews.txt.gz"):
        revs = []
        revs = revs.append(json.dumps(e))

but that does not work - print revs prints out

None
None
None

  2. When I use pd.read_json on a single review formatted as above, it raises "If using all scalar values, you must pass an index". Does this mean my data is not in valid JSON format?
  • Looks like you're re-initializing an empty list with revs = [] on every iteration of the loop, then re-assigning revs to the output of a list.append call (which is None; list.append modifies the original list in place). Additionally, you likely don't need the json.dumps(e) call: you want a list of Python objects, not JSON strings. Commented Mar 13, 2015 at 20:54
  • Is the bit parse("reviews.txt.gz") working? is that what produces the example json you posted? Commented Mar 13, 2015 at 20:55
  • @cphlewis Yes, the sample json formatted review is what is produced by parse("file"). Commented Mar 13, 2015 at 21:05
  • @jeff of course, must initialize list outside loop. thank you. Commented Mar 13, 2015 at 21:16

1 Answer

  1. There is no need to call json.dumps() on the data as this returns a string and you can pass python objects to Pandas.

Your for loop should look like

revs = []  # create the list once, before the loop
for e in parse("reviews.txt.gz"):
    revs.append(e)  # append mutates revs in place and returns None

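But unless parse is a generator (i.e. uses the yield keyword), you can just set revs = parse("reviews.txt.gz"). If it is a generator, revs = list(parse("reviews.txt.gz")) builds the same list without an explicit loop.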
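As a minimal sketch of the two points above: parse_reviews below is a hypothetical stand-in for parse (the real function is not shown in the question), and the second review's score is made up for illustration.

```python
def parse_reviews():
    """Hypothetical stand-in for parse(): a generator yielding one dict per review."""
    yield {"product/productId": "B00006HAXW", "review/score": "5.0"}
    yield {"product/productId": "B00006HAXW", "review/score": "4.0"}  # made-up second review


# list.append changes the list in place and returns None,
# so assigning its result back to the list loses the list.
scratch = []
result = scratch.append({"review/score": "5.0"})  # result is None

# Collect with a loop: initialize once, outside the loop.
revs = []
for e in parse_reviews():
    revs.append(e)

# Shortcut for a generator: let list() consume it.
revs_from_list = list(parse_reviews())
```

Both revs and revs_from_list end up as lists of dictionaries, which is the shape pd.DataFrame accepts directly.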

  2. pd.read_json attempts to parse the JSON as a DataFrame. A single object whose values are all scalars gives it only column names and no row index, so it throws an error: it expects the data to be doubly indexed (by row and by column). The JSON itself is valid.
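A small sketch of that behaviour, using pd.DataFrame on an already-decoded dictionary rather than pd.read_json (an assumption made to keep the example self-contained; the review below is a shortened copy of the one in the question). Wrapping the single dictionary in a list gives pandas one row to index:

```python
import pandas as pd

# Shortened copy of the sample review; every value is a scalar (a string).
review = {
    "product/productId": "B00006HAXW",
    "review/score": "5.0",
    "review/summary": "Love this book",
}

# A dict of scalars has column names but no rows, so pandas cannot infer an index.
try:
    pd.DataFrame(review)
    raised = False
except ValueError:
    raised = True

# A list containing one dict is read as one row, with the keys as columns.
df_one = pd.DataFrame([review])
```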

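So if revs is now a list of strings (i.e. your parse function returns JSON representations of the data), note that pd.read_json expects a single JSON document rather than a Python list. Decode each string first (this needs import json) and call

df = pd.DataFrame([json.loads(s) for s in revs])

Otherwise, if revs is now a list of dictionaries (i.e. your parse function has already interpreted the JSON and returns dictionaries of the data), you can call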

df = pd.DataFrame(revs)
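For illustration, here is a sketch of the list-of-dictionaries case with a trimmed-down set of keys. The second row is made up, and the type conversions are an optional extra step that the question does not ask for: every value arrives as a string, so numeric and time columns may need converting before analysis.

```python
import pandas as pd

# Trimmed-down reviews; the second one is made up for illustration.
revs = [
    {"product/productId": "B00006HAXW", "review/score": "5.0", "review/time": "1042502400"},
    {"product/productId": "B00006HAXW", "review/score": "3.0", "review/time": "1042588800"},
]

# One row per review, one column per key.
df = pd.DataFrame(revs)

# Optional: convert string columns to numeric and datetime types.
df["review/score"] = pd.to_numeric(df["review/score"])
df["review/time"] = pd.to_datetime(pd.to_numeric(df["review/time"]), unit="s")
```

The unit="s" argument assumes review/time holds Unix timestamps in seconds, which the ten-digit sample value suggests but the question does not state.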

1 Comment

Thanks @Alex !! Parse is a generator. Removing the call to json.dumps and initializing revs outside the loop solved problem 1. And as that produced a list of dictionaries, calling pd.DataFrame solved problem 2.
