convert multiple json objects to pandas dataframe

Question

I am new to python and having trouble with something that seems conceptually very simple. I've read a number of SO posts but still can't solve my problem(s).

I have a function to convert amazon reviews to json format. Each review becomes a single json object. I would like to compile all reviews in a single dataframe, with the json keys as columns and each review in a row.

There are a large number of reviews, each formatted like so:

{
"product/productId": "B00006HAXW",
"product/title": "Winnie the Pooh",
"product/price": "unknown",
"review/userId": "A1RSDE90N6RSZF",
"review/profileName": "piglet",
"review/helpfulness": "9/9",
"review/score": "5.0",
"review/time": "1042502400",
"review/summary": "Love this book", 
"review/text" : "Exciting stories about highly intelligent creatures, very inspiring!"
}

How can I compile all reviews into a pandas dataframe? I'm having two separate problems:

How do I compile all reviews in one object? Currently, the output is generated like so:
```
for e in parse("reviews.txt.gz"):
    print json.dumps(e)
```

I tried creating an empty list and using append:

    for e in parse("reviews.txt.gz"):
        revs = []
        revs = revs.append(json.dumps(e))

but that does not work - print revs prints out

None
None
None

When I use pd.read_json on a single review formatted as above, it returns "If using all scalar values, you must pass an index". Does this mean I do not have valid JSON data?

Looks like you're re-initializing an empty list with revs = [] on every iteration, then re-assigning revs to the return value of list.append, which is None (list.append modifies the original list in place). Additionally, you likely don't need the json.dumps(e) call: you want a list of Python objects, not JSON strings.
– Jeff, Commented Mar 13, 2015 at 20:54
Is the parse("reviews.txt.gz") bit working? Is that what produces the example JSON you posted?
– cphlewis, Commented Mar 13, 2015 at 20:55
@cphlewis Yes, the sample JSON-formatted review is what is produced by parse("file").
– saraw, Commented Mar 13, 2015 at 21:05
@jeff Of course, I must initialize the list outside the loop. Thank you.
– saraw, Commented Mar 13, 2015 at 21:16
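
To make the point about list.append concrete, here is a minimal sketch. The one-key dictionary is a made-up stand-in for a parsed review, not data from the question.

```python
# list.append mutates the list in place and returns None,
# so assigning its result back to the list variable loses the list.
revs = []
result = revs.append({"review/score": "5.0"})  # illustrative record

print(result)  # the return value of append
print(revs)    # the list itself now holds the record
```

Writing revs = revs.append(...) inside the loop therefore leaves revs set to None, which matches the output described in the question.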

Alex · Accepted Answer · 2015-03-13 21:08:59Z


There is no need to call json.dumps() on the data as this returns a string and you can pass python objects to Pandas.

Your for loop should look like

revs = []
for e in parse("reviews.txt.gz"):
    revs.append(e)  # append works in place and returns None, so do not re-assign revs

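But unless parse is a generator (i.e. uses the yield keyword), you can just set revs = parse("reviews.txt.gz"). If it is a generator, revs = list(parse("reviews.txt.gz")) collects the same items without an explicit loop.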
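
As a rough illustration of the generator case, the sketch below uses a hypothetical stand-in for parse that yields two made-up dictionaries; the real function from the question reads a gzip file and is not shown here.

```python
def parse(path):
    # Hypothetical stand-in for the asker's parse(); it ignores `path`
    # and yields made-up records instead of reading a file.
    yield {"review/score": "5.0"}
    yield {"review/score": "3.0"}

# A generator produces items lazily; list() consumes it and keeps every item.
revs = list(parse("reviews.txt.gz"))
print(len(revs))
```

This is equivalent to the explicit loop with revs.append(e), just shorter.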

pd.read_json attempts to parse the JSON as a DataFrame. If you give it a single flat object whose values are all scalars, it throws this error because there is nothing it can use as a row index; it expects the data to be doubly indexed (by column and by row).
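
The same "all scalar values" error can be reproduced with the DataFrame constructor, which may make the cause easier to see. This is a sketch using two fields from the sample review; wrapping the dictionary in a list is one common way to get a one-row frame.

```python
import pandas as pd

# Two fields taken from the sample review in the question.
review = {"product/productId": "B00006HAXW", "review/score": "5.0"}

try:
    # A dict of plain scalars gives pandas no row index to work with.
    pd.DataFrame(review)
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# Wrapping the dict in a list makes it a single record, i.e. one row.
one_row = pd.DataFrame([review])
print(one_row.shape)
```

So the JSON itself is valid; the error is about the shape of the data, not its syntax.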

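So if revs is now a list of strings (i.e. your parse function returns JSON representations of the data), note that pd.read_json does not take a list of strings. Decode each string with json.loads first and build the frame from the resulting dictionaries:

df = pd.DataFrame([json.loads(s) for s in revs])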
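
For the list-of-strings case, one approach that does not depend on how pd.read_json treats its input is to decode each string with json.loads and pass the resulting dictionaries to the DataFrame constructor. The two JSON strings below are illustrative; the second score is made up.

```python
import json

import pandas as pd

# Illustrative input: each element is the JSON text of one review.
revs = [
    '{"product/productId": "B00006HAXW", "review/score": "5.0"}',
    '{"product/productId": "B00006HAXW", "review/score": "4.0"}',
]

# Decode each JSON string into a dict, then build one row per dict.
records = [json.loads(s) for s in revs]
df = pd.DataFrame(records)
print(df)
```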

Otherwise, if revs is now a list of dictionaries (i.e. your parse function has already interpreted the JSON and returns dictionaries of the data), you can call

df = pd.DataFrame(revs)
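
Putting the pieces together for the list-of-dictionaries case, a minimal end-to-end sketch might look like this. The parse function is again a hypothetical stand-in that yields a shortened version of the sample review, and the float conversion at the end is an optional extra, since every value in the sample is a string.

```python
import pandas as pd

def parse(path):
    # Hypothetical stand-in for the asker's parse(); yields one shortened
    # copy of the sample review instead of reading `path`.
    yield {
        "product/productId": "B00006HAXW",
        "product/title": "Winnie the Pooh",
        "review/score": "5.0",
        "review/summary": "Love this book",
    }

revs = list(parse("reviews.txt.gz"))

# Dictionary keys become columns; each dictionary becomes one row.
df = pd.DataFrame(revs)

# Optional: scores arrive as strings, so convert them if numbers are needed.
df["review/score"] = df["review/score"].astype(float)
print(df.dtypes)
```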


saraw Over a year ago

Thanks @Alex !! Parse is a generator. Removing the call to json.dumps and initializing revs outside the loop solved problem 1. And as that produced a list of dictionaries, calling pd.DataFrame solved problem 2.
