Pandas Dataframe to nested data structure

Question

I have a data frame with this structure:

>>> df
   ID    Class           Type
0   1     Math       Calculus
1   1     Math        Algebra
2   1  Science        Physics
3   1  History       American
4   2     Math  Factorization
5   2  History       European
6   2  Science      Chemistry
7   2  Science        Biology
8   3     Math    Computation
9   3  Science        Biology

Desired output is a structure that maps the ID to the Class and the Class to the Type for each ID.

for example:

{
1: {Math: [Calculus, Algebra], Science: [Physics], History: [American]}
2: {Math: [Factorization], History: [European], Science: [Chemistry, Biology]}
3: {Math: [Computation], Science: [Biology]}
}

I am able to accomplish this with a for loop but the data set is very large (approximately 30 million rows) so I would like to accomplish this with Pandas)

I was able to get the output for a single ID formatted correctly like this

>>> df.groupby(['ID', 'Class'])['Type'].apply(lambda x: x.to_dict())[1].groupby('Class').apply(lambda x: x.to_list()).to_dict()
{'History': ['American'], 'Math': ['Calculus', 'Algebra'], 'Science': ['Physics']}
>>> df.groupby(['ID', 'Class'])['Type'].apply(lambda x: x.to_dict())[2].groupby('Class').apply(lambda x: x.to_list()).to_dict()
{'History': ['European'], 'Math': ['Factorization'], 'Science': ['Chemistry', 'Biology']}

How can I apply the logic above to all the IDs and also is there an easier way to do this? I think I nested too many groupbys and over complicated the problem but not sure how to do this in a more efficient manner

rpanai · Accepted Answer · 2019-06-27 21:13:13Z

1

IIUC you can try to play from this:

import pandas as pd

txt="""0   1     Math       Calculus
1   1     Math        Algebra
2   1  Science        Physics
3   1  History       American
4   2     Math  Factorization
5   2  History       European
6   2  Science      Chemistry
7   2  Science        Biology
8   3     Math    Computation
9   3  Science        Biology"""

txt = [list(filter(lambda a: a != '', t.split(" ")))[1:]
       for t in txt.split("\n")]

df =  pd.DataFrame(txt, columns=["ID", 'Class', 'Type'])
df["ID"] = df["ID"].astype(int)


out = df.groupby("ID")\
        .apply(lambda x: x.groupby("Class")\
                          .apply(lambda y:y["Type"].tolist()).to_dict())

which returns

ID
1    {'History': ['American'], 'Math': ['Calculus',...
2    {'History': ['European'], 'Math': ['Factorization',...
3    {'Math': ['Computation'], 'Science': ['Biology']}
dtype: object

Now you have access to your data via (as example) out[1]["Math"] which returns ['Calculus', 'Algebra']

edited Jun 27, 2019 at 21:13

answered Jun 27, 2019 at 21:01

rpanai

13.5k3 gold badges48 silver badges65 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

greenteam Over a year ago

ahh interesting! i've never seen nested lambda functions like that. thank you!

rpanai Over a year ago

For a big df you might consider to store a table obtained by df.groupby(["ID", "Class"])["Type"].unique().reset_index()

rpanai Over a year ago

It should be faster to calculate with a big df as you avoid 2 apply.

greenteam Over a year ago

I will try that out. I think the calculation only needs to be done one time in the workflow but I will try timing it and test it. Thanks again!

rpanai Over a year ago

Let me know then.

Collectives™ on Stack Overflow

Pandas Dataframe to nested data structure

1 Answer 1

5 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

5 Comments

Your Answer

Sign up or log in

Post as a guest

Related