2

I have a data frame with this structure:

>>> df
   ID    Class           Type
0   1     Math       Calculus
1   1     Math        Algebra
2   1  Science        Physics
3   1  History       American
4   2     Math  Factorization
5   2  History       European
6   2  Science      Chemistry
7   2  Science        Biology
8   3     Math    Computation
9   3  Science        Biology

Desired output is a structure that maps the ID to the Class and the Class to the Type for each ID.

for example:

{
1: {Math: [Calculus, Algebra], Science: [Physics], History: [American]}
2: {Math: [Factorization], History: [European], Science: [Chemistry, Biology]}
3: {Math: [Computation], Science: [Biology]}
}

I am able to accomplish this with a for loop but the data set is very large (approximately 30 million rows) so I would like to accomplish this with Pandas)

I was able to get the output for a single ID formatted correctly like this

>>> df.groupby(['ID', 'Class'])['Type'].apply(lambda x: x.to_dict())[1].groupby('Class').apply(lambda x: x.to_list()).to_dict()
{'History': ['American'], 'Math': ['Calculus', 'Algebra'], 'Science': ['Physics']}
>>> df.groupby(['ID', 'Class'])['Type'].apply(lambda x: x.to_dict())[2].groupby('Class').apply(lambda x: x.to_list()).to_dict()
{'History': ['European'], 'Math': ['Factorization'], 'Science': ['Chemistry', 'Biology']}

How can I apply the logic above to all the IDs and also is there an easier way to do this? I think I nested too many groupbys and over complicated the problem but not sure how to do this in a more efficient manner

1 Answer 1

1

IIUC you can try to play from this:

import pandas as pd

txt="""0   1     Math       Calculus
1   1     Math        Algebra
2   1  Science        Physics
3   1  History       American
4   2     Math  Factorization
5   2  History       European
6   2  Science      Chemistry
7   2  Science        Biology
8   3     Math    Computation
9   3  Science        Biology"""

txt = [list(filter(lambda a: a != '', t.split(" ")))[1:]
       for t in txt.split("\n")]

df =  pd.DataFrame(txt, columns=["ID", 'Class', 'Type'])
df["ID"] = df["ID"].astype(int)


out = df.groupby("ID")\
        .apply(lambda x: x.groupby("Class")\
                          .apply(lambda y:y["Type"].tolist()).to_dict())

which returns

ID
1    {'History': ['American'], 'Math': ['Calculus',...
2    {'History': ['European'], 'Math': ['Factorization',...
3    {'Math': ['Computation'], 'Science': ['Biology']}
dtype: object

Now you have access to your data via (as example) out[1]["Math"] which returns ['Calculus', 'Algebra']

Sign up to request clarification or add additional context in comments.

5 Comments

ahh interesting! i've never seen nested lambda functions like that. thank you!
For a big df you might consider to store a table obtained by df.groupby(["ID", "Class"])["Type"].unique().reset_index()
It should be faster to calculate with a big df as you avoid 2 apply.
I will try that out. I think the calculation only needs to be done one time in the workflow but I will try timing it and test it. Thanks again!
Let me know then.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.