I need to do a binary transformation (one-hot encoding) of a column whose values are lists of strings.

Can you help me in getting from here:

df = pd.DataFrame({'_id': [1,2,3],
                   'test': [['one', 'two', 'three'], 
                            ['three', 'one'], 
                            ['four', 'one']]})
df

_id  test
 1   [one, two, three]
 2   [three, one]
 3   [four, one]

to:

df_result = pd.DataFrame({'_id': [1,2,3], 
                          'one': [1,1,1], 
                          'two': [1,0,0], 
                          'three': [1,1,0], 
                          'four': [0,0,1]})

df_result[['_id', 'one', 'two', 'three', 'four']]

 _id    one two  three  four
   1    1   1    1      0
   2    1   0    1      0
   3    1   0    0      1

Any help would be very appreciated!

2 Answers

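You can use pop to extract the column, convert each list to a single '|'-separated string with str.join, build the indicator columns with str.get_dummies, and finally join them back to the original DataFrame: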

df = df.join(df.pop('test').str.join('|').str.get_dummies())
print (df)
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0
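
To make the one-liner above easier to follow, here is a sketch that splits it into its individual steps. The intermediate names (lists, joined, dummies, result) are only for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'_id': [1, 2, 3],
                   'test': [['one', 'two', 'three'],
                            ['three', 'one'],
                            ['four', 'one']]})

# Step 1: remove the list column from df and keep it as a Series
lists = df.pop('test')

# Step 2: join each list into one '|'-separated string, e.g. 'one|two|three'
joined = lists.str.join('|')

# Step 3: split on '|' (the default separator) and build 0/1 indicator columns
dummies = joined.str.get_dummies()

# Step 4: attach the indicator columns to the remaining columns by index
result = df.join(dummies)
print(result)
```

This relies on '|' never appearing inside a label; if it can, another separator would have to be passed to both str.join and str.get_dummies.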

Instead of pop, it is possible to use drop and select the column directly:

df = df.drop('test', axis=1).join(df['test'].str.join('|').str.get_dummies())
print (df)
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0

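Solution with a new DataFrame (dtype=int keeps the 0/1 output in recent pandas versions, and the duplicate labels are merged on the transpose because groupby(..., axis=1) is deprecated there):

df1 = pd.get_dummies(pd.DataFrame(df.pop('test').values.tolist()), prefix='', prefix_sep='', dtype=int)
# merge columns that share a label by taking the maximum per label
df = df.join(df1.T.groupby(level=0).max().T)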
print (df)
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0
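
The reason for the groupby step may not be obvious, so here is a small sketch of the intermediate frames. It assumes the same three lists as in the question. The names wide, dummies and collapsed are made up for this illustration, and the labels are grouped on the transpose rather than with axis=1.

```python
import pandas as pd

lists = pd.Series([['one', 'two', 'three'], ['three', 'one'], ['four', 'one']])

# One column per list position; shorter lists are padded with missing values
wide = pd.DataFrame(lists.tolist())

# Each positional column gets its own dummies, so labels such as 'one'
# show up more than once among the column names
dummies = pd.get_dummies(wide, prefix='', prefix_sep='', dtype=int)
print(dummies.columns.tolist())

# Collapse repeated labels: a label is present if any of its columns is 1
collapsed = dummies.T.groupby(level=0).max().T
print(collapsed)
```

Taking the maximum works because every column holds only 0 or 1, so the result is 1 exactly when the label occurs at any position in the list.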

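I also tried a solution that converts the lists to strings with astype, but some cleaning is necessary (regex=True is required in recent pandas versions for the pattern to be treated as a regular expression):

df1 = df.pop('test').astype(str).str.strip("'[]").str.replace(r"',\s+'", '|', regex=True).str.get_dummies()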
df = df.join(df1)
print (df)
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0
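
As a rough sketch of what the cleaning in the astype variant does, the chain can be inspected one step at a time. The variable names are hypothetical; the comments show the first row after each step.

```python
import pandas as pd

s = pd.Series([['one', 'two', 'three'], ['three', 'one'], ['four', 'one']])

# str() of a list keeps brackets and quotes: "['one', 'two', 'three']"
as_text = s.astype(str)

# Strip the leading "['" and trailing "']": "one', 'two', 'three"
trimmed = as_text.str.strip("'[]")

# Replace the "', '" between items with the separator: "one|two|three"
piped = trimmed.str.replace(r"',\s+'", '|', regex=True)

print(piped.str.get_dummies())
```

This approach depends on the exact text that str() produces for a list, so it would break for labels that contain quotes; the str.join variant avoids that.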

We can use the sklearn.preprocessing.MultiLabelBinarizer class:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('test')),
                          columns=mlb.classes_,
                          index=df.index))

Result:

In [15]: df
Out[15]:
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0
