python - binary encoding of column containing multiple terms

Question

I need to binary-encode a column whose values are lists of strings, producing one 0/1 indicator column per distinct term.

Can you help me get from this:

import pandas as pd

df = pd.DataFrame({'_id': [1,2,3],
                   'test': [['one', 'two', 'three'], 
                            ['three', 'one'], 
                            ['four', 'one']]})
df

_id  test
 1   [one, two, three]
 2   [three, one]
 3   [four, one]

to:

df_result = pd.DataFrame({'_id': [1,2,3], 
                          'one': [1,1,1], 
                          'two': [1,0,0], 
                          'three': [1,1,0], 
                          'four': [0,0,1]})

df_result[['_id', 'one', 'two', 'three', 'four']]

 _id  one  two  three  four
   1    1    1      1     0
   2    1    0      1     0
   3    1    0      0     1

Any help would be much appreciated!

jezrael · Accepted Answer · 2017-07-28 09:07:59Z

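You can use pop to extract the column, str.join to turn each list into a single '|'-separated string, str.get_dummies to build the indicator columns, and finally join to attach them to the original DataFrame: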

df = df.join(df.pop('test').str.join('|').str.get_dummies())
print (df)
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0
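
To make the chain above easier to follow, here is a rough step-by-step sketch of the same idea. The names joined, dummies and result are only illustrative, and the last step (reordering the columns to match the question) is an optional extra rather than part of the original one-liner:

```python
import pandas as pd

df = pd.DataFrame({'_id': [1, 2, 3],
                   'test': [['one', 'two', 'three'],
                            ['three', 'one'],
                            ['four', 'one']]})

# Step 1: join each list into a single '|'-delimited string,
# e.g. ['one', 'two', 'three'] -> 'one|two|three'.
joined = df['test'].str.join('|')

# Step 2: str.get_dummies splits on '|' (its default separator)
# and creates one 0/1 indicator column per distinct term.
dummies = joined.str.get_dummies()

# Step 3: attach the indicator columns to the remaining columns.
result = df.drop(columns='test').join(dummies)

# Optional: get_dummies returns the columns in sorted order, so
# select them explicitly to get the order shown in the question.
result = result[['_id', 'one', 'two', 'three', 'four']]
print(result)
```

Note that this approach assumes no term contains the '|' character itself; if that can happen, a different separator would have to be passed to both str.join and str.get_dummies.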

Instead of pop, you can use drop:

df = df.drop('test', axis=1).join(df['test'].str.join('|').str.get_dummies())
print (df)
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0

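A solution that builds a new DataFrame:

# One column per list position, then one indicator column per (position, term) pair
df1 = pd.get_dummies(pd.DataFrame(df.pop('test').values.tolist()), prefix='', prefix_sep='', dtype=int)
# Merge duplicate term columns; transposing avoids groupby(axis=1), which is deprecated in recent pandas
df = df.join(df1.T.groupby(level=0).max().T)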
print (df)
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0

I also tried a solution that converts the lists to strings with astype, but some cleaning is necessary:

df1=df.pop('test').astype(str).str.strip("'[]").str.replace(r"',\s+'", '|', regex=True).str.get_dummies()
df = df.join(df1)
print (df)
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0
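
To show why the cleaning is needed, here is a small sketch of the intermediate strings produced by the astype approach. The variable names (s, as_text, stripped, piped) are only for illustration:

```python
import pandas as pd

s = pd.Series([['one', 'two', 'three'],
               ['three', 'one'],
               ['four', 'one']])

# astype(str) yields the textual form of each list,
# including brackets and quotes: "['one', 'two', 'three']".
as_text = s.astype(str)

# Strip the leading "['" and trailing "']" characters,
# leaving "one', 'two', 'three".
stripped = as_text.str.strip("'[]")

# Replace each "', '" separator with '|' so that
# str.get_dummies can split on its default separator.
piped = stripped.str.replace(r"',\s+'", '|', regex=True)

dummies = piped.str.get_dummies()
print(dummies)
```

Because this relies on how Python prints a list, it is fragile: a term containing an apostrophe, for example, is printed with double quotes and would not be cleaned correctly, so the str.join variant is probably the safer choice.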

MaxU - stand with Ukraine · Accepted Answer · 2017-07-28 09:13:36Z

We can use the sklearn.preprocessing.MultiLabelBinarizer class:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('test')),
                          columns=mlb.classes_,
                          index=df.index))

Result:

In [15]: df
Out[15]:
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0
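
If the column order from the question matters, one possible variation is to pass the terms explicitly through the classes parameter of MultiLabelBinarizer, which fixes the order of the output columns. This is a sketch that assumes the full set of terms is known in advance; the names encoded and result are only illustrative:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'_id': [1, 2, 3],
                   'test': [['one', 'two', 'three'],
                            ['three', 'one'],
                            ['four', 'one']]})

# Assumption: all terms are known up front. Passing them via
# `classes` makes the encoder emit columns in exactly this order
# instead of the default sorted order.
mlb = MultiLabelBinarizer(classes=['one', 'two', 'three', 'four'])

encoded = pd.DataFrame(mlb.fit_transform(df['test']),
                       columns=mlb.classes_,
                       index=df.index)

# Replace the list column with the indicator columns.
result = df.drop(columns='test').join(encoded)
print(result)
```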
