split arrays as string in one column of pandas to multiple columns

Question

I have a data frame in the format below. The description column is stored as strings.

file	description
x	[[array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], dtype=object), array([0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217])]]
y	[[array(['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0'], dtype=object), array([0.28552457, 0.28552457, 0.28552457, 0.28552457, 0.28552457])]]

How can I convert the data frame into the format below?

file	license	score
x	['MIT', 'MIT', 'MIT', 'MIT', 'MIT']	[0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217]
y	['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0']	[0.28552457, 0.28552457, 0.28552457, 0.28552457, 0.28552457]

The above is just an example; the real data frame is very large.

Mahdi F. · Accepted Answer · 2022-07-10 04:18:13Z


Update: if the elements in the column are strings, you can extract the arrays with a regular expression and parse them with ast.literal_eval. (Note: don't use eval; see "Why should exec() and eval() be avoided?")

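import ast
import pandas as pd

# Parse the two bracketed substrings found by the regex into Python lists.
new_cols = lambda x: pd.Series({'licence': ast.literal_eval(x[0]),
                                'score': ast.literal_eval(x[1])})

# 'description' is the string column from the question.
df = df.join(df['description'].str.findall(r'\[[^\[\]]*\]').apply(new_cols)).drop('description', axis=1)
print(df)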

Output:

    file    licence                                              score
0   x       ['MIT', 'MIT', 'MIT', 'MIT', 'MIT']                     [0.71641791, 0.71641791, 0.71641791, 0.6956521...
1   y       ['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', ...    [0.28552457, 0.28552457, 0.28552457, 0.2855245...

How the regex finds the arrays: it matches every substring that starts with [ and ends with ] and contains no [ or ] in between, so only the innermost brackets (the array contents) are returned.

>>> import re
>>> re.findall(r'\[[^\[\]]*\]', "[[np.array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], dtype=object), np.array([0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217])]]",)
["['MIT', 'MIT', 'MIT', 'MIT', 'MIT']",
 '[0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217]']
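
The snippets above assume an existing df. As a minimal, self-contained sketch of the same regex plus ast.literal_eval approach, the following rebuilds the two sample rows from the question and splits them. The names split_description, parts and result are illustrative only, and the column is named license here to match the question's desired output.

```python
import ast
import pandas as pd

# Sample frame rebuilt from the question; 'description' holds plain strings.
df = pd.DataFrame({
    "file": ["x", "y"],
    "description": [
        "[[array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], dtype=object), "
        "array([0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217])]]",
        "[[array(['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0'], dtype=object), "
        "array([0.28552457, 0.28552457, 0.28552457, 0.28552457, 0.28552457])]]",
    ],
})

def split_description(matches):
    # matches is the list of innermost "[...]" substrings found by the regex:
    # the first is the license list, the second is the score list.
    return pd.Series({
        "license": ast.literal_eval(matches[0]),
        "score": ast.literal_eval(matches[1]),
    })

# Find the innermost bracketed substrings, then parse each into a Python list.
parts = df["description"].str.findall(r"\[[^\[\]]*\]").apply(split_description)
result = df.join(parts).drop("description", axis=1)
print(result)
```

Each cell in license and score is then an ordinary Python list rather than a string.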

Old: if the column holds actual arrays rather than strings, you can build the new columns and join them to the original data frame.

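# Each cell is a nested structure [[licences, scores]]; pick out the two arrays.
new_cols = lambda x: pd.Series({'licence': x[0][0], 'score': x[0][1]})
df = df.join(df['description'].apply(new_cols)).drop('description', axis=1)
print(df)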
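
On the earlier note about avoiding eval: a small sketch of how ast.literal_eval behaves may help. It parses Python literals only and rejects anything else, such as a function call, without executing it. That is also why the array(...) wrappers have to be skipped by the regex (or stripped) before parsing. The example strings below are made up for illustration.

```python
import ast

# A plain literal is parsed into the corresponding Python object.
scores = ast.literal_eval("[0.71641791, 0.69565217]")
print(scores)

# A string containing a function call is not a literal, so it is
# rejected with ValueError instead of being executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```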

edited Jul 10, 2022 at 4:18

answered Jul 10, 2022 at 3:18


10 Comments

sudojarvis Over a year ago

Sorry, I forgot to mention that the description is in string format.

BeRT2me Over a year ago

There are (although few) cases where eval may be acceptable, I'd argue that this could be one of those cases, as long as it's clearly known and controlled by OP where the data is coming from~

Mahdi F. Over a year ago

@BeRT2me, with your code, if a user writes a string in this column that removes all files from the OS, your code will remove all files, because you eval that operation.

BeRT2me Over a year ago

@I'mahdi and? If I control the data, it doesn't matter. That's only if there is an outside attack vector possible.

sudojarvis Over a year ago

@I'mahdi thanks for your help and time. Is there any way I can speed this up for 5 million rows?


BeRT2me · Accepted Answer · 2022-07-10 04:15:17Z

Input:

  file                                        description
0    x  [[array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], d...
1    y  [[array(['APSL-1.0', 'APSL-1.0', 'APSL-1.0', '...

Doing:

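import ast

# Strip the numpy repr wrappers so each string becomes a plain nested-list literal,
# then parse it safely with ast.literal_eval.
df.description = (df.description.str.replace('array', '')
                    .str.replace(', dtype=object', '')
                    .apply(ast.literal_eval))
# Each parsed value looks like [[licenses, scores]]; .str[0] takes the inner pair.
df[['license', 'score']] = [(x[0], x[1]) for x in df.description.str[0]]
df = df.drop('description', axis=1)
print(df)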

Output:

  file                                            license                                              score
0    x                          [MIT, MIT, MIT, MIT, MIT]  [0.71641791, 0.71641791, 0.71641791, 0.6956521...
1    y  [APSL-1.0, APSL-1.0, APSL-1.0, APSL-1.0, APSL-...  [0.28552457, 0.28552457, 0.28552457, 0.2855245...
