1

I have a data frame in the format below. The description column holds strings.

file description
x [[array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], dtype=object), array([0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217])]]
y [[array(['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0'], dtype=object), array([0.28552457, 0.28552457, 0.28552457, 0.28552457, 0.28552457])]]

How can I convert the data frame into the format below?

file license score
x ['MIT', 'MIT', 'MIT', 'MIT', 'MIT'] [0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217]
y ['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0'] [0.28552457, 0.28552457, 0.28552457, 0.28552457, 0.28552457]

The above is just an example; the actual data frame is very large.

2 Answers

1

Update: if the elements in the column are strings, you can extract the arrays with a regular expression and parse each one with ast.literal_eval. (Note: don't use eval; see "Why should exec() and eval() be avoided?")

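import ast
import pandas as pd

# x is the list of two matched strings: the licence list and the score list.
new_cols = lambda x: pd.Series({'licence': ast.literal_eval(x[0]),
                                'score': ast.literal_eval(x[1])})

# 'description' is the string column from the question.
df = df.join(df['description'].str.findall(r'\[[^\[\]]*\]').apply(new_cols)).drop('description', axis=1)
print(df)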

Output:

    file    licence                                              score
0   x       ['MIT', 'MIT', 'MIT', 'MIT', 'MIT']                     [0.71641791, 0.71641791, 0.71641791, 0.6956521...
1   y       ['APSL-1.0', 'APSL-1.0', 'APSL-1.0', 'APSL-1.0', ...    [0.28552457, 0.28552457, 0.28552457, 0.2855245...
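
On the note about avoiding eval: the sketch below illustrates how ast.literal_eval behaves on the kinds of strings involved here. The helper name safe_parse and the shortened sample string are assumptions made for illustration. It also suggests why the raw description cannot be passed to literal_eval directly: array(...) is a call, not a literal, so the inner lists have to be extracted first.

```python
import ast

# Shortened, hypothetical sample in the same shape as the question's strings.
raw = "[[array(['MIT', 'MIT'], dtype=object), array([0.71641791, 0.69565217])]]"


def safe_parse(text):
    """Return the parsed literal, or None if text is not a pure literal."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None


# A plain list literal parses fine.
licences = safe_parse("['MIT', 'MIT']")

# Anything containing a function call is rejected instead of being executed.
rejected_call = safe_parse("__import__('os').getcwd()")
rejected_raw = safe_parse(raw)
```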

How the regex finds the arrays: it matches every substring that starts with [ and ends with ], with no other [ or ] in between, so it picks out each innermost list.

>>> import re
>>> re.findall(r'\[[^\[\]]*\]', "[[np.array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], dtype=object), np.array([0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217])]]",)
["['MIT', 'MIT', 'MIT', 'MIT', 'MIT']",
 '[0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217]']
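
The two steps above (find the innermost lists, then parse each one) can be sketched as a single helper that needs only the standard library. The names INNER_LIST and parse_description are hypothetical, and the sketch assumes every string contains exactly two bracketed lists and that no licence name itself contains [ or ].

```python
import ast
import re

# Innermost bracketed groups: '[' ... ']' with no brackets in between.
INNER_LIST = re.compile(r'\[[^\[\]]*\]')


def parse_description(text):
    """Return (licences, scores) parsed from one description string."""
    # Unpacking fails loudly if the string does not hold exactly two lists.
    licence_str, score_str = INNER_LIST.findall(text)
    return ast.literal_eval(licence_str), ast.literal_eval(score_str)


sample = ("[[array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], dtype=object), "
          "array([0.71641791, 0.71641791, 0.71641791, 0.69565217, 0.69565217])]]")
licences, scores = parse_description(sample)
```

On a data frame, a helper like this could be mapped over the column with a list comprehension, for example parsed = [parse_description(s) for s in df['description']], which avoids building a pd.Series per row. Whether that is noticeably faster on a very large frame would need to be measured.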

Old answer (for a column that holds actual nested arrays rather than strings): you can create the new columns and then join them to the original dataframe.

# Each value is [[licences, scores]], so x[0] is the inner pair.
new_cols = lambda x: pd.Series({'licence': x[0][0], 'score': x[0][1]})
df = df.join(df['description'].apply(new_cols)).drop('description', axis=1)
print(df)

10 Comments

Sorry, I forgot to mention that the description is in string format.
There are (although few) cases where eval may be acceptable, I'd argue that this could be one of those cases, as long as it's clearly known and controlled by OP where the data is coming from~
@BeRT2me, with your code, if a user puts a string in this column that removes all files from the OS, your code will remove all files, because you eval that operation.
@I'mahdi and? If I control the data, it doesn't matter. That's only if there is an outside attack vector possible.
@I'mahdi thanks for your help and time. Is there any way I can speed this up for 5 million rows?
0

Input:

  file                                        description
0    x  [[array(['MIT', 'MIT', 'MIT', 'MIT', 'MIT'], d...
1    y  [[array(['APSL-1.0', 'APSL-1.0', 'APSL-1.0', '...

Doing:

import ast

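# Strip the array(...) wrappers so only plain list literals remain, then parse.
df.description = (df.description.str.replace('array', '', regex=False)
                    .str.replace(', dtype=object', '', regex=False)
                    .apply(ast.literal_eval))
# Each parsed value is [[licenses, scores]]; .str[0] takes the inner pair.
df[['license', 'score']] = [(x[0], x[1]) for x in df.description.str[0]]
df = df.drop('description', axis=1)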
print(df)

Output:

  file                                            license                                              score
0    x                          [MIT, MIT, MIT, MIT, MIT]  [0.71641791, 0.71641791, 0.71641791, 0.6956521...
1    y  [APSL-1.0, APSL-1.0, APSL-1.0, APSL-1.0, APSL-...  [0.28552457, 0.28552457, 0.28552457, 0.2855245...
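
To see what the string replacement leaves behind for one value, here is a minimal standard-library sketch of the same idea. The helper name strip_array_wrappers and the shortened sample are assumptions for illustration. Note that a licence name containing the text "array" would be altered by this approach.

```python
import ast


def strip_array_wrappers(text):
    """Remove the 'array' prefix and the dtype argument, leaving list literals."""
    return text.replace('array', '').replace(', dtype=object', '')


# Shortened, hypothetical sample in the same shape as the question's strings.
sample = "[[array(['APSL-1.0', 'APSL-1.0'], dtype=object), array([0.28552457, 0.28552457])]]"

# The leftover parentheses are plain grouping, so literal_eval accepts them.
cleaned = strip_array_wrappers(sample)
nested = ast.literal_eval(cleaned)

# nested has the shape [[licences, scores]]; take the inner pair.
licences, scores = nested[0]
```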
