Loading CSV Data with NumPy Fails with an Error

Question

I have a CSV file whose first few rows look like this:

3Blue1Brown;UCYO_jab_esuFRV4b17AJtAw;Q&A #2 + Net Neutrality Nuance;liL66CApESk;2017-12-14 03:59:29+00:00;141644;4661;107;329;1409.0;43.5607476635514;0.002322724577108808;100.52803406671399
3Blue1Brown;UCYO_jab_esuFRV4b17AJtAw;The hardest problem on the hardest test;OkmNXy7er84;2017-12-08 04:52:24+00:00;13109536;346554;5569;19721;1415.0;62.229125516250676;0.0015043247907477427;9264.689752650176

When I try to load this into a NumPy array, I get errors that I do not understand. I guess it might have to do with the special characters, or maybe the encoding of the CSV data? Here is the code:

from numpy import loadtxt
import numpy as np

datas_path = 'target/youtube_videos.csv'
data = np.genfromtxt(datas_path, delimiter=';', dtype=None, names=True,\
       deletechars="~!@#$%^&*()-=+~\|]}[{';: /?.>,<.", case_sensitive=True)

Here is the error:

Fail to execute line 11:        deletechars="~!@#$%^&*()-=+~\|]}[{';: /?.>,<.", case_sensitive=True)
Traceback (most recent call last):
  File "/tmp/1636087867308-0/zeppelin_python.py", line 158, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 11, in <module>
  File "/home/joesan/.pyenv/versions/3.7.8/lib/python3.7/site-packages/numpy/lib/npyio.py", line 2124, in genfromtxt
    raise ValueError(errmsg)
ValueError: Some errors were detected !
    Line #60 (got 3 columns instead of 13)
    Line #353 (got 3 columns instead of 13)
    Line #720 (got 3 columns instead of 13)
    Line #3008 (got 3 columns instead of 13)
    Line #3077 (got 3 columns instead of 13)
    Line #3129 (got 3 columns instead of 13)
    Line #3154 (got 3 columns instead of 13)
    Line #3163 (got 3 columns instead of 13)
    Line #3175 (got 3 columns instead of 13)
    Line #3290 (got 3 columns instead of 13)
    Line #3300 (got 3 columns instead of 13)
    Line #3310 (got 3 columns instead of 13)
    Line #3316 (got 3 columns instead of 13)
    Line #3321 (got 3 columns instead of 13)
    Line #3328 (got 3 columns instead of 13)
    Line #3334 (got 3 columns instead of 13)
    Line #3340 (got 3 columns instead of 13)
    Line #3361 (got 3 columns instead of 13)
    Line #3366 (got 3 columns instead of 13)
    Line #3367 (got 3 columns instead of 13)
    Line #3375 (got 3 columns instead of 13)
    Line #3385 (got 3 columns instead of 13)
    Line #3397 (got 3 columns instead of 13)
    Line #3407 (got 3 columns instead of 13)
    Line #3433 (got 3 columns instead of 13)
    Line #3444 (got 3 columns instead of 13)
    Line #3450 (got 3 columns instead of 13)
    Line #3452 (got 3 columns instead of 13)
    Line #3482 (got 3 columns instead of 13)
    Line #3511 (got 3 columns instead of 13)
    Line #3522 (got 3 columns instead of 13)
    Line #3531 (got 3 columns instead of 13)
    Line #3536 (got 3 columns instead of 13)

Line #60 is the first record from my CSV file shown above.

EDIT: I managed to fix it like this:

data = np.genfromtxt(datas_path, delimiter=';', dtype=str, comments='%', names=True,\
       deletechars="~!@#$%^&*()-=+~\|]}[{';: /?.>,<.", case_sensitive=True)

But now this line fails:

Adam Savage’s Tested;UCiDJtJKMICpb9B1qf7qjEOA;99% Invisible - The Adam Savage Project - 10/6/20;ClxSdX3ynGQ;2020-10-03 01:45:19+00:00;28334;782;29;87;385.0;26.96551724137931;0.0030705159878591094;73.59480519480519

Unless you have a specific reason to use NumPy, Pandas may be a better option, as it is better suited to parsing strings than NumPy. Running python -c "import pandas as pd; datas_path = 'target/youtube_videos.csv'; df = pd.read_csv(datas_path, sep=';'); print(df)" works flawlessly. Unfortunately, NumPy is brittle when parsing text.
– jrd1, Commented Nov 5, 2021 at 5:46
OK! That sounds good. It works as expected. Could you please post that as an answer?
– joesan, Commented Nov 5, 2021 at 5:51
The # character might be causing problems. By default it is treated as a comment marker. The comments parameter changes that.
– hpaulj, Commented Nov 5, 2021 at 6:06
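
The comment-marker behaviour described in the last comment can be reproduced on a small in-memory sample. The sketch below is illustrative only: the three-column sample and its column names (channel, title, views) are made up for brevity and are not taken from the real 13-column file. It assumes that passing comments=None to np.genfromtxt switches off comment detection, and it sets encoding explicitly so that text columns come back as str.

```python
import io

import numpy as np

# Hypothetical 3-column sample (the real file has 13 columns).
# One title contains '#', the other contains '%'.
SAMPLE = (
    "channel;title;views\n"
    "3Blue1Brown;Q&A #2 + Net Neutrality Nuance;141644\n"
    "Adam Savage's Tested;99% Invisible - The Adam Savage Project;28334\n"
)

# With the default comments='#', everything after '#' is dropped, so the
# first data row is left with 2 fields instead of 3 and genfromtxt raises.
try:
    np.genfromtxt(io.StringIO(SAMPLE), delimiter=";", dtype=None,
                  names=True, encoding="utf-8")
    default_failed = False
except ValueError:
    default_failed = True

# comments=None disables comment handling, so '#' and '%' stay in the data.
data = np.genfromtxt(io.StringIO(SAMPLE), delimiter=";", dtype=None,
                     names=True, comments=None, encoding="utf-8")

print(default_failed)
print(data["title"])
print(data["views"])
```

This would also explain why comments='%' in the question's edit only moved the problem: the title "99% Invisible ..." then gets cut at the % sign instead.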

jrd1 · Accepted Answer · 2021-11-05 06:06:49Z


Unless you have a specific reason to use NumPy, Pandas may be a better option, as it is better suited to parsing strings than NumPy. This snippet shows how to parse the CSV with Pandas, using a semicolon as the delimiter:

import pandas as pd
datas_path = 'target/youtube_videos.csv'
df = pd.read_csv(datas_path, sep=';')
print(df)

This correctly parses the lines that NumPy failed to parse. NumPy is unfortunately a bit fragile when it comes to parsing text. However, because Pandas uses NumPy under the hood, Pandas objects (Series and DataFrames) can easily be adapted for use with NumPy routines.
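
As a rough illustration of going from Pandas back to NumPy, the sketch below parses a small in-memory sample and extracts plain arrays with to_numpy(). The sample and its column names (channel, title, views, likes) are hypothetical stand-ins for the real file, which has 13 columns.

```python
import io

import pandas as pd

# Hypothetical 4-column sample standing in for the real 13-column file.
SAMPLE = (
    "channel;title;views;likes\n"
    "3Blue1Brown;Q&A #2 + Net Neutrality Nuance;141644;4661\n"
    "Adam Savage's Tested;99% Invisible - The Adam Savage Project;28334;782\n"
)

# read_csv does not treat '#' or '%' as comment markers unless asked to.
df = pd.read_csv(io.StringIO(SAMPLE), sep=";")

# A single numeric column becomes a 1-D NumPy array.
views = df["views"].to_numpy()

# Several numeric columns become a 2-D array (rows x columns).
counts = df[["views", "likes"]].to_numpy()

print(views.mean())
print(counts.shape)
```

Selecting only the numeric columns before calling to_numpy() keeps the result a numeric array; converting the whole frame, text columns included, would produce an object array instead.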

Reference:

Original comment.
Pandas documentation on read_csv.

edited Nov 5, 2021 at 6:06

answered Nov 5, 2021 at 5:56

