
I have a CSV file whose first few rows look like this:

3Blue1Brown;UCYO_jab_esuFRV4b17AJtAw;Q&A #2 + Net Neutrality Nuance;liL66CApESk;2017-12-14 03:59:29+00:00;141644;4661;107;329;1409.0;43.5607476635514;0.002322724577108808;100.52803406671399
3Blue1Brown;UCYO_jab_esuFRV4b17AJtAw;The hardest problem on the hardest test;OkmNXy7er84;2017-12-08 04:52:24+00:00;13109536;346554;5569;19721;1415.0;62.229125516250676;0.0015043247907477427;9264.689752650176

When I try to load this into a NumPy array, I get errors that I do not understand. I guess it might have to do with the special characters, or maybe with the encoding of the CSV data. Here is the code:

from numpy import loadtxt
import numpy as np

datas_path = 'target/youtube_videos.csv'
data = np.genfromtxt(datas_path, delimiter=';', dtype=None, names=True,\
       deletechars="~!@#$%^&*()-=+~\|]}[{';: /?.>,<.", case_sensitive=True)

Here is the error:

Fail to execute line 11:        deletechars="~!@#$%^&*()-=+~\|]}[{';: /?.>,<.", case_sensitive=True)
Traceback (most recent call last):
  File "/tmp/1636087867308-0/zeppelin_python.py", line 158, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 11, in <module>
  File "/home/joesan/.pyenv/versions/3.7.8/lib/python3.7/site-packages/numpy/lib/npyio.py", line 2124, in genfromtxt
    raise ValueError(errmsg)
ValueError: Some errors were detected !
    Line #60 (got 3 columns instead of 13)
    Line #353 (got 3 columns instead of 13)
    Line #720 (got 3 columns instead of 13)
    Line #3008 (got 3 columns instead of 13)
    Line #3077 (got 3 columns instead of 13)
    Line #3129 (got 3 columns instead of 13)
    Line #3154 (got 3 columns instead of 13)
    Line #3163 (got 3 columns instead of 13)
    Line #3175 (got 3 columns instead of 13)
    Line #3290 (got 3 columns instead of 13)
    Line #3300 (got 3 columns instead of 13)
    Line #3310 (got 3 columns instead of 13)
    Line #3316 (got 3 columns instead of 13)
    Line #3321 (got 3 columns instead of 13)
    Line #3328 (got 3 columns instead of 13)
    Line #3334 (got 3 columns instead of 13)
    Line #3340 (got 3 columns instead of 13)
    Line #3361 (got 3 columns instead of 13)
    Line #3366 (got 3 columns instead of 13)
    Line #3367 (got 3 columns instead of 13)
    Line #3375 (got 3 columns instead of 13)
    Line #3385 (got 3 columns instead of 13)
    Line #3397 (got 3 columns instead of 13)
    Line #3407 (got 3 columns instead of 13)
    Line #3433 (got 3 columns instead of 13)
    Line #3444 (got 3 columns instead of 13)
    Line #3450 (got 3 columns instead of 13)
    Line #3452 (got 3 columns instead of 13)
    Line #3482 (got 3 columns instead of 13)
    Line #3511 (got 3 columns instead of 13)
    Line #3522 (got 3 columns instead of 13)
    Line #3531 (got 3 columns instead of 13)
    Line #3536 (got 3 columns instead of 13)

Line #60 is the first record of the CSV sample shown above.

EDIT: I managed to fix it like this:

data = np.genfromtxt(datas_path, delimiter=';', dtype=str, comments='%', names=True,\
       deletechars="~!@#$%^&*()-=+~\|]}[{';: /?.>,<.", case_sensitive=True)

But now this line fails:

Adam Savage’s Tested;UCiDJtJKMICpb9B1qf7qjEOA;99% Invisible - The Adam Savage Project - 10/6/20;ClxSdX3ynGQ;2020-10-03 01:45:19+00:00;28334;782;29;87;385.0;26.96551724137931;0.0030705159878591094;73.59480519480519
  • Unless you have a specific reason to use NumPy, pandas may be a better option, as it is better suited to parsing strings. This works flawlessly: python -c "import pandas as pd; datas_path = 'target/youtube_videos.csv'; df = pd.read_csv(datas_path, sep=';'); print(df)". NumPy is brittle when parsing text, unfortunately. Commented Nov 5, 2021 at 5:46
  • OK, that sounds good. It works as expected. Could you please post that as an answer? Commented Nov 5, 2021 at 5:51
  • I'm glad to know it worked! :-) And, done! Commented Nov 5, 2021 at 5:56
  • The # character might be giving problems. By default it is treated as the comment marker. There is a parameter to change that. Commented Nov 5, 2021 at 6:06

1 Answer

Unless you have a specific reason to use NumPy, pandas may be a better option, since it is better suited to parsing strings. This snippet parses the CSV with pandas, using a semicolon as the delimiter:

import pandas as pd
datas_path = 'target/youtube_videos.csv'
df = pd.read_csv(datas_path, sep=';')
print(df)

This correctly parses the lines that NumPy was rejecting. NumPy is unfortunately a bit fragile when it comes to parsing text. However, because pandas is built on NumPy, pandas objects (Series and DataFrames) are easy to adapt for use with NumPy routines.
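
As a rough illustration of that last point, here is a minimal sketch of moving from a DataFrame to NumPy arrays. The column names (channel, title, views) and the in-memory sample are assumptions made for the example; the real file's header is not shown in the question.

```python
import io

import pandas as pd

# Hypothetical, trimmed-down sample: the column names are assumed,
# and only three of the thirteen fields are kept.
csv_text = (
    "channel;title;views\n"
    "3Blue1Brown;Q&A #2 + Net Neutrality Nuance;141644\n"
    "Adam Savage’s Tested;99% Invisible - The Adam Savage Project - 10/6/20;28334\n"
)

# read_csv has no comment character by default, so '#' and '%' stay in the titles.
df = pd.read_csv(io.StringIO(csv_text), sep=";")

# A single column as a plain NumPy array.
views = df["views"].to_numpy()

# The whole frame as a NumPy record array with named fields.
records = df.to_records(index=False)

print(views)
print(records.dtype.names)
```

to_numpy() is usually enough for numeric columns; to_records() may be closer to what genfromtxt with names=True would have returned, since it keeps the field names.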

Reference:

  1. Original comment.
  2. Pandas documentation on read_csv.
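
If you do need to stay with NumPy, a likely explanation for "got 3 columns instead of 13" is that genfromtxt treats # as a comment marker by default, so everything after the # in a title such as "Q&A #2 ..." is discarded. Switching to comments='%' just moves the problem to titles containing %, such as "99% Invisible". The sketch below assumes that passing comments=None turns comment handling off, which matches the NumPy versions I am aware of but is worth checking against yours. The column names and the in-memory sample are hypothetical.

```python
import io

import numpy as np

# Hypothetical, trimmed-down sample: the column names are assumed,
# and only three of the thirteen fields are kept.
csv_text = (
    "channel;title;views\n"
    "3Blue1Brown;Q&A #2 + Net Neutrality Nuance;141644\n"
    "Adam Savage’s Tested;99% Invisible - The Adam Savage Project - 10/6/20;28334\n"
)

data = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=";",
    dtype=None,        # infer a type per column
    names=True,        # take field names from the first row
    comments=None,     # assumption: no character is treated as a comment marker
    encoding="utf-8",  # return str rather than bytes for text columns
)

print(data["title"])
print(data["views"])
```

With comment handling off, both the # and the % titles should survive intact. This only helps if the file has no real comment lines that need skipping.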