
I have a CSV file in the following format that I am trying to normalise. Each number is the count for the string next to it. The file contains close to 100K entries.

159028,CASSVDGSYEQYFGPG
86832,CASSLQLYFGEG
74720,CASSQDQDTQYFGPG
71701,CASSRVGSDYTFGSG
69360,CARNVTPPKSYAVFFGKG
52458,CAAEQFFGPG
51406,CASSSGDQDTQYFGPG
50305,CASQLYFGEG
38745,CAYFGPG
32565,CASSPDWGENTLYFGAG

I have tried to create a dictionary using the following:

import csv
input = csv.DictReader(open("data.csv"))
for row in input:
    print(row)

Result

{'159028': '86832', 'CASSVDGSYEQYFGPG': 'CASSLQLYFGEG'}
{'159028': '74720', 'CASSVDGSYEQYFGPG': 'CASSQDQDTQYFGPG'}
{'159028': '71701', 'CASSVDGSYEQYFGPG': 'CASSRVGSDYTFGSG'}
{'159028': '69360', 'CASSVDGSYEQYFGPG': 'CARNVTPPKSYAVFFGKG'}
{'159028': '52458', 'CASSVDGSYEQYFGPG': 'CAAEQFFGPG'}
{'159028': '51406', 'CASSVDGSYEQYFGPG': 'CASSSGDQDTQYFGPG'}
{'159028': '50305', 'CASSVDGSYEQYFGPG': 'CASQLYFGEG'}
{'159028': '38745', 'CASSVDGSYEQYFGPG': 'CAYFGPG'}
{'159028': '32565', 'CASSVDGSYEQYFGPG': 'CASSPDWGENTLYFGAG'}
...

Instead of

        {'CASSVDGSYEQYFGPG': 159028}        
        {'CASSLQLYFGEG': '86832'}
        {'CASSQDQDTQYFGPG': '74720'}
        {'CASSRVGSDYTFGSG': '71701'}
        {'CARNVTPPKSYAVFFGKG': '69360'}
        {'CAAEQFFGPG': '52458'}
        {'CASSSGDQDTQYFGPG': '51406'}
        {'CASQLYFGEG': '50305'}
        {'CAYFGPG': '38745'}
        {'CASSPDWGENTLYFGAG': '32565'}
        ...

I also tried converting the csv file into a numpy array, but I get the following:

>>> from numpy import genfromtxt
>>> data = genfromtxt('data.csv', delimiter=',')
>>> data
array([[  1.59028000e+05,              nan],
       [  8.68320000e+04,              nan],
       [  7.47200000e+04,              nan],
       ...,
       [  1.00000000e+00,              nan],
       [  1.00000000e+00,              nan],
       [  1.00000000e+00,              nan]])

There may be other ways of normalising and otherwise processing this data in Python.

  • file = {x[1]: x[0] for x in np.loadtxt("data.csv", dtype=str, delimiter=",")} Why a dictionary? Commented Sep 7, 2016 at 18:55
  • Do you want the 1st column as strings or integers? Commented Sep 7, 2016 at 19:58

2 Answers


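Use NumPy's loadtxt to import the file, then use a dict comprehension if you need it as a dict.

import numpy as np

# Read both columns as strings; each row is [count, string].
arr = np.loadtxt('data.csv', dtype=str, delimiter=",")

# Map each string to its count (counts stay strings; wrap x in int() for integers).
b = {y: x for x, y in arr}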
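
If NumPy is not needed, the standard library can build the same mapping with integer counts. csv.DictReader treats the first row as field names, which is why the output in the question is keyed by '159028' and 'CASSVDGSYEQYFGPG'; csv.reader returns plain rows instead. The sketch below is only an illustration: the helper names read_counts and normalise are made up here, "normalise" is assumed to mean dividing each count by the total, and the sample rows are copied from the question so the code runs without a file.

```python
import csv
import io

# Sample rows copied from the question; a real run would pass open("data.csv", newline="").
SAMPLE = """159028,CASSVDGSYEQYFGPG
86832,CASSLQLYFGEG
74720,CASSQDQDTQYFGPG
"""


def read_counts(handle):
    """Map each string to its integer count (column order: count, string)."""
    # Blank lines come back from csv.reader as empty lists, so skip them.
    return {row[1]: int(row[0]) for row in csv.reader(handle) if row}


def normalise(counts):
    """Assumed meaning of 'normalise': divide by the total so values sum to 1."""
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}


counts = read_counts(io.StringIO(SAMPLE))
freqs = normalise(counts)
print(counts["CASSLQLYFGEG"])
print(freqs["CASSLQLYFGEG"])
```

If the same string appears on more than one line, the dict keeps only the last count, so duplicates would need to be summed first.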

genfromtxt has many arguments, and it can take a while to learn the right incantation to read any given file.

Here's how you can do it with your file. The array data returned by genfromtxt is a one-dimensional structured array with two fields, called count and string:

In [11]: data = np.genfromtxt("counts_strings.csv", delimiter=',', names=['count', 'string'], dtype=None)

In [12]: data['count']
Out[12]: 
array([159028,  86832,  74720,  71701,  69360,  52458,  51406,  50305,
        38745,  32565])

In [13]: data['string']
Out[13]: 
array([b'CASSVDGSYEQYFGPG', b'CASSLQLYFGEG', b'CASSQDQDTQYFGPG',
       b'CASSRVGSDYTFGSG', b'CARNVTPPKSYAVFFGKG', b'CAAEQFFGPG',
       b'CASSSGDQDTQYFGPG', b'CASQLYFGEG', b'CAYFGPG', b'CASSPDWGENTLYFGAG'], 
      dtype='|S18')

In [14]: data[0]
Out[14]: (159028, b'CASSVDGSYEQYFGPG')
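
Once the data is in a structured array like this, normalising is a single vectorised operation. The following is a rough sketch, assuming "normalise" means converting counts to relative frequencies. The array is built by hand as a stand-in for the genfromtxt result (same field names, first three rows only) so the example does not depend on a file or on how a given NumPy version decodes strings.

```python
import numpy as np

# Hand-built stand-in for the array returned by genfromtxt above
# (same field names; 'S18' mirrors the byte strings in the output shown).
data = np.array(
    [
        (159028, b"CASSVDGSYEQYFGPG"),
        (86832, b"CASSLQLYFGEG"),
        (74720, b"CASSQDQDTQYFGPG"),
    ],
    dtype=[("count", "i8"), ("string", "S18")],
)

# Relative frequency of each string: its count divided by the total count.
freq = data["count"] / data["count"].sum()

# Decode the byte strings into ordinary str values.
strings = np.char.decode(data["string"], "ascii")

for s, f in zip(strings, freq):
    print(f"{s}\t{f:.4f}")
```

The decode step is only needed when the string field holds bytes, as in the output above.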
