
I have a data set stored in a NumPy array as shown below, but all the data inside it is stored as strings. How can I convert the strings to int or float, and store them back?

  data = numpy.array([]) # <--- array initialized with numpy.array

The data variable contains the following:

 [['1' '0' '3' ..., '7.25' '' 'S']
  ['2' '1' '1' ..., '71.2833' 'C85' 'C']
  ['3' '1' '3' ..., '7.925' '' 'S']
  ...,
  ['889' '0' '3' ..., '23.45' '' 'S']
  ['890' '1' '1' ..., '30' 'C148' 'C']
  ['891' '0' '3' ..., '7.75' '' 'Q']]

I want to convert the first column to int and store the values back. To do so, I tried:

 data[0::,0] = data[0::,0].astype(int)

but it didn't change anything.

  • Do you mean a recarray docs.scipy.org/doc/numpy/reference/generated/…? Commented Jul 19, 2015 at 12:00
  • where does ['1' '0' '3' ..., '7.25' '' 'S'].. come from originally? Commented Jul 19, 2015 at 13:10
  • What is the shape and dtype of data? Commented Jul 19, 2015 at 15:16

3 Answers


You could set the data type (dtype) at array initialization. For example, if your rows are composed of one 32-bit integer and one 4-byte string, you could specify the dtype 'i4, S4'.

data = np.array([(1, 'a'), (2, 'b')], dtype='i4, S4')

You can read more about dtypes in the NumPy documentation.
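As a sketch of how that applies here: with a structured dtype, each field keeps its own type and can be accessed by name ('f0', 'f1' are the default field names NumPy assigns for a dtype string like this):

```python
import numpy as np

# rows as tuples, one 32-bit int field and one 4-byte string field
data = np.array([(1, 'a'), (2, 'b')], dtype='i4, S4')

print(data['f0'])   # the integer column: [1 2]
print(data['f1'])   # the string column: [b'a' b'b']
data['f0'] += 10    # numeric operations work directly on the int field
```

Note that the values went in as a list of tuples, not a list of lists; that is how NumPy distinguishes a structured row from an extra array dimension.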


4 Comments

What is this doing exactly?
@PadraicCunningham You are specifying that the data type (dtype) for each row is a 4-byte integer and a 4-byte string.
I am not asking for myself; I already posted a link to a recarray in the comments. Some explanation for the OP, and of how he/she is going to get the original data object into an array with the first column as an integer, would be good.
@PadraicCunningham: In fact it sounded a strange question from someone skilled like you ;) I will add the details to the answer.

I can make an array that contains strings by starting with lists of strings; note the S4 dtype:

In [690]: data=np.array([['1','0','7.23','two'],['2','3','1.32','four']])

In [691]: data
Out[691]: 
array([['1', '0', '7.23', 'two'],
       ['2', '3', '1.32', 'four']], 
      dtype='|S4')

It's more likely that such an array is created by reading a csv file.

I can also view it as an array of single-byte strings; the shape and dtype have changed, but the data buffer is the same (the same 32 bytes):

In [692]: data.view('S1')
Out[692]: 
array([['1', '', '', '', '0', '', '', '', '7', '.', '2', '3', 't', 'w',
        'o', ''],
       ['2', '', '', '', '3', '', '', '', '1', '.', '3', '2', 'f', 'o',
        'u', 'r']], 
      dtype='|S1')

In fact, I can change an individual byte, changing the 'two' of the original array to 'twos':

In [693]: data.view('S1')[0,-1]='s'

In [694]: data
Out[694]: 
array([['1', '0', '7.23', 'twos'],
       ['2', '3', '1.32', 'four']], 
      dtype='|S4')

But if I try to change an element of data to an integer, it is converted to a string to match the S4 dtype:

In [695]: data[1,0]=4

In [696]: data
Out[696]: 
array([['1', '0', '7.23', 'twos'],
       ['4', '3', '1.32', 'four']], 
      dtype='|S4')

The same would happen if the number came from int(data[1,0]) or some variation on that.

But I can trick it into seeing the integer as a string of bytes (represented as \x04)

In [704]: data[1,0]=np.array(4).view('S4')

In [705]: data
Out[705]: 
array([['1', '0', '7.23', 'twos'],
       ['\x04', '3', '1.32', 'four']], 
      dtype='|S4')

Arrays can share data buffers. The data attribute is a pointer to a block of memory; it's the array's dtype that controls how that block is interpreted. For example, I can make another array of ints and redirect its data attribute:

In [714]: d2=np.zeros((2,4),dtype=int)

In [715]: d2
Out[715]: 
array([[0, 0, 0, 0],
       [0, 0, 0, 0]])

In [716]: d2.data=data.data  # change the data pointer

In [717]: d2
Out[717]: 
array([[        49,         48,  858926647, 1936684916],
       [         4,         51,  842214961, 1920298854]])

Now d2[1,0] is the integer 4. But the other items are not recognizable, because they are strings viewed as integers. That's not the same as passing them through the int() function.

I don't recommend changing the data pointer like this as a regular practice. It would be easy to mess things up. I had to take care to ensure that d2.nbytes was 32, the same as for data.
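A safer way to get the same reinterpretation, without touching the data pointer, is a plain view. A sketch (using an explicit 'S4' dtype, since Python 3 would otherwise default to a unicode dtype with a different byte layout):

```python
import numpy as np

data = np.array([[b'1', b'0', b'7.23', b'twos'],
                 [b'\x04', b'3', b'1.32', b'four']], dtype='S4')

# same 32-byte buffer, reinterpreted as little-endian 4-byte ints
d2 = data.view('<i4')
print(d2[0, 0])    # 49, the byte value of '1'

d2[0, 0] = 3       # writes through to data, since the buffer is shared
print(data[0, 0])  # b'\x03'
```

Because both arrays have 4-byte items, the view keeps the (2, 4) shape and NumPy checks the sizes for you, which is exactly the bookkeeping the .data assignment leaves to you.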

Because the buffer is shared, a change to d2 also appears in data (but displayed according to a different dtype):

In [718]: d2[0,0]=3

In [719]: data
Out[719]: 
array([['\x03', '0', '7.23', 'twos'],
       ['\x04', '3', '1.32', 'four']], 
      dtype='|S4')

A view with a complex dtype does something similar:

In [723]: data.view('i4,i4,f,|S4')
Out[723]: 
array([[(3, 48, 4.148588672592268e-08, 'twos')],
       [(4, 51, 1.042967401332362e-08, 'four')]], 
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<f4'), ('f3', 'S4')])

Notice the 48 and 51 that also appear in d2. The next float column is unrecognizable.

That gives an idea of what can and cannot be done 'in-place'.

But to get an array that contains numbers and strings in a meaningful way, it is better to construct a new structured array. Perhaps the cleanest way to do that is with an intermediate list of tuples.

In [759]: dl=[tuple(i) for i in data.tolist()]

In [760]: dl
Out[760]: [('1', '0', '7.23', 'two'), ('2', '3', '1.32', 'four')]

In [761]: np.array(dl,dtype='i4,i4,f,|S4')
Out[761]: 
array([(1, 0, 7.230000019073486, 'two'), (2, 3, 1.3200000524520874, 'four')], 
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<f4'), ('f3', 'S4')])

All these fields take up 4 bytes, so nbytes is the same. But the individual values have passed through converters; I have given np.array the freedom to convert values in whatever way is consistent with the input and the new dtype. That's a lot easier than trying to perform some sort of convoluted in-place conversion.
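Applied to the original question, the practical takeaway is that a converted column has to live in a new array (or a new structured array); the source array keeps its string dtype. A sketch, with a small stand-in for the OP's data:

```python
import numpy as np

data = np.array([['1', '0', '7.25', 'S'],
                 ['2', '1', '71.2833', 'C']])  # all-string array, like the OP's

ids = data[:, 0].astype(int)      # a new int array; data itself is unchanged
fares = data[:, 2].astype(float)  # likewise for the float column
print(ids)    # [1 2]
print(fares)  # [ 7.25    71.2833]
```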

A list of tuples with a mix of numbers and strings would also have worked:

[(1, 0, 7.23, 'two'), (2, 3, 1.32, 'four')]

Structured arrays are displayed as lists of tuples, and in the structured array docs values are always input as lists of tuples.

A recarray could also be used, but that is essentially just an array subclass that lets you access fields as attributes.

If the original array was generated from a csv file, it would have been better to use np.genfromtxt (or loadtxt) with appropriate options. It can generate the appropriate list(s) of tuples, and return a structured array directly.
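A sketch of that route, reading comma-separated text with dtype=None so genfromtxt infers a per-column type ('f0', 'f3', etc. are the default field names it assigns; the StringIO stands in for a real csv file):

```python
import io
import numpy as np

csv = io.StringIO("1,0,3,7.25,,S\n"
                  "2,1,1,71.2833,C85,C\n")

# dtype=None lets genfromtxt pick int, float, or string per column
arr = np.genfromtxt(csv, delimiter=',', dtype=None, encoding='utf-8')
print(arr['f0'])   # integer column
print(arr['f3'])   # float column
```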



NumPy arrays have an associated type for their elements. Assigning to a slice of a NumPy array casts the new data to that type. If that's not possible, the assignment fails with an exception:

import numpy
a = numpy.array([[1, 2], [3, 4]])
print(a)
# [[1 2]
#  [3 4]]
print(a.dtype)
# int64

a[0, 0] = 'look, a string'
# ValueError: invalid literal for int() with base 10: 'look, a string'

In your case, data[0::, 0].astype(int) produces a NumPy array with element type int64, but assigning it back into a slice of the original array converts the values back to strings.
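A minimal sketch of that round trip, showing why the assignment appears to do nothing:

```python
import numpy as np

data = np.array([['1', '0'], ['2', '1']])  # string array

col = data[:, 0].astype(int)  # a genuine int array...
data[:, 0] = col              # ...but assigning it back re-casts to strings

print(data.dtype)  # still a string dtype, so nothing visibly changed
```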

Unlike standard NumPy arrays, the NumPy record arrays mentioned in Padraic's comment allow different types for different columns.

I don't know if a standard NumPy array can be converted to a NumPy record array in-place, so constructing one like suggested in enrico's answer with

data = np.array([(1, 'a'), (2, 'b')], dtype='i4, S4')

might be the best option. If that's not possible, you can construct one from your standard NumPy array and overwrite the variable with the result:

import numpy
data = numpy.array([['1', '0', '3', '7.25', '', 'S'],
                    ['2', '1', '1', '71.2833', 'C85', 'C'],
                    ['3', '1', '3', '7.925', '', 'S'],
                    ['889', '0', '3', '23.45', '', 'S'],
                    ['890', '1', '1', '30', 'C148', 'C'],
                    ['891', '0', '3', '7.75', '', 'Q']])
print(repr(data))
# array([['1', '0', '3', '7.25', '', 'S'],
#        ['2', '1', '1', '71.2833', 'C85', 'C'],
#        ['3', '1', '3', '7.925', '', 'S'],
#        ['889', '0', '3', '23.45', '', 'S'],
#        ['890', '1', '1', '30', 'C148', 'C'],
#        ['891', '0', '3', '7.75', '', 'Q']], 
#       dtype='|S7')

data = numpy.core.records.fromarrays(data.T, dtype='i4,S4,S4,S4,S4,S4')
print(repr(data))
# rec.array([(1, '0', '3', '7.25', '', 'S'), (2, '1', '1', '71.2', 'C85', 'C'),
#        (3, '1', '3', '7.92', '', 'S'), (889, '0', '3', '23.4', '', 'S'),
#        (890, '1', '1', '30', 'C148', 'C'), (891, '0', '3', '7.75', '', 'Q')], 
#       dtype=[('f0', '<i4'), ('f1', '|S4'), ('f2', '|S4'), ('f3', '|S4'), ('f4', '|S4'), ('f5', '|S4')])

3 Comments

Does someone know whether an in-place conversion is possible or how a record array would be constructed from a standard NumPy array? @PadraicCunningham, maybe?
Not sure about in-place, but if data were a list of Python lists you could use np.array(list(map(tuple, data)), dtype="i4,S4,S4,S4,S4,S4"); if it were an array you could use np.core.records.fromarrays(data.T, dtype="i4,S4,S4,S4,S4,S4")
In-place conversions have to leave the total data buffer size unchanged. An 'i4' dtype can be exchanged for 4 'i1' values, or (I think) 4 'S1'. But interpreting strings as ints or floats changes the number of bytes, and can't be done in-place.
