
I need to organize a data file with chunks of named data, where the data are NumPy arrays. But I don't want to use the numpy.save or numpy.savez functions, because in some cases the data has to be sent to a server over a pipe or another interface. So I want to dump a NumPy array into memory, zip it, and then send it to the server.

I've tried simple pickling, like this:

try:
    import cPickle as pkl
except ImportError:
    import pickle as pkl
import zlib
import numpy as np

def send_to_db(data, compress=5):
    send(zlib.compress(pkl.dumps(data), compress))

...but this is an extremely slow process.

Even with compress level 0 (i.e. no compression), the process is very slow, purely because of the pickling.

Is there any way to dump a NumPy array into a string without pickle? I know that numpy lets you grab a raw buffer with numpy.getbuffer, but it isn't obvious to me how to use that dumped buffer to obtain an array back.

  • Wait, why don't you want to use numpy.save? It will be the fastest, most portable way... Commented May 11, 2017 at 21:13
  • @juanpa.arrivillaga I need to stream results of simulations from cluster nodes to the head node, to save them there. So I want to zip them before sending. numpy.save is very fast, you're right, but unfortunately it works only with files, not with memory? Commented May 11, 2017 at 21:22
  • A "file" is just an abstraction. You can still do it in-memory. See my answer. Commented May 11, 2017 at 21:24
  • The 'pickle' method for an ndarray is its save function. Commented May 11, 2017 at 21:51
  • @hpaulj are you sure? Why is juanpa.arrivillaga's solution so wonderfully fast while the pickle code is so horribly slow? Commented May 11, 2017 at 22:09

3 Answers


You should definitely use numpy.save; you can still do it in-memory:

>>> import io
>>> import numpy as np
>>> import zlib
>>> f = io.BytesIO()
>>> arr = np.random.rand(100, 100)
>>> np.save(f, arr)
>>> compressed = zlib.compress(f.getbuffer())

And to decompress, reverse the process:

>>> np.load(io.BytesIO(zlib.decompress(compressed)))
array([[ 0.80881898,  0.50553303,  0.03859795, ...,  0.05850996,
         0.9174782 ,  0.48671767],
       [ 0.79715979,  0.81465744,  0.93529834, ...,  0.53577085,
         0.59098735,  0.22716425],
       [ 0.49570713,  0.09599001,  0.74023709, ...,  0.85172897,
         0.05066641,  0.10364143],
       ...,
       [ 0.89720137,  0.60616688,  0.62966729, ...,  0.6206728 ,
         0.96160519,  0.69746633],
       [ 0.59276237,  0.71586014,  0.35959289, ...,  0.46977027,
         0.46586237,  0.10949621],
       [ 0.8075795 ,  0.70107856,  0.81389246, ...,  0.92068768,
         0.38013495,  0.21489793]])
>>>

Which, as you can see, matches what we saved earlier:

>>> arr
array([[ 0.80881898,  0.50553303,  0.03859795, ...,  0.05850996,
         0.9174782 ,  0.48671767],
       [ 0.79715979,  0.81465744,  0.93529834, ...,  0.53577085,
         0.59098735,  0.22716425],
       [ 0.49570713,  0.09599001,  0.74023709, ...,  0.85172897,
         0.05066641,  0.10364143],
       ...,
       [ 0.89720137,  0.60616688,  0.62966729, ...,  0.6206728 ,
         0.96160519,  0.69746633],
       [ 0.59276237,  0.71586014,  0.35959289, ...,  0.46977027,
         0.46586237,  0.10949621],
       [ 0.8075795 ,  0.70107856,  0.81389246, ...,  0.92068768,
         0.38013495,  0.21489793]])
>>>
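For convenience, here is the same round trip packaged into two helpers in the shape of the OP's send_to_db (the helper names dump_array and load_array are mine, not part of numpy):

import io
import zlib
import numpy as np

def dump_array(arr, compress=5):
    # Serialize with np.save into an in-memory buffer, then zlib-compress.
    f = io.BytesIO()
    np.save(f, arr)
    return zlib.compress(f.getbuffer(), compress)

def load_array(blob):
    # Inverse: decompress, then np.load from an in-memory buffer.
    return np.load(io.BytesIO(zlib.decompress(blob)))

arr = np.random.rand(100, 100)
assert np.array_equal(load_array(dump_array(arr)), arr)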

6 Comments

@rth yup, it's super handy.
But can you give an example for decompression?
@rth sure, check it out
Note: Unless your data has a lot of repetition, compression probably won't gain you much and may slow you down. It's worth checking whether compression pays off in total transfer time; if you're using a stream-oriented pipe/socket/whatever, it might be faster (and less memory-intensive) to pass the stream directly as the argument to numpy.save; depending on the implementation, you might manage to begin writing to the socket immediately, without additional memory overhead (see the sketch after these comments).
And of course, using proper serialization (instead of pickle protocol 0) should reduce size by a factor of ~3x without compression.
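A minimal sketch of the streaming idea from the comment above, assuming a hypothetical host and port; np.save only needs a writable file-like object, which socket.makefile provides:

import socket
import numpy as np

arr = np.random.rand(100, 100)

# Hypothetical head-node address; replace with your own.
with socket.create_connection(("head-node.example", 9000)) as sock:
    with sock.makefile("wb") as f:
        np.save(f, arr)  # writes header + data straight to the socket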

The default pickle protocol produces pure ASCII output. To get (much) better performance, use the latest version available. Protocols 2 and above are binary and, if memory serves me right, allow numpy arrays to dump their buffers directly into the stream without additional operations.

To select the protocol, pass the optional argument while pickling (no need to specify it while unpickling), for instance pkl.dumps(data, 2). To pick the latest possible version, use pkl.dumps(data, -1).

Note that if you use different Python versions, you need to specify the lowest commonly supported protocol. See the pickle documentation for details on the different versions.
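For illustration, a minimal sketch comparing protocol 0 against the latest available protocol (assuming Python 3, where the module is simply pickle):

import pickle as pkl
import numpy as np

arr = np.random.rand(100, 100)

ascii_pkl = pkl.dumps(arr, 0)    # protocol 0: ASCII, slow and large
binary_pkl = pkl.dumps(arr, -1)  # -1 selects the highest available protocol

print(len(ascii_pkl), len(binary_pkl))  # the binary pickle is several times smaller
assert np.array_equal(pkl.loads(binary_pkl), arr)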

1 Comment

Note: This not only runs much faster, it also produces much smaller pickles. In local microbenchmarks, for a 10x10 numpy array (produced with numpy.random.random_sample), pickling with protocol 0 took ~80 µs and produced 2419 bytes of output, while protocol 2 took ~10 µs and produced only 934 bytes (100 C doubles need a minimum of 800 bytes to store). For 1000x1000, the times go to 6.24 ms and 220 KB for protocol 0, vs. 22 µs and 80,134 bytes for protocol 2 (apparently a fixed 134 bytes to store type and shape).

There is a method tobytes which, according to my benchmarks, is faster than other alternatives.

Take this with a grain of salt, as some of my experiments may be misguided or plainly wrong, but it is a way of dumping a numpy array into a string.

Keep in mind that you will need to hold some additional data out of band, mainly the data type of the array and also its shape. That may be a deal breaker, or it may not be relevant. It's easy to recover the original array by calling np.frombuffer(..., dtype=...).reshape(...).


Edit: a possibly incomplete example

##############
# Generation #
##############
import numpy as np

arr = np.random.randint(1, 7, (4,6))
arr_dtype = arr.dtype.str
arr_shape = arr.shape
arr_data = arr.tobytes()

# Now send / store arr_dtype, arr_shape, arr_data, where:
# arr_dtype is string
# arr_shape is tuple of integers
# arr_data is bytes

############
# Recovery #
############

arr = np.frombuffer(arr_data, dtype=arr_dtype).reshape(arr_shape)

I am not considering the column/row ordering, because I know numpy has support for that but I have never used it. If you want to support, or need, the memory arranged in a specific fashion (row- versus column-major for multidimensional arrays), you may have to take that into account at some point; a sketch follows.
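A minimal sketch of handling the layout explicitly, assuming you standardize on C (row-major) order for the transfer:

import numpy as np

arr = np.asfortranarray(np.random.rand(4, 6))  # deliberately column-major

arr_c = np.ascontiguousarray(arr)  # force C order before serializing
arr_data = arr_c.tobytes()         # equivalently: arr.tobytes(order='C')

# The receiver can now safely assume C order:
rebuilt = np.frombuffer(arr_data, dtype=arr.dtype.str).reshape(arr.shape)
assert np.array_equal(rebuilt, arr)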

Also: frombuffer doesn't copy the buffer data; it creates the numpy structure as a view (maybe not exactly that, but you know what I mean). If that's undesired behaviour, you can use fromstring (which is deprecated but seems to work on 1.19) or follow frombuffer with an np.copy.
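To make the view/copy distinction concrete, a small sketch reusing arr_data, arr_dtype and arr_shape from the example above:

view = np.frombuffer(arr_data, dtype=arr_dtype).reshape(arr_shape)
print(view.flags.writeable)  # False: a read-only view over arr_data

owned = view.copy()  # an independent, writable array
owned[0, 0] = 0      # fine; does not touch arr_data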

5 Comments

thank you for sharing. Could you please elaborate on the answer and provide a minimal example? It would be very handy if you could show how one can recover the shape.
@rth the answer now contains an example of that. Is that what you had in mind? The benchmarking will depend on the specific needs of the serialization and deserialization, so it should be done on a per-case basis.
This can have problems with endianness
the problem with tobytes and frombuffer is that they don't carry all the information needed to rebuild the array (shape, dtype, etc.), nor the information needed to recover an equivalent array on a different architecture (endianness). numpy.save is the "correct" way to serialize a numpy.ndarray object when portability is an issue (e.g. when sending it to a server ...)
@juanpa.arrivillaga You are completely right, and I would keep your answer as it is. But the OP was asking with a focus on speed, and it would not surprise me if using these lower-level mechanisms were faster than relying on numpy.save. This assumes you know your data structures (and/or requires you to send that information too) and places more burden on the programmer. But hey, that depends on a benchmark I have not done and on how much performance matters to you. Probably a trade-off.
