
I need to organize a data file with chunks of named data, where the data are NumPy arrays. But I don't want to use the numpy.save or numpy.savez functions, because in some cases the data has to be sent to a server over a pipe or another interface. So I want to dump a NumPy array into memory, zip it, and then send it to the server.

I've tried simple pickling, like this:

try:
    import cPickle as pkl
except ImportError:
    import pickle as pkl
import zlib
import numpy as np

def send_to_db(data, compress=5):
    send(zlib.compress(pkl.dumps(data), compress))

...but this is an extremely slow process.

Even with compress level 0 (i.e. no compression), the process is very slow, purely because of the pickling.

Is there any way to dump a NumPy array into a string without pickle? I know that numpy lets you grab a raw buffer with numpy.getbuffer, but it isn't obvious to me how to use that dumped buffer to obtain an array back.

  • Wait, why don't you want to use numpy.save? It will be the fastest, most portable way... Commented May 11, 2017 at 21:13
  • @juanpa.arrivillaga I need to stream results of simulations from cluster nodes to the head node, to save them there. So I want to zip them before sending. numpy.save is very fast, you're right, but unfortunately it works only with files, not with memory? Commented May 11, 2017 at 21:22
  • A "file" is just an abstraction. You can still do it in-memory. See my answer. Commented May 11, 2017 at 21:24
  • The 'pickle' method for an ndarray is its save function. Commented May 11, 2017 at 21:51
  • @hpaulj are you sure? Why is juanpa.arrivillaga's solution so wonderfully fast while the pickle code is so horribly slow? Commented May 11, 2017 at 22:09

3 Answers


You should definitely use numpy.save; you can still do it in-memory:

>>> import io
>>> import numpy as np
>>> import zlib
>>> f = io.BytesIO()
>>> arr = np.random.rand(100, 100)
>>> np.save(f, arr)
>>> compressed = zlib.compress(f.getbuffer())

And to decompress, reverse the process:

>>> np.load(io.BytesIO(zlib.decompress(compressed)))
array([[ 0.80881898,  0.50553303,  0.03859795, ...,  0.05850996,
         0.9174782 ,  0.48671767],
       [ 0.79715979,  0.81465744,  0.93529834, ...,  0.53577085,
         0.59098735,  0.22716425],
       [ 0.49570713,  0.09599001,  0.74023709, ...,  0.85172897,
         0.05066641,  0.10364143],
       ...,
       [ 0.89720137,  0.60616688,  0.62966729, ...,  0.6206728 ,
         0.96160519,  0.69746633],
       [ 0.59276237,  0.71586014,  0.35959289, ...,  0.46977027,
         0.46586237,  0.10949621],
       [ 0.8075795 ,  0.70107856,  0.81389246, ...,  0.92068768,
         0.38013495,  0.21489793]])
>>>

Which, as you can see, matches what we saved earlier:

>>> arr
array([[ 0.80881898,  0.50553303,  0.03859795, ...,  0.05850996,
         0.9174782 ,  0.48671767],
       [ 0.79715979,  0.81465744,  0.93529834, ...,  0.53577085,
         0.59098735,  0.22716425],
       [ 0.49570713,  0.09599001,  0.74023709, ...,  0.85172897,
         0.05066641,  0.10364143],
       ...,
       [ 0.89720137,  0.60616688,  0.62966729, ...,  0.6206728 ,
         0.96160519,  0.69746633],
       [ 0.59276237,  0.71586014,  0.35959289, ...,  0.46977027,
         0.46586237,  0.10949621],
       [ 0.8075795 ,  0.70107856,  0.81389246, ...,  0.92068768,
         0.38013495,  0.21489793]])
>>>
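For convenience, here is the same round trip packaged into two helpers in the shape of the OP's send_to_db (the helper names dump_array and load_array are mine, not part of numpy):

import io
import zlib
import numpy as np

def dump_array(arr, compress=5):
    # Serialize with np.save into an in-memory buffer, then zlib-compress.
    f = io.BytesIO()
    np.save(f, arr)
    return zlib.compress(f.getbuffer(), compress)

def load_array(blob):
    # Inverse: decompress, then np.load from an in-memory buffer.
    return np.load(io.BytesIO(zlib.decompress(blob)))

arr = np.random.rand(100, 100)
assert np.array_equal(load_array(dump_array(arr)), arr)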

6 Comments

@rth yup, it's super handy.
But can you give an example for decompression?
@rth sure, check it out
Note: Unless your data has a lot of repetition, compression probably won't gain you much and may slow you down. It's worth checking whether compression pays off in total transfer time; if you're using a stream-oriented pipe/socket/whatever, it might be faster (and less memory-intensive) to pass the stream directly as the argument to numpy.save; depending on the implementation, you might manage to begin writing to the socket immediately, without additional memory overhead (see the sketch after these comments).
And of course, using proper serialization (instead of pickle protocol 0) should reduce size by a factor of ~3x without compression.
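A minimal sketch of the streaming idea from the comment above, assuming a hypothetical host and port; np.save only needs a writable file-like object, which socket.makefile provides:

import socket
import numpy as np

arr = np.random.rand(100, 100)

# Hypothetical head-node address; replace with your own.
with socket.create_connection(("head-node.example", 9000)) as sock:
    with sock.makefile("wb") as f:
        np.save(f, arr)  # writes header + data straight to the socket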

The default pickle protocol produces pure ASCII output. To get (much) better performance, use the latest version available. Protocols 2 and above are binary and, if memory serves me right, allow numpy arrays to dump their buffers directly into the stream without additional operations.

To select the protocol, pass the optional argument while pickling (no need to specify it while unpickling), for instance pkl.dumps(data, 2). To pick the latest possible version, use pkl.dumps(data, -1).

Note that if you use different Python versions, you need to specify the lowest commonly supported protocol. See the pickle documentation for details on the different versions.
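For illustration, a minimal sketch comparing protocol 0 against the latest available protocol (assuming Python 3, where the module is simply pickle):

import pickle as pkl
import numpy as np

arr = np.random.rand(100, 100)

ascii_pkl = pkl.dumps(arr, 0)    # protocol 0: ASCII, slow and large
binary_pkl = pkl.dumps(arr, -1)  # -1 selects the highest available protocol

print(len(ascii_pkl), len(binary_pkl))  # the binary pickle is several times smaller
assert np.array_equal(pkl.loads(binary_pkl), arr)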

1 Comment

Note: This not only runs much faster, it also produces much smaller pickles. In local microbenchmarks, for a 10x10 numpy array (produced with numpy.random.random_sample), pickling with protocol 0 took ~80 µs and produced 2419 bytes of output, while protocol 2 took ~10 µs and produced only 934 bytes (100 C doubles need a minimum of 800 bytes to store). For 1000x1000, the times go to 6.24 ms and 220 KB for protocol 0, vs. 22 µs and 80,134 bytes for protocol 2 (apparently a fixed 134 bytes to store type and shape).

There is a method tobytes which, according to my benchmarks, is faster than other alternatives.

Take this with a grain of salt, as some of my experiments may be misguided or plainly wrong, but it is a way of dumping a numpy array into a string.

Keep in mind that you will need to hold some additional data out of band, mainly the data type of the array and also its shape. That may be a deal breaker, or it may not be relevant. It's easy to recover the original array by calling np.frombuffer(..., dtype=...).reshape(...).


Edit: a possibly incomplete example

##############
# Generation #
##############
import numpy as np

arr = np.random.randint(1, 7, (4,6))
arr_dtype = arr.dtype.str
arr_shape = arr.shape
arr_data = arr.tobytes()

# Now send / store arr_dtype, arr_shape, arr_data, where:
# arr_dtype is string
# arr_shape is tuple of integers
# arr_data is bytes

############
# Recovery #
############

arr = np.frombuffer(arr_data, dtype=arr_dtype).reshape(arr_shape)

I am not considering the column/row ordering, because I know numpy has support for that but I have never used it. If you want to support, or need, the memory arranged in a specific fashion (row- versus column-major for multidimensional arrays), you may have to take that into account at some point; a sketch follows.
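A minimal sketch of handling the layout explicitly, assuming you standardize on C (row-major) order for the transfer:

import numpy as np

arr = np.asfortranarray(np.random.rand(4, 6))  # deliberately column-major

arr_c = np.ascontiguousarray(arr)  # force C order before serializing
arr_data = arr_c.tobytes()         # equivalently: arr.tobytes(order='C')

# The receiver can now safely assume C order:
rebuilt = np.frombuffer(arr_data, dtype=arr.dtype.str).reshape(arr.shape)
assert np.array_equal(rebuilt, arr)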

Also: frombuffer doesn't copy the buffer data; it creates the numpy structure as a view (maybe not exactly that, but you know what I mean). If that's undesired behaviour, you can use fromstring (which is deprecated but seems to work on 1.19) or follow frombuffer with an np.copy.
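To make the view/copy distinction concrete, a small sketch reusing arr_data, arr_dtype and arr_shape from the example above:

view = np.frombuffer(arr_data, dtype=arr_dtype).reshape(arr_shape)
print(view.flags.writeable)  # False: a read-only view over arr_data

owned = view.copy()  # an independent, writable array
owned[0, 0] = 0      # fine; does not touch arr_data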

5 Comments

thank you for sharing. Could you please elaborate on the answer and provide a minimal example? It would be very handy if you could show how one can recover the shape.
@rth the answer now contains an example of that. Is that what you had in mind? The benchmarking will depend on the specific needs of the serialization and deserialization, so it should be done on a per-case basis.
This can have problems with endianness
the problem with tobytes and frombuffer is that they don't carry all the information needed to rebuild the array (shape, dtype, etc.), nor the information needed to recover an equivalent array on a different architecture (endianness). numpy.save is the "correct" way to serialize a numpy.ndarray object when portability is an issue (e.g. when sending it to a server ...)
@juanpa.arrivillaga You are completely right, and I would keep your answer as it is. But the OP was asking with a focus on speed, and it would not surprise me if using these lower-level mechanisms were faster than relying on numpy.save. This assumes you know your data structures (and/or requires you to send that information too) and places more burden on the programmer. But hey, that depends on a benchmark I have not done and on how much performance matters to you. Probably a trade-off.
