
I'm running some simulations that were going too slowly, so I profiled my code and found that over 90 percent of the time was being spent converting a (2D) numpy array to a string, as in:

import numpy as np

arr = np.ones(25000).reshape(5000, 5)
s = '\n'.join('\t'.join([str(x) for x in row]) for row in arr)

I tried a bunch of different solutions (using map, converting the array with astype(str), casting to a list), but most gave only marginal improvement.

Eventually I gave up on trying to convert the array to a string and saved it to a file on its own using np.save(fn, arr), which gave a 2000x(!) speedup. Is there a way to write the array as a text file with similar performance?
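For reference, the binary round-trip looks roughly like this (the file name is illustrative):

import numpy as np

arr = np.ones(25000).reshape(5000, 5)

# Binary write: dumps the raw array buffer, with no per-element string formatting.
np.save('results.npy', arr)

# Reading it back is equally cheap.
loaded = np.load('results.npy')
assert np.array_equal(arr, loaded)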


2 Answers


Converting a numpy array to human-readable form should never dominate the run time of your simulation. In fact, it shouldn't even contribute to it significantly.

You should solve this problem on a different level. Ask yourself: how often do you really need to write the array to a file in human-readable form? Does it need to happen so often or so regularly that it significantly affects the run time of your code? Would it be sufficient to do it only once, when a final result is available?

When you take this approach, you probably do not need to optimize your current writing method at all. To put some numbers on it: suppose your simulation takes about one hour (without writing the result to disk). Then you will probably agree that it is fine if your code spends another 10 seconds writing the result to disk in human-readable form, and it hardly matters whether that takes 1 second, 10 seconds, or 100 seconds.

If for some reason you really need to regularly write your intermediate results to disk for later processing -- minimize the frequency, and use a binary data format.
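As a minimal sketch of that pattern (run_simulation is a hypothetical stand-in for one simulation run; names are illustrative): accumulate intermediate results in memory and do a single binary write at the end, instead of a text conversion inside the loop.

import numpy as np

def run_simulation(i):
    # Hypothetical stand-in for one simulation run.
    return np.full((5000, 5), float(i))

# Accumulate in memory; no string conversion inside the loop.
results = [run_simulation(i) for i in range(100)]

# One binary write at the end.
np.save('all_results.npy', np.stack(results))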


6 Comments

Yep, that's what I ended up doing: each simulation only took about 1.5 milliseconds, and then the conversion to a string took about 500 milliseconds.
So -- is your problem solved? If it is not: how many of these short simulations do you need to perform? What is the output file for? For humans or for machines? How large are these output files? Is I/O a limiting factor?
Yeah, the problem is solved. I was just wondering if there is actually a way to write the numpy array to a string on the same order of performance as np.save(fn, arr).
And to answer the other questions: I need 1 million simulations; the output file is for machines (the output here is read in by another analysis script). I was writing a string header into each file to make sure that the data and the parameters that generated it could not get separated. To fix it, I moved the header into its own file in the same folder as the output.
I see. Just so you know, professional HPC software uses the NetCDF file format or the Hierarchical Data Format (HDF) for these kinds of things. Storing such data in ASCII (or any human-readable form) requires CPU-costly conversion, takes much more space on disk, and slows processing down significantly.
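To illustrate (a sketch assuming the h5py package; the file, dataset, and attribute names are made up): HDF5 lets you attach the generating parameters directly to the dataset as attributes, so the data and its parameters cannot get separated.

import numpy as np
import h5py

arr = np.ones(25000).reshape(5000, 5)

with h5py.File('results.h5', 'w') as f:
    dset = f.create_dataset('simulation', data=arr)
    # Attributes travel with the dataset, so the parameters that
    # generated the data stay attached to it.
    dset.attrs['n_runs'] = 1000000
    dset.attrs['seed'] = 42  # illustrative parameter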

Try using np.savetxt("file", arr). See the documentation here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html
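A minimal usage sketch (file name illustrative): the delimiter and fmt arguments control the text layout, and savetxt can also embed a header line, which may help with the header/data separation issue mentioned in the comments above. A simple format such as '%g' can also be noticeably cheaper than the default '%.18e'.

import numpy as np

arr = np.ones(25000).reshape(5000, 5)

# Tab-separated text output; fmt controls how each element is rendered.
# The header string is written as a first line prefixed with '#'.
np.savetxt('results.txt', arr, delimiter='\t', fmt='%g',
           header='simulation parameters go here')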

1 Comment

I don't quite get the downvotes here, since the OP doesn't mention having tried this. IMO a hand-rolled Python loop is never going to beat savetxt, which is purpose-built for the job, so this does answer the question. That said, Jan's answer is best: don't optimise this bit, or work out a way to use binary data.
