
I have written a function which takes an N by N array and computes an output array based on it.

Here's what my code looks like:

import numpy as np

def calculate_output(input, N):
    output = np.zeros((N, N))
    for y in range(N):
        for x in range(N):
            # Take the best of coming from the cell above or the cell to the left.
            val1 = 0 if y - 1 < 0 else output[y - 1][x] + input[y][x]
            val2 = 0 if x - 1 < 0 else output[y][x - 1] + input[y][x]
            output[y][x] = max(val1, val2)
    return output

N = 10000
input = np.reshape(np.random.binomial(1, [0.25] * N * N), (N, N))
output = calculate_output(input, N)

However, this computation is not fast enough: it takes about 300 seconds on my machine (compared to about 3 seconds for an equivalent C++ implementation). Is there any way to improve this without writing a C extension? I have tried using PyPy, but in this case the code is even slower under PyPy.

3 Comments
  • This is your attempted solution for this, yes? Commented Jul 2, 2021 at 13:06
  • If yes, your solution almost does that. To be correct you need to have output[y][x] = input[y][x] + max(val1,val2) instead of adding the input in the conditional, since we count the weight of the starting square (see the sketch after these comments). Commented Jul 2, 2021 at 13:22
  • Hi, indeed it is. I applied your advice and it did in fact improve the performance; however, it is still quite slow. Is there any way to improve this further? Commented Jul 2, 2021 at 14:36
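
Based on that comment, a minimal sketch of the corrected version (still pure Python and therefore still slow) might look like this; it keeps the question's variable names and simply moves the input[y][x] term out of the conditionals:

import numpy as np

def calculate_output(input, N):
    output = np.zeros((N, N))
    for y in range(N):
        for x in range(N):
            val1 = 0 if y - 1 < 0 else output[y - 1][x]
            val2 = 0 if x - 1 < 0 else output[y][x - 1]
            # Always count the weight of the current square.
            output[y][x] = input[y][x] + max(val1, val2)
    return output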

1 Answer


CPython is very slow here because it is an interpreter, and an interpreter simply cannot compete with C and C++ on this kind of tight loop. The usual way to reduce the interpreter overhead is to avoid explicit loops as much as possible and use a few NumPy vectorized calls instead. However, in this case that is barely possible: each output cell depends on output cells computed earlier by the same loops (the one above and the one to the left), so the computation cannot be expressed as a handful of whole-array operations.
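
To illustrate what is meant by vectorized calls (a generic toy sketch, unrelated to the question's recurrence), replacing an element-wise Python loop with a single NumPy operation moves the whole loop into compiled code:

import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Interpreted loop: one bytecode dispatch per element.
out_loop = np.empty_like(a)
for i in range(a.size):
    out_loop[i] = a[i] + b[i]

# Vectorized call: the whole loop runs inside NumPy's compiled code.
out_vec = a + b

assert np.allclose(out_loop, out_vec)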

On the other hand, PyPy is often much better for numerical code thanks to its JIT compilation. However, its NumPy support is not great, mainly because PyPy relied on a reimplementation of NumPy written in Python, which is not as good as the native NumPy implementation, and the native implementation itself cannot be made efficient under PyPy because of the way Python extension modules are currently implemented. To put it shortly, AFAIK the PyPy JIT cannot optimize NumPy accesses made through the native implementation. As a result, the JIT can end up slower than the CPython interpreter in your case.

However, you can speed up the code a lot using the Numba JIT compiler, which was written for exactly this use case. Moreover, a few optimizations can be applied to speed up the code even more (whatever programming language is used):

  • Conditionals are generally slow; you can hoist them out of the hot loop by handling the borders (first row and first column) in separate loops.
  • Filling the output matrix with zeros initially is not required, since every cell is overwritten anyway, and it is actually slower.
  • Using direct 2D indexing (output[y, x]) is cleaner and likely a bit faster than chained indexing (output[y][x]).
  • Integers can be used instead of floating-point numbers, since the output contains only integers and integer operations are faster than the equivalent floating-point ones.
import numba as nb
import numpy as np

@nb.njit(['int32[:,::1](int32[:,::1],int32)', 'int64[:,::1](int64[:,::1],int64)'])
def calculate_output(input, N):
    output = np.empty((N, N), input.dtype)
    # First row: only the left neighbour can exist.
    for x in range(0, N):
        val2 = 0 if x - 1 < 0 else output[0, x - 1] + input[0, x]
        output[0, x] = max(0, val2)
    # First column: only the neighbour above can exist (y starts at 1, so no border check is needed).
    for y in range(1, N):
        val1 = output[y - 1, 0] + input[y, 0]
        output[y, 0] = max(val1, 0)
    # Main part: no border checks needed any more.
    for y in range(1, N):
        for x in range(1, N):
            val1 = output[y - 1, x] + input[y, x]
            val2 = output[y, x - 1] + input[y, x]
            output[y, x] = max(val1, val2)
    return output

The resulting calculate_output call is 730 times faster on my machine.
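
As a rough usage sketch (assuming the Numba-decorated calculate_output above is defined in the same module; variable names here are illustrative and timings will differ per machine), it can be driven and timed like this:

import time
import numpy as np

N = 10000
# np.random.binomial returns a C-contiguous integer array (typically int64 on
# 64-bit Linux/macOS, int32 on Windows), so it matches one of the two
# signatures declared in the @nb.njit decorator above.
inp = np.random.binomial(1, 0.25, size=(N, N))

# Because explicit signatures were given, Numba compiled the function eagerly
# at decoration time, so this timing measures only the execution itself.
start = time.perf_counter()
out = calculate_output(inp, N)
print(f"calculate_output took {time.perf_counter() - start:.3f} s")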
