CPython is very slow because it is an interpreter and it clearly cannot compete with C and C++ in such a case. The usual approach to reduce the cost of the interpreter is to avoid loops as much as possible and use few Numpy vectorized calls instead. However in this case, it is barely possible to write an efficient implementation using Numpy vectorized calls.
On the other hand PyPy is often much better for numerical codes because of the JIT compilation. But its implementation of Numpy is not great at all mainly because they used an implementation of Numpy rewritten in Python which is not as good as the native Numpy implementation and the native implementation would not be efficient because of the way Python modules are currently implemented. To put it shortly, AFAIK, the PyPy JIT cannot optimize Numpy access with the native implementation. As the result, the JIT can be slower than the CPython interpreter in your case.
However, you can speed up the code a lot using the Numba JIT compiler which has been written for this exact use-case. Moreover, few optimizations can be implemented to speed up the code even more (whatever the programming language used):
- conditionals are generally slow, you can move them in loops performing only the borders
- writing zeros initially in the output matrix is not required and is actually slower
- Using 2D direct indexing is cleaner and likely a bit faster
- integers can be used instead of floating-point numbers since the output contains only integers and computing integers is faster than computing the same operation with floating-point numbers.
import numba as nb
@nb.njit(['int32[:,::1](int32[:,::1],int32)', 'int64[:,::1](int64[:,::1],int64)'])
def calculate_output(input,N):
output = np.empty((N, N), input.dtype)
for x in range(0,N):
val2 = 0 if x-1<0 else output[0,x-1]+input[0,x]
output[0,x] = max(0,val2)
for y in range(1,N):
val1 = 0 if y-1<0 else output[y-1,0]+input[y,0]
output[y,0] = max(val1,0)
for y in range(1,N):
for x in range(1,N):
val1 = output[y-1,x]+input[y,x]
val2 = output[y,x-1]+input[y,x]
output[y,x] = max(val1,val2)
return output
The resulting calculate_output call is 730 times faster on my machine.
output[y][x] = input[y][x] + max(val1,val2)instead of adding the input in the conditional, since we count the weight of the starting square.