
How can I sum the values of array b grouped by the unique values of array a, assuming both arrays have the same shape?

In other words, I expect an output consisting of the sum of the elements of b for each distinct value in a. (In the example below: sum for value 1 = xxx, sum for value 2 = yyy, ..., sum for value 11 = zzz.)

a = [[ 5  1 10 11  6]
     [ 5  3  8 10  9]
     [ 2  1 10  8  7]
     [ 7 10  7  8 11]
     [10 10  3  0 11]]
b = [[508 220 316 557 737]
    [625 419 161 736 426]
    [389 608 760 885 232] 
    [396 309 522 204 842]
    [403 831 225 549 797]]
  • could you rephrase your question, hard to understand. Commented Dec 23, 2018 at 17:11
  • ok, post the expected result Commented Dec 23, 2018 at 17:13

3 Answers


You can do that using numpy:

import numpy as np

a = np.array(
    [[ 5,  1, 10, 11,  6],
     [ 5,  3,  8, 10,  9],
     [ 2,  1, 10,  8,  7],
     [ 7, 10,  7,  8, 11],
     [10, 10,  3,  0, 11]])
b = np.array(
    [[508, 220, 316, 557, 737],
    [625, 419, 161, 736, 426],
    [389, 608, 760, 885, 232],
    [396, 309, 522, 204, 842],
    [403, 831, 225, 549, 797]])

values = np.unique(a)
# will be [ 0  1  2  3  5  6  7  8  9 10 11]

out = {}
for value in values:
    out[value] = sum(b[np.where(a==value)])

print(out)
# {0: 549, 1: 828, 2: 389, 3: 644, 5: 1133, 6: 737, 7: 1150, 8: 1250, 9: 426, 10: 3355, 11: 2196}

Or with a dict comprehension, all in one line:

out = {value: sum(b[np.where(a==value)]) for value in np.unique(a)}
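The loop above builds one boolean mask per unique value, which gets slow when there are many distinct values. A possible vectorized alternative, offered here only as a sketch and not as part of the original answer, is to combine np.unique(..., return_inverse=True) with np.bincount(..., weights=...). The arrays are flattened first so the inverse indices are one-dimensional regardless of NumPy version. Note that np.bincount accumulates weights as floats, so the cast back to integers assumes the sums stay small enough to be represented exactly.

```python
import numpy as np

a = np.array(
    [[ 5,  1, 10, 11,  6],
     [ 5,  3,  8, 10,  9],
     [ 2,  1, 10,  8,  7],
     [ 7, 10,  7,  8, 11],
     [10, 10,  3,  0, 11]])
b = np.array(
    [[508, 220, 316, 557, 737],
     [625, 419, 161, 736, 426],
     [389, 608, 760, 885, 232],
     [396, 309, 522, 204, 842],
     [403, 831, 225, 549, 797]])

# values: sorted unique entries of a
# inverse: for each flattened element of a, its position in `values`
values, inverse = np.unique(a.ravel(), return_inverse=True)

# Add up the elements of b that share the same position in `values`.
# bincount returns floats when weights are given, hence the cast back.
sums = np.bincount(inverse, weights=b.ravel()).astype(b.dtype)

out = dict(zip(values.tolist(), sums.tolist()))
print(out)
# {0: 549, 1: 828, 2: 389, 3: 644, 5: 1133, 6: 737, 7: 1150, 8: 1250, 9: 426, 10: 3355, 11: 2196}
```

This makes a single pass over the data instead of one pass per unique value.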


Pandas offers a direct and efficient way to do this:

import pandas as pd

# a and b are the NumPy arrays from the question
df = pd.DataFrame(data=b.ravel(), index=a.ravel())
sums = df.groupby(level=0).sum()

#        0
# 0    549
# 1    828
# 2    389
# 3    644
# 5   1133
# 6    737
# 7   1150
# 8   1250
# 9    426
# 10  3355
# 11  2196

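Benchmarks:

a = np.random.randint(0, 10**4, size=10**5)
b = np.random.randint(0, 10**6, size=10**5)
# the timed loops below need these containers to exist beforehand
result = defaultdict(int)
out = {}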

In [19]: %timeit pd.DataFrame(b,a).groupby(level=0).sum()
58.7 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [20]: %timeit for aa, bb in zip(a,b):result[aa] += bb
223 ms ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [21]: %timeit for value in np.unique(a): out[value] = np.sum(b[np.where(a==value)])
5.67 s ± 933 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
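The timings above come from an interactive IPython session. To re-run the comparison of the two loop-based approaches as a plain script, something like the following sketch could be used. The function names dict_loop and unique_loop, the fixed seed, and the smaller array size are assumptions chosen here for illustration, and the first loop converts the arrays with tolist() before iterating, which differs slightly from the timed one-liner. Absolute numbers will vary by machine, so no timings are quoted.

```python
import timeit
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption, for repeatability)
n = 10**4                       # smaller than the 10**5 above so the slow loop stays quick
a = rng.integers(0, 10**3, size=n)
b = rng.integers(0, 10**6, size=n)


def dict_loop(a, b):
    """Accumulate sums element by element in pure Python."""
    result = defaultdict(int)
    for aa, bb in zip(a.tolist(), b.tolist()):
        result[aa] += bb
    return dict(result)


def unique_loop(a, b):
    """Build one boolean mask per unique value of a and sum the matching b."""
    return {int(value): int(b[a == value].sum()) for value in np.unique(a)}


# Both approaches should produce identical sums before their speed is compared.
same = dict_loop(a, b) == unique_loop(a, b)
print("results agree:", same)

for fn in (dict_loop, unique_loop):
    # Best of three single runs, reported in milliseconds.
    seconds = min(timeit.repeat(lambda: fn(a, b), number=1, repeat=3))
    print(f"{fn.__name__}: {seconds * 1e3:.1f} ms")
```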

1 Comment

Thanks! I'm using the solution you recommended. However, I'm working with a Monte Carlo simulation (10000 DataFrames are supposed to be created when I run the script), and it is raising a memory error. Any idea how to overcome this problem? I'm using Python 2.7 32-bit.

Or manually:

from itertools import chain
from collections import defaultdict

a = [[ 5,  1, 10, 11,  6],
     [ 5,  3,  8, 10,  9],
     [ 2,  1, 10,  8,  7],
     [ 7, 10,  7,  8, 11],
     [10, 10,  3,  0, 11]]
b = [[508, 220, 316, 557, 737],
    [625, 419, 161, 736, 426],
    [389, 608, 760, 885, 232],
    [396, 309, 522, 204, 842],
    [403, 831, 225, 549, 797]]

result = defaultdict(int)

for aa, bb in zip(chain(*a), chain(*b)):
    result[aa] += bb

print(result)

#defaultdict(<class 'int'>, {5: 1133, 1: 828, 10: 3355, 11: 2196, 6: 737, 3: 644, 8: 1250, 9: 426, 2: 389, 7: 1150, 0: 549})
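The defaultdict above keeps keys in the order they are first seen. If the sorted order shown in the NumPy answer is preferred, the same loop can be wrapped in a small helper that returns a plain dict sorted by key. This is only a sketch: the name sum_by_key is hypothetical, and it assumes both inputs are lists of equally long rows.

```python
from collections import defaultdict
from itertools import chain


def sum_by_key(keys, values):
    """Sum `values` grouped by the matching entry in `keys` (both nested lists)."""
    totals = defaultdict(int)
    # chain.from_iterable flattens the rows without unpacking them as arguments
    for k, v in zip(chain.from_iterable(keys), chain.from_iterable(values)):
        totals[k] += v
    # Return a regular dict ordered by key
    return dict(sorted(totals.items()))


a = [[ 5,  1, 10, 11,  6],
     [ 5,  3,  8, 10,  9],
     [ 2,  1, 10,  8,  7],
     [ 7, 10,  7,  8, 11],
     [10, 10,  3,  0, 11]]
b = [[508, 220, 316, 557, 737],
     [625, 419, 161, 736, 426],
     [389, 608, 760, 885, 232],
     [396, 309, 522, 204, 842],
     [403, 831, 225, 549, 797]]

print(sum_by_key(a, b))
# {0: 549, 1: 828, 2: 389, 3: 644, 5: 1133, 6: 737, 7: 1150, 8: 1250, 9: 426, 10: 3355, 11: 2196}
```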
