
I have two numpy arrays: one containing values and one containing each value's category.

import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])

I have another array containing the unique categories I'd like to sum across.

categories = np.array([101, 102, 201, 202, 301, 302])

My issue is that I will be running this same summing process a few billion times and every microsecond matters.

My current implementation is as follows.

catsums = []
for x in categories:
    catsums.append(np.sum(values[np.where(valcats == x)]))

The resulting catsums should be:

[1, 14, 7, 8, 12, 13]

My current run time is about 5 µs. I am still somewhat new to Python and was hoping to find a faster solution, perhaps by combining the first two arrays, or a lambda, or something cool I don't even know about.

Thanks for reading!


2 Answers


You can use searchsorted and bincount -

np.bincount(np.searchsorted(categories, valcats), values)
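For reference, a self-contained run of this one-liner on the arrays from the question (note that np.bincount returns floats when given weights, so a cast is added here for a clean integer result):

import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])
categories = np.array([101, 102, 201, 202, 301, 302])

# Map each category code to its position in the (sorted) categories array,
# then sum the values that fall into each bin.
idx = np.searchsorted(categories, valcats)
catsums = np.bincount(idx, weights=values).astype(values.dtype)
print(catsums)  # [ 1 14  7  8 12 13]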

5 Comments

What would you add if the categories array wasn't already sorted?
@piRSquared I would sort it and then feed it into the solution.
I was thinking more along the lines of passing the sorter parameter, e.g. sorter=categories.argsort() (a sketch of that variant follows below).
@piRSquared There's a lot of overhead involved. Don't think it's worth it for a case like this.
@hrschbck You can do: ids = np.searchsorted(categories, valcats) and then df.groupby('ids')['values'].max().values, if you were looking for an easy way. For sum, you would reuse ids: np.bincount(ids, values).
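A minimal sketch of the sorter-based variant discussed above, assuming the categories array were not already sorted. The categories_unsorted array and the variable names are purely illustrative, not from the original answer:

# Hypothetical unsorted categories array, for illustration only.
categories_unsorted = np.array([301, 101, 202, 302, 201, 102])

order = categories_unsorted.argsort()
# sorter= lets searchsorted work against the unsorted array;
# the returned indices refer to positions in the sorted view.
idx = np.searchsorted(categories_unsorted, valcats, sorter=order)
sums_sorted = np.bincount(idx, weights=values)

# Scatter the per-bin sums back to the original category order.
catsums = np.empty_like(sums_sorted)
catsums[order] = sums_sorted
print(catsums)  # sums aligned with categories_unsorted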

@Divakar just posted a very good answer. If you already have the array of categories defined, I'd use @Divakar's answer. If you don't have the unique values already defined, I'd use mine.


I'd use pd.factorize to factorize the categories. Then use np.bincount with the weights parameter set to the values array.

import pandas as pd

f, u = pd.factorize(valcats)
np.bincount(f, values).astype(values.dtype)

array([ 1, 12,  7, 14, 13,  8])

pd.factorize also produces the unique values in the u variable. We can line up the results with u to see that we've arrived at the correct solution.

np.column_stack([u, np.bincount(f, values).astype(values.dtype)])

array([[101,   1],
       [301,  12],
       [201,   7],
       [102,  14],
       [302,  13],
       [202,   8]])

You can make this more obvious using a pd.Series:

f, u = pd.factorize(valcats)
pd.Series(np.bincount(f, values).astype(values.dtype), u)

101     1
301    12
201     7
102    14
302    13
202     8
dtype: int64

Why pd.factorize and not np.unique?

We could have done this equivalently with

 u, f = np.unique(valcats, return_inverse=True)

But np.unique sorts the values, which takes O(n log n) time. pd.factorize, on the other hand, does not sort and runs in linear time. For larger data sets, pd.factorize will dominate performance.
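A rough way to check that claim on your own machine (the array size and repeat count here are purely illustrative, and timings will vary):

from timeit import timeit

import numpy as np
import pandas as pd

big = np.random.randint(0, 1000, size=1_000_000)

# pd.factorize: order-of-appearance codes, no sorting of the data.
t_factorize = timeit(lambda: pd.factorize(big), number=10)
# np.unique with return_inverse=True: sorts, hence O(n log n).
t_unique = timeit(lambda: np.unique(big, return_inverse=True), number=10)

print(t_factorize, t_unique)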

