
I am trying to find a vectorized way to accomplish the following:

Say I have an array of x and y values. Note that the x values are not always integers and can be negative:

import numpy as np
x = np.array([-1,-1,-1,3,2,2,2,5,4,4], dtype=float)
y = np.array([0,1,0,1,0,1,0,1,0,1])

I want to group the y array by the sorted, unique values of the x array and summarize the counts for each y class. So the example above would look like this:

array([[ 2.,  1.],
       [ 2.,  1.],
       [ 0.,  1.],
       [ 1.,  1.],
       [ 0.,  1.]])

Where the first column represents the count of '0' values for each unique value of x and the second column represents the count of '1' values for each unique value of x.

My current implementation looks like this:

x_sorted, y_sorted = x[x.argsort()], y[x.argsort()]

def collapse(x_sorted, y_sorted):
    uniq_ids = np.unique(x_sorted, return_index=True)[1]
    y_collapsed = np.zeros((len(uniq_ids), 2))
    x_collapsed = x_sorted[uniq_ids]
    for idx, y in enumerate(np.split(y_sorted, uniq_ids[1:])):
        y_collapsed[idx, 0] = (y == 0).sum()
        y_collapsed[idx, 1] = (y == 1).sum()
    return (x_collapsed, y_collapsed)

collapse(x_sorted, y_sorted)
(array([-1.,  2.,  3.,  4.,  5.]),
 array([[ 2.,  1.],
        [ 2.,  1.],
        [ 0.,  1.],
        [ 1.,  1.],
        [ 0.,  1.]]))

This doesn't seem very much in the spirit of numpy, however, and I'm hoping some vectorized method exists for this kind of operation. I am trying to do this without resorting to pandas, although I know that library has a very convenient groupby operation.
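
(For reference, the pandas version I am trying to avoid would be roughly the line below, applied to the x and y above; I would like an equivalent that stays within numpy.)

import pandas as pd
pd.crosstab(x, y).values  # rows: sorted unique x, columns: counts of y == 0 and y == 1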

5 Answers


Since x is float, I would do this:

In [136]:

np.array([(x[y==0]==np.unique(x)[..., np.newaxis]).sum(axis=1),
          (x[y==1]==np.unique(x)[..., np.newaxis]).sum(axis=1)]).T
Out[136]:
array([[2, 1],
       [2, 1],
       [0, 1],
       [1, 1],
       [0, 1]])

Speed:

In [152]:

%%timeit
ux=np.unique(x)[..., np.newaxis]
np.array([(x[y==0]==ux).sum(axis=1),
          (x[y==1]==ux).sum(axis=1)]).T
10000 loops, best of 3: 92.7 µs per loop

Solution from @seikichi:

In [151]:

%%timeit
x = np.array([1.1, 1.1, 1.1, 3.3, 2.2, 2.2, 2.2, 5.5, 4.4, 4.4])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
r = np.r_[np.unique(x), np.inf]
np.concatenate([[np.histogram(x[y == v], r)[0]] for v in sorted(set(y))]).T
1000 loops, best of 3: 388 µs per loop

For more general cases when y is not just {0,1}, as @askewchan pointed out:

In [155]:

%%timeit
ux=np.unique(x)[..., np.newaxis]
uy=np.unique(y)
np.asanyarray([(x[y==v]==ux).sum(axis=1) for v in uy]).T
10000 loops, best of 3: 116 µs per loop

To explain the broadcasting further, see this example (here a is just an arbitrary float array):

In [5]:

np.unique(a)
Out[5]:
array([ 0. ,  0.2,  0.4,  0.5,  0.6,  1.1,  1.5,  1.6,  1.7,  2. ])
In [8]:

np.unique(a)[...,np.newaxis] #what [..., np.newaxis] will do:
Out[8]:
array([[ 0. ],
       [ 0.2],
       [ 0.4],
       [ 0.5],
       [ 0.6],
       [ 1.1],
       [ 1.5],
       [ 1.6],
       [ 1.7],
       [ 2. ]])
In [10]:

(a==np.unique(a)[...,np.newaxis]).astype('int') #then we can broadcast (converted to int for readability)
Out[10]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0]])
In [11]:

(a==np.unique(a)[...,np.newaxis]).sum(axis=1) #counting each unique value becomes a sum along the 2nd axis
Out[11]:
array([1, 3, 1, 1, 2, 1, 1, 1, 1, 3])

7 Comments

You run np.unique(x) twice with each call. Also you did the sorted(set(y)) manually, which only works for small sets of y.
Well you included timings :P
Thanks, this is the one I'm accepting as it is indeed faster for my use case. Can you explain or point me to where I can read about the [..., np.newaxis] pattern? I have never seen that before.
It is equivalent to a.shape = a.shape + (1,) so that you can "broadcast" with it. Very useful, so play around with it: docs.scipy.org/doc/numpy/user/basics.broadcasting.html
@Zelazny7, besides askewchan's link, I also added an example. See above.

How about the following code? (using numpy.bincount and numpy.concatenate)

>>> import numpy as np
>>> x = np.array([1,1,1,3,2,2,2,5,4,4])
>>> y = np.array([0,1,0,1,0,1,0,1,0,1])
>>> xmax = x.max()
>>> np.concatenate([[np.bincount(x[y == v], minlength=xmax + 1)] for v in sorted(set(y))], axis=0)[:, 1:].T
array([[2, 1],
       [2, 1],
       [0, 1],
       [1, 1],
       [0, 1]])
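
To see where the [:, 1:] slice comes from, these are the two intermediate bincount rows for this example; index 0 is empty because the integer x values start at 1:

>>> np.bincount(x[y == 0], minlength=xmax + 1)
array([0, 2, 2, 0, 1, 0])
>>> np.bincount(x[y == 1], minlength=xmax + 1)
array([0, 1, 1, 1, 1, 1])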

UPDATE: Thanks @askewchan!

>>> import numpy as np
>>> x = np.array([1.1, 1.1, 1.1, 3.3, 2.2, 2.2, 2.2, 5.5, 4.4, 4.4])
>>> y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
>>> r = np.r_[np.unique(x), np.inf]
>>> np.array([np.histogram(x[y == v], r)[0] for v in sorted(set(y))]).T
array([[2, 1],
       [2, 1],
       [0, 1],
       [1, 1],
       [0, 1]])

3 Comments

Sorry, neglected to include that my x values are not necessarily int
You could replace bincount with np.histogram(x, np.r_[np.unique(x), np.inf])
Or even better: np.bincount(np.unique(x, return_inverse=True)[1])

np.unique and np.bincount are your friends here. The following should work for any type of input, not necessarily small consecutive integers:

>>> x = np.array([1, 1, 1, 3, 2, 2, 2, 5, 4, 4])
>>> y = np.array([0, 1, 2, 2, 0, 1, 0, 2, 2, 1])
>>> 
>>> x_unq, x_idx = np.unique(x, return_inverse=True)
>>> y_unq, y_idx = np.unique(y, return_inverse=True)
>>> 
>>> np.column_stack([np.bincount(x_idx, y_idx == j) for j in range(len(y_unq))])
array([[ 1.,  1.,  1.],
       [ 2.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  1.,  1.],
       [ 0.,  0.,  1.]])

You can extract the row and column labels also:

>>> x_unq
array([1, 2, 3, 4, 5])
>>> y_unq
array([0, 1, 2])
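
If you also want to drop the Python-level loop over the y classes, the same index arrays admit a single bincount on a flattened (x, y) index. This is not part of the original answer, just a sketch of the same idea:

>>> nx, ny = len(x_unq), len(y_unq)
>>> np.bincount(x_idx * ny + y_idx, minlength=nx * ny).reshape(nx, ny)
array([[1, 1, 1],
       [2, 1, 0],
       [0, 0, 1],
       [0, 1, 1],
       [0, 0, 1]])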



I haven't tested this but I think it should work. Basically all I do is grab the values from y based on x being the value in question.

uniques = list(set(x))
uniques.sort()
lu = len(uniques)
res = np.zeros((lu, 2))
for i, v in enumerate(uniques):
    cur = y[x == v]
    s = cur.sum()
    res[i, 0] = len(cur) - s
    res[i, 1] = s

Another way would be to use numpy MaskedArrays; a rough sketch of that route is below.
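
What the masked-array version could look like (a sketch only, with the same loop as above but the mask doing the selection):

res = []
for v in np.unique(x):
    my = np.ma.masked_array(y, mask=(x != v))        # hide the y values outside this group
    res.append([(my == 0).sum(), (my == 1).sum()])   # sums ignore the masked entries
res = np.array(res)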



Here is another solution:

y = y[np.argsort(x)]

b = np.bincount(x)
b = b[b!=0]

ans = np.array([[i.shape[0], i.sum()] for i in np.split(y, np.cumsum(b))[:-1]])

ans[:,0] -= ans[:,1]

print(ans)
#array([[2, 1],
#       [2, 1],
#       [0, 1],
#       [1, 1],
#       [0, 1]], dtype=int64)
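
Note that np.bincount(x) requires non-negative integers, so this works for the integer variant of x but not for the float, possibly negative x in the question. A sketch of the same idea that takes the group sizes from np.unique instead (return_counts needs numpy >= 1.9):

y_sorted = y[np.argsort(x)]
_, sizes = np.unique(x, return_counts=True)           # group sizes, in sorted-x order
groups = np.split(y_sorted, np.cumsum(sizes)[:-1])    # one chunk of y per unique x
ans = np.array([[g.shape[0] - g.sum(), g.sum()] for g in groups])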

Timing:

 @seikichi solution:
 10000 loops, best of 3: 37.2 µs per loop

 @acushner solution:
 10000 loops, best of 3: 65.4 µs per loop

 @SaulloCastro solution:
 10000 loops, best of 3: 154 µs per loop

