
I am trying to find a vectorized way to accomplish the following:

Say I have an array of x and y values. Note that the x values are not always integers and can be negative:

import numpy as np
x = np.array([-1,-1,-1,3,2,2,2,5,4,4], dtype=float)
y = np.array([0,1,0,1,0,1,0,1,0,1])

I want to group the y array by the sorted, unique values of the x array and summarize the counts for each y class. So the example above would look like this:

array([[ 2.,  1.],
       [ 2.,  1.],
       [ 0.,  1.],
       [ 1.,  1.],
       [ 0.,  1.]])

Where the first column represents the count of '0' values for each unique value of x and the second column represents the count of '1' values for each unique value of x.

My current implementation looks like this:

x_sorted, y_sorted = x[x.argsort()], y[x.argsort()]

def collapse(x_sorted, y_sorted):
    uniq_ids = np.unique(x_sorted, return_index=True)[1]
    y_collapsed = np.zeros((len(uniq_ids), 2))
    x_collapsed = x_sorted[uniq_ids]
    for idx, y in enumerate(np.split(y_sorted, uniq_ids[1:])):
        y_collapsed[idx, 0] = (y == 0).sum()
        y_collapsed[idx, 1] = (y == 1).sum()
    return (x_collapsed, y_collapsed)

collapse(x_sorted, y_sorted)
(array([-1.,  2.,  3.,  4.,  5.]),
 array([[ 2.,  1.],
        [ 2.,  1.],
        [ 0.,  1.],
        [ 1.,  1.],
        [ 0.,  1.]]))

This doesn't seem very much in the spirit of numpy, however, and I'm hoping some vectorized method exists for this kind of operation. I am trying to do this without resorting to pandas, although I know that library has a very convenient groupby operation.
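
(For reference, the pandas version I am trying to avoid would be roughly the line below, applied to the x and y above; I would like an equivalent that stays within numpy.)

import pandas as pd
pd.crosstab(x, y).values  # rows: sorted unique x, columns: counts of y == 0 and y == 1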

5 Answers


Since x is float, I would do this:

In [136]:

np.array([(x[y==0]==np.unique(x)[..., np.newaxis]).sum(axis=1),
          (x[y==1]==np.unique(x)[..., np.newaxis]).sum(axis=1)]).T
Out[136]:
array([[2, 1],
       [2, 1],
       [0, 1],
       [1, 1],
       [0, 1]])

Speed:

In [152]:

%%timeit
ux=np.unique(x)[..., np.newaxis]
np.array([(x[y==0]==ux).sum(axis=1),
          (x[y==1]==ux).sum(axis=1)]).T
10000 loops, best of 3: 92.7 µs per loop

Solution from @seikichi:

In [151]:

%%timeit
x = np.array([1.1, 1.1, 1.1, 3.3, 2.2, 2.2, 2.2, 5.5, 4.4, 4.4])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
r = np.r_[np.unique(x), np.inf]
np.concatenate([[np.histogram(x[y == v], r)[0]] for v in sorted(set(y))]).T
1000 loops, best of 3: 388 µs per loop

For more general cases when y is not just {0,1}, as @askewchan pointed out:

In [155]:

%%timeit
ux=np.unique(x)[..., np.newaxis]
uy=np.unique(y)
np.asanyarray([(x[y==v]==ux).sum(axis=1) for v in uy]).T
10000 loops, best of 3: 116 µs per loop

To explain the broadcasting further, see this example (here a is just an arbitrary float array):

In [5]:

np.unique(a)
Out[5]:
array([ 0. ,  0.2,  0.4,  0.5,  0.6,  1.1,  1.5,  1.6,  1.7,  2. ])
In [8]:

np.unique(a)[...,np.newaxis] #what [..., np.newaxis] will do:
Out[8]:
array([[ 0. ],
       [ 0.2],
       [ 0.4],
       [ 0.5],
       [ 0.6],
       [ 1.1],
       [ 1.5],
       [ 1.6],
       [ 1.7],
       [ 2. ]])
In [10]:

(a==np.unique(a)[...,np.newaxis]).astype('int') #then we can broadcast (converted to int for readability)
Out[10]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0]])
In [11]:

(a==np.unique(a)[...,np.newaxis]).sum(axis=1) #counting each unique value becomes a sum along the 2nd axis
Out[11]:
array([1, 3, 1, 1, 2, 1, 1, 1, 1, 3])

7 Comments

You run np.unique(x) twice with each call. Also you did the sorted(set(y)) manually, which only works for small sets of y.
Well you included timings :P
Thanks, this is the one I'm accepting as it is indeed faster for my use case. Can you explain or point me to where I can read about the [..., np.newaxis] pattern? I have never seen that before.
It is equivalent to a.shape = a.shape + (1,) so that you can "broadcast" with it. Very useful, so play around with it: docs.scipy.org/doc/numpy/user/basics.broadcasting.html
@Zelazny7, besides askewchan's link, I also added an example. See above.

How about the following code? (using numpy.bincount and numpy.concatenate)

>>> import numpy as np
>>> x = np.array([1,1,1,3,2,2,2,5,4,4])
>>> y = np.array([0,1,0,1,0,1,0,1,0,1])
>>> xmax = x.max()
>>> np.concatenate([[np.bincount(x[y == v], minlength=xmax + 1)] for v in sorted(set(y))], axis=0)[:, 1:].T
array([[2, 1],
       [2, 1],
       [0, 1],
       [1, 1],
       [0, 1]])
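
To see where the [:, 1:] slice comes from, these are the two intermediate bincount rows for this example; index 0 is empty because the integer x values start at 1:

>>> np.bincount(x[y == 0], minlength=xmax + 1)
array([0, 2, 2, 0, 1, 0])
>>> np.bincount(x[y == 1], minlength=xmax + 1)
array([0, 1, 1, 1, 1, 1])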

UPDATE: Thanks @askewchan!

>>> import numpy as np
>>> x = np.array([1.1, 1.1, 1.1, 3.3, 2.2, 2.2, 2.2, 5.5, 4.4, 4.4])
>>> y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
>>> r = np.r_[np.unique(x), np.inf]
>>> np.array([np.histogram(x[y == v], r)[0] for v in sorted(set(y))]).T
array([[2, 1],
       [2, 1],
       [0, 1],
       [1, 1],
       [0, 1]])

3 Comments

Sorry, neglected to include that my x values are not necessarily int
You could replace bincount with np.histogram(x, np.r_[np.unique(x), np.inf])
Or even better: np.bincount(np.unique(x, return_inverse=True)[1])

np.unique and np.bincount are your friends here. The following should work for any type of input, not necessarily small consecutive integers:

>>> x = np.array([1, 1, 1, 3, 2, 2, 2, 5, 4, 4])
>>> y = np.array([0, 1, 2, 2, 0, 1, 0, 2, 2, 1])
>>> 
>>> x_unq, x_idx = np.unique(x, return_inverse=True)
>>> y_unq, y_idx = np.unique(y, return_inverse=True)
>>> 
>>> np.column_stack([np.bincount(x_idx, y_idx == j) for j in range(len(y_unq))])
array([[ 1.,  1.,  1.],
       [ 2.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  1.,  1.],
       [ 0.,  0.,  1.]])

You can extract the row and column labels also:

>>> x_unq
array([1, 2, 3, 4, 5])
>>> y_unq
array([0, 1, 2])
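
If you also want to drop the Python-level loop over the y classes, the same index arrays admit a single bincount on a flattened (x, y) index. This is not part of the original answer, just a sketch of the same idea:

>>> nx, ny = len(x_unq), len(y_unq)
>>> np.bincount(x_idx * ny + y_idx, minlength=nx * ny).reshape(nx, ny)
array([[1, 1, 1],
       [2, 1, 0],
       [0, 0, 1],
       [0, 1, 1],
       [0, 0, 1]])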



I haven't tested this but I think it should work. Basically all I do is grab the values from y based on x being the value in question.

uniques = list(set(x))
uniques.sort()
lu = len(uniques)
res = np.zeros((lu, 2))
for i, v in enumerate(uniques):
    cur = y[x == v]
    s = cur.sum()
    res[i, 0] = len(cur) - s
    res[i, 1] = s

Another way would be to use numpy MaskedArrays; a rough sketch of that route is below.
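
What the masked-array version could look like (a sketch only, with the same loop as above but the mask doing the selection):

res = []
for v in np.unique(x):
    my = np.ma.masked_array(y, mask=(x != v))        # hide the y values outside this group
    res.append([(my == 0).sum(), (my == 1).sum()])   # sums ignore the masked entries
res = np.array(res)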



Here is another solution:

y = y[np.argsort(x)]

b = np.bincount(x)
b = b[b!=0]

ans = np.array([[i.shape[0], i.sum()] for i in np.split(y, np.cumsum(b))[:-1]])

ans[:,0] -= ans[:,1]

print(ans)
#array([[2, 1],
#       [2, 1],
#       [0, 1],
#       [1, 1],
#       [0, 1]], dtype=int64)
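
Note that np.bincount(x) requires non-negative integers, so this works for the integer variant of x but not for the float, possibly negative x in the question. A sketch of the same idea that takes the group sizes from np.unique instead (return_counts needs numpy >= 1.9):

y_sorted = y[np.argsort(x)]
_, sizes = np.unique(x, return_counts=True)           # group sizes, in sorted-x order
groups = np.split(y_sorted, np.cumsum(sizes)[:-1])    # one chunk of y per unique x
ans = np.array([[g.shape[0] - g.sum(), g.sum()] for g in groups])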

Timing:

 @seikichi solution:
 10000 loops, best of 3: 37.2 µs per loop

 @acushner solution:
 10000 loops, best of 3: 65.4 µs per loop

 @SaulloCastro solution:
 10000 loops, best of 3: 154 µs per loop

