
I am trying to remove duplicate elements from a numpy array.

Eg:
a = np.array([[0.03,0.32],[0.09,0.26],[0.03,0.32]])
a = np.unique(a,axis=0)

This works perfectly on its own. But this code is part of a function, and I run that function, say, 10 times. On some runs the system hangs at exactly this line. The array is at most 3500 rows, and each row (inner array) has length 60. Why is this happening, and is there a more efficient way?

2 Comments

  • np.unique(np.random.random((3500, 60)), axis=0) works perfectly fine. Can you provide some data which reproduces your problem? Commented Oct 28, 2018 at 16:22
  • Do you really need python3 AND python2 compatible code? If not, remove the wrong one and add the "pure" python tag: python Commented Oct 28, 2018 at 17:42

1 Answer


There are quite a few issues with what you're doing.

First, observe that np.unique compares floats for exact equality, so in general it will not filter out rows of floats that are only approximately equal:

In [16]: a = np.array([[2.1*3, .123], [6.3, 2.05*.06]])

In [17]: a
Out[17]: 
array([[6.3  , 0.123],
       [6.3  , 0.123]])

In [18]: np.unique(a, axis=0)
Out[18]: 
array([[6.3  , 0.123],
       [6.3  , 0.123]])

Note that the duplicates are still in the result after calling np.unique. This is because np.unique compares on exact equality, meaning the floats must match bit for bit. However, floating point arithmetic is not exact, so you are not guaranteed to filter out duplicates correctly.
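One workaround (a sketch, assuming rounding to a fixed number of decimals is an acceptable notion of "equal" for your data) is to round before calling np.unique, so values that differ only by floating-point noise become bit-identical:

```python
import numpy as np

a = np.array([[2.1 * 3, 0.123], [6.3, 2.05 * 0.06]])

# Round to a chosen tolerance (here 8 decimal places) so that rows
# differing only by floating-point noise compare exactly equal.
deduped = np.unique(np.round(a, decimals=8), axis=0)
print(deduped)  # only one row survives
```

The choice of `decimals` is a tolerance knob: too coarse and genuinely distinct rows merge, too fine and near-duplicates survive.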

Secondly, in terms of performance, you can do better than np.unique with a hashable type. np.unique will always run in O(n log n) since it does a sort. You can verify this in the source code:

if optional_indices:
    perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
    aux = ar[perm]
else:
    ar.sort()
    aux = ar

So, regardless of how the conditional evaluates, a sort is performed over ar (which is the input array; see the source for more detail: https://github.com/numpy/numpy/blob/v1.15.0/numpy/lib/arraysetops.py#L277). This is because np.unique supports a rich set of functionality (like getting the indices of duplicates, returning the counts of duplicates, etc.).
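For instance, a small illustration of that extra functionality on the array from the question:

```python
import numpy as np

a = np.array([[0.03, 0.32], [0.09, 0.26], [0.03, 0.32]])

# return_index gives the first position of each unique row in `a`;
# return_counts gives how many times each unique row occurs.
uniq, idx, counts = np.unique(a, axis=0, return_index=True, return_counts=True)
print(uniq)    # [[0.03 0.32]
               #  [0.09 0.26]]
print(idx)     # [0 1]
print(counts)  # [2 1]
```

That bookkeeping is what the sort pays for; a plain hash-based dedup cannot give you counts and first-occurrence indices as cheaply.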

You don't have to sort to get unique elements. If you convert your rows to a hashable type (like tuple), then you can filter out duplicates in O(n) expected time. Here's an example:

In [37]: b
Out[37]: 
[(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)]

In [39]: np.unique(b, axis=0)
Out[39]: array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])

In [40]: set(b)
Out[40]: {(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)}

In [41]: %timeit np.unique(b, axis=0)
21.9 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [42]: %timeit set(b)
627 ns ± 5.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

So, as you can see, just using the built-in set runs about 30x faster than np.unique. Note that this will not work correctly for arrays of floats (tuples of floats are still compared for exact equality); I just wanted to show that np.unique is not particularly performant from an algorithmic perspective.
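Combining the two points above, here's a sketch of how you could deduplicate float rows in linear expected time, again under the assumption that rounding to a fixed tolerance is acceptable (the helper `unique_rows` is hypothetical, not a numpy function):

```python
import numpy as np

def unique_rows(arr, decimals=8):
    """Deduplicate rows of a 2-D float array in O(n) expected time.

    Rows are compared after rounding to `decimals` places, so values that
    differ only by floating-point noise are treated as equal. The order of
    first occurrence is preserved (unlike np.unique, which sorts).
    """
    seen = set()
    out = []
    for row in arr:
        key = tuple(np.round(row, decimals))  # hashable, tolerance-rounded
        if key not in seen:
            seen.add(key)
            out.append(row)
    return np.array(out)

a = np.array([[2.1 * 3, 0.123], [6.3, 2.05 * 0.06], [0.5, 0.5]])
print(unique_rows(a))  # the two near-duplicate rows collapse into one
```

The set membership test replaces the sort, which is where the asymptotic win comes from.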

Lastly, 3500x60 is not really that big. You can loop through that pretty easily, even with a subpar algorithm, and it should not hang on any modern hardware. It should run pretty fast:

In [43]: np.random.seed(0)

In [46]: x = np.random.random((3500, 60))

In [49]: %timeit np.unique(x, axis=0)
2.57 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So it takes 2.57 milliseconds on my MacBook Pro, which isn't exactly a powerhouse in terms of hardware (2.3 GHz i5, 8GB of RAM). Make sure you're profiling your code, and make sure that the line in this question is actually the trouble line.

HTH.


1 Comment

Thank you for such a detailed explanation. It really helped. I understand 3500x60 is not that big; that is why it got me worried. So now, for floats, how can we achieve this better than using np.unique?
