
I am trying to remove duplicate elements from a numpy array.

Eg:
a = np.array([[0.03,0.32],[0.09,0.26],[0.03,0.32]])
a = np.unique(a,axis=0)

This works perfectly on its own. But this code is part of a function, and I run that function, say, 10 times. On some runs the system hangs at exactly this line. The array is at most 3500 rows, and each row (inner array) has length 60. Why is this happening, and is there a more efficient way?

2 Comments

  • np.unique(np.random.random((3500, 60)), axis=0) works perfectly fine. Can you provide some data which reproduces your problem? Commented Oct 28, 2018 at 16:22
  • Do you really need python3 AND python2 compatible code? If not, remove the wrong one and add the "pure" python tag: python Commented Oct 28, 2018 at 17:42

1 Answer


There are quite a few issues with what you're doing.

First, observe that np.unique compares floats for exact equality, so in general it will not filter out rows of floats that are only approximately equal:

In [16]: a = np.array([[2.1*3, .123], [6.3, 2.05*.06]])

In [17]: a
Out[17]: 
array([[6.3  , 0.123],
       [6.3  , 0.123]])

In [18]: np.unique(a, axis=0)
Out[18]: 
array([[6.3  , 0.123],
       [6.3  , 0.123]])

Note that the duplicates are still in the result after calling np.unique. This is because np.unique compares on exact equality, meaning the floats must match bit for bit. However, floating point arithmetic is not exact, so you are not guaranteed to filter out duplicates correctly.
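One workaround (a sketch, assuming rounding to a fixed number of decimals is an acceptable notion of "equal" for your data) is to round before calling np.unique, so values that differ only by floating-point noise become bit-identical:

```python
import numpy as np

a = np.array([[2.1 * 3, 0.123], [6.3, 2.05 * 0.06]])

# Round to a chosen tolerance (here 8 decimal places) so that rows
# differing only by floating-point noise compare exactly equal.
deduped = np.unique(np.round(a, decimals=8), axis=0)
print(deduped)  # only one row survives
```

The choice of `decimals` is a tolerance knob: too coarse and genuinely distinct rows merge, too fine and near-duplicates survive.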

Secondly, in terms of performance, you can do better than np.unique with a hashable type. np.unique will always run in O(n log n) since it does a sort. You can verify this in the source code:

if optional_indices:
    perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
    aux = ar[perm]
else:
    ar.sort()
    aux = ar

So, regardless of how the conditional evaluates, a sort is performed over ar (which is the input array; see the source for more detail: https://github.com/numpy/numpy/blob/v1.15.0/numpy/lib/arraysetops.py#L277). This is because np.unique supports a rich set of functionality (like getting the indices of duplicates, returning the counts of duplicates, etc.).
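For instance, a small illustration of that extra functionality on the array from the question:

```python
import numpy as np

a = np.array([[0.03, 0.32], [0.09, 0.26], [0.03, 0.32]])

# return_index gives the first position of each unique row in `a`;
# return_counts gives how many times each unique row occurs.
uniq, idx, counts = np.unique(a, axis=0, return_index=True, return_counts=True)
print(uniq)    # [[0.03 0.32]
               #  [0.09 0.26]]
print(idx)     # [0 1]
print(counts)  # [2 1]
```

That bookkeeping is what the sort pays for; a plain hash-based dedup cannot give you counts and first-occurrence indices as cheaply.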

You don't have to sort to get unique elements. If you convert your rows to a hashable type (like tuple), then you can filter out duplicates in O(n) expected time. Here's an example:

In [37]: b
Out[37]: 
[(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)]

In [39]: np.unique(b, axis=0)
Out[39]: array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])

In [40]: set(b)
Out[40]: {(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)}

In [41]: %timeit np.unique(b, axis=0)
21.9 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [42]: %timeit set(b)
627 ns ± 5.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

So, as you can see, just using the built-in set runs about 30x faster than np.unique. Note that this will not work correctly for arrays of floats (tuples of floats are still compared for exact equality); I just wanted to show that np.unique is not particularly performant from an algorithmic perspective.
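Combining the two points above, here's a sketch of how you could deduplicate float rows in linear expected time, again under the assumption that rounding to a fixed tolerance is acceptable (the helper `unique_rows` is hypothetical, not a numpy function):

```python
import numpy as np

def unique_rows(arr, decimals=8):
    """Deduplicate rows of a 2-D float array in O(n) expected time.

    Rows are compared after rounding to `decimals` places, so values that
    differ only by floating-point noise are treated as equal. The order of
    first occurrence is preserved (unlike np.unique, which sorts).
    """
    seen = set()
    out = []
    for row in arr:
        key = tuple(np.round(row, decimals))  # hashable, tolerance-rounded
        if key not in seen:
            seen.add(key)
            out.append(row)
    return np.array(out)

a = np.array([[2.1 * 3, 0.123], [6.3, 2.05 * 0.06], [0.5, 0.5]])
print(unique_rows(a))  # the two near-duplicate rows collapse into one
```

The set membership test replaces the sort, which is where the asymptotic win comes from.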

Lastly, 3500x60 is not really that big. You can loop through that pretty easily, even with a subpar algorithm, and it should not hang on any modern hardware. It should run pretty fast:

In [43]: np.random.seed(0)

In [46]: x = np.random.random((3500, 60))

In [49]: %timeit np.unique(x, axis=0)
2.57 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So it takes 2.57 milliseconds on my MacBook Pro, which isn't exactly a powerhouse in terms of hardware (2.3 GHz i5, 8GB of RAM). Make sure you're profiling your code, and make sure that the line in this question is actually the trouble line.

HTH.


1 Comment

Thank you for such a detailed explanation. It really helped. I understand 3500x60 is not that big; that is why it got me worried. So now, for floats, how can we achieve this better than using np.unique?
