Dropping array rows that DUPLICATE defined column elements of other array rows

Question

Consider the sample NumPy array below:

import numpy as np

arr = np.array([[1,2,5,  4,2,7,  5,2,9],
                [4,4,1,  4,2,0,  3,6,4],
                [1,2,1,  4,2,2,  5,2,0],
                [1,2,7,  2,4,1,  5,2,8],
                [1,2,9,  4,2,8,  5,2,1],
                [4,2,0,  4,4,1,  5,2,4],
                [4,4,0,  4,2,6,  3,6,6],
                [1,2,1,  4,2,2,  5,2,0]])

PROBLEM: We are concerned only with the first TWO columns of each element triplet. I want to remove rows whose first two elements in every triplet match those of an earlier row (in the same order).

In the example above, the rows with indices 0, 2, 4, and 7 are all of the form [1,2,_, 4,2,_, 5,2,_], so we should keep arr[0] and drop the other three. Similarly, arr[6] is dropped because it has the same pattern as arr[1], namely [4,4,_, 4,2,_, 3,6,_]. For the example given, the output should look like:

               [[1,2,5,  4,2,7,  5,2,9],
                [4,4,1,  4,2,0,  3,6,4],
                [1,2,7,  2,4,1,  5,2,8],
                [4,2,0,  4,4,1,  5,2,4]]

The part I'm struggling with most is that the solution should be general enough to handle arrays of 3, 6, 9, 12... columns (always a multiple of 3, and we are always interested in duplicates in the first two columns of each triplet).

What's the significance of the gap in columns? Is this array (8,9) or (8,3,3) shape?
– hpaulj, Commented Oct 4, 2020 at 23:34
Rather than focus on what you want to remove, pay more attention to what you want to keep. Even when you use a function like np.delete you are really constructing a new array with the selected rows or columns. So identifying what you want to keep (conversely drop) and actually creating the new array are separate steps.
– hpaulj, Commented Oct 4, 2020 at 23:37

Mark · Accepted Answer · 2020-10-04 23:45:40Z

3

If you can create an array with only the values you are interested in, you can pass that to np.unique(), which accepts return_index=True to also return the index of the first occurrence of each unique row.

One way to get the groups you want is to delete every third column. Pass that to np.unique() and get the indices:

import numpy as np

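arr = np.array([[1,2,5,  4,2,7,  5,2,9],
                [4,4,1,  4,2,0,  3,6,4],
                [1,2,1,  4,2,2,  5,2,0],
                [1,2,7,  2,4,1,  5,2,8],
                [1,2,9,  4,2,8,  5,2,1],
                [4,2,0,  4,4,1,  5,2,4],
                [4,4,0,  4,2,6,  3,6,6],
                [1,2,1,  4,2,2,  5,2,0]])

# Drop every third column (indices 2, 5, 8, ...) so only the columns of interest remain
unique_cols = np.delete(arr, slice(2, None, 3), axis=1)
# indices holds the first occurrence of each unique row, ordered by sorted row value
vals, indices = np.unique(unique_cols, axis=0, return_index=True)

# Sort the indices to keep the surviving rows in their original order
arr[np.sort(indices)]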

output:

array([[1, 2, 5, 4, 2, 7, 5, 2, 9],
       [4, 4, 1, 4, 2, 0, 3, 6, 4],
       [1, 2, 7, 2, 4, 1, 5, 2, 8],
       [4, 2, 0, 4, 4, 1, 5, 2, 4]])
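
Since the question asks for something that works for 3, 6, 9, 12... columns, here is one way the same idea could be wrapped in a function. This is only a sketch: the function name drop_duplicate_rows and the parameters group and keep are made up for illustration and are not part of NumPy. Instead of np.delete, it reshapes each row into blocks of `group` columns and slices off the leading `keep` columns, which should give the same key columns as deleting every third column.

```python
import numpy as np

def drop_duplicate_rows(arr, group=3, keep=2):
    """Hypothetical helper: keep the first row for each distinct pattern
    found in the first `keep` columns of every `group`-column block."""
    n_rows, n_cols = arr.shape
    if n_cols % group != 0:
        raise ValueError("number of columns must be a multiple of `group`")
    n_blocks = n_cols // group
    # View each row as blocks of `group` columns, keep the leading `keep`
    # columns of every block, then flatten back to one key row per input row.
    keys = arr.reshape(n_rows, n_blocks, group)[:, :, :keep]
    keys = keys.reshape(n_rows, n_blocks * keep)
    # return_index gives the first occurrence of each unique key row,
    # ordered by sorted key, so sort the indices to restore input order.
    _, first_idx = np.unique(keys, axis=0, return_index=True)
    return arr[np.sort(first_idx)]

# A 6-column example (two triplets per row)
six_cols = np.array([[1, 2, 5, 4, 2, 7],
                     [4, 4, 1, 4, 2, 0],
                     [1, 2, 1, 4, 2, 2],
                     [4, 4, 0, 4, 2, 6],
                     [1, 2, 7, 2, 4, 1]])
result = drop_duplicate_rows(six_cols)
print(result)
```

Here rows 2 and 3 repeat the patterns of rows 0 and 1, so rows 0, 1, and 4 are the ones kept. Calling the function on the 9-column arr from the question should give the same four rows as the output above.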

answered Oct 4, 2020 at 23:45


1 Comment

user109387 Over a year ago

Works very efficiently on large arrays.
