I have two 2D NumPy arrays, arr_all and arr_sub, where the second is a random subset of the rows of the first. I need the rows of arr_all that are not included in arr_sub, matched on an ID stored in one column that exists in both arrays. For example:

arr_all = array([[x, y, z, id_1],
                 [x, y, z, id_2],
                 [x, y, z, id_3],
                 [x, y, z, id_4],
                 [x, y, z, id_5]])

arr_sub = array([[x, y, z, id_1],
                 [x, y, z, id_2],
                 [x, y, z, id_5]])

Wanted output:

arr_remain = array([[x, y, z, id_3],
                    [x, y, z, id_4]])

A working solution would be:

list_remain = []
for i in range(len(arr_all)):
    # keep the row if its ID (column 3) does not occur in arr_sub
    if arr_all[i][3] not in arr_sub[:, 3]:
        list_remain.append(arr_all[i])

arr_remain = np.array(list_remain)

However, this solution is only usable for small datasets because of its horribly slow runtime: every iteration scans the whole ID column of the subset again. Since my original dataset contains over 26 million rows, this is not sufficient.

I tried to adapt solutions like this, this or this, but I didn't manage to add the check for whether the ID exists in the other array's column.

1 Answer

Here's one way:

arr_remain = arr_all[~np.in1d(arr_all[:,-1], arr_sub[:,-1])]
# or arr_remain = arr_all[~np.isin(arr_all[:,-1], arr_sub[:,-1])]
OUTPUT:
array([['x', 'y', 'z', 'id_3'],
       ['x', 'y', 'z', 'id_4']], dtype='<U4')
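
For anyone who wants to try this out, here is a minimal, self-contained sketch. The sample values and the integer IDs are made up for illustration (they are not the asker's data), and it assumes the ID sits in the last column of both arrays. It uses np.isin, since np.in1d is deprecated as of NumPy 2.0.

```python
import numpy as np

# Hypothetical sample data: x, y, z coordinates plus a numeric ID in the last column
arr_all = np.array([
    [0.0, 0.1, 0.2, 1],
    [1.0, 1.1, 1.2, 2],
    [2.0, 2.1, 2.2, 3],
    [3.0, 3.1, 3.2, 4],
    [4.0, 4.1, 4.2, 5],
])

# Subset made of the rows with IDs 1, 2 and 5
arr_sub = arr_all[[0, 1, 4]]

# Boolean mask: True where a row's ID also occurs in arr_sub
in_sub = np.isin(arr_all[:, -1], arr_sub[:, -1])

# Invert the mask to keep only the rows whose ID is missing from arr_sub
arr_remain = arr_all[~in_sub]
print(arr_remain)
```

The membership test runs once over the whole ID column instead of once per row in a Python loop, which is where the speedup should come from.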
1 Comment

Thanks a lot! This is way faster. Just one note: my IDE complained about in1d and preferred isin, which seems to be the more recent function for this task.