Take the following array:

import numpy as np

arr_dupes = np.array(
    [
      ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 246),
      ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 246),
      ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 222),
      ('2017-09-13T11:04:00.000000',  1.32683,  1.32686,  1.32682,  1.32685,  1.32682,  1.32684,  1.3268 ,  1.32684,  97),
      ('2017-09-13T11:03:00.000000',  1.32664,  1.32684,  1.32663,  1.32683,  1.32664,  1.32683,  1.32661,  1.32682, 268),
      ('2017-09-13T11:02:00.000000',  1.3268 ,  1.32692,  1.3266 ,  1.32664,  1.32678,  1.32689,  1.32658,  1.32664, 299),
      ('2017-09-13T11:02:00.000000',  1.3268 ,  1.32692,  1.3266 ,  1.32664,  1.32678,  1.32689,  1.32658,  1.32664, 299),
      ('2017-09-13T11:01:00.000000',  1.32648,  1.32682,  1.32648,  1.3268 ,  1.32647,  1.32682,  1.32647,  1.32678, 322),
      ('2017-09-13T11:00:00.000000',  1.32647,  1.32649,  1.32628,  1.32648,  1.32644,  1.32651,  1.32626,  1.32647, 285)],
      dtype=[('date', '<M8[us]'), ('askopen', '<f8'), ('askhigh', '<f8'), ('asklow', '<f8'), ('askclose', '<f8'),
             ('bidopen', '<f8'), ('bidhigh', '<f8'), ('bidlow', '<f8'), ('bidclose', '<f8'), ('volume', '<i8')]
)

What is the fastest solution to remove duplicates, using the dates as an index and keeping the last value?

The pandas DataFrame equivalent is:

In [5]: df = pd.DataFrame(arr_dupes, index=arr_dupes['date'])
In [6]: df
Out[6]:
                                   date  askopen  askhigh   asklow  askclose  bidopen  bidhigh   bidlow  bidclose  volume
2017-09-13 11:05:00 2017-09-13 11:05:00  1.32685  1.32704  1.32682   1.32686  1.32684  1.32702  1.32679   1.32683     246
2017-09-13 11:05:00 2017-09-13 11:05:00  1.32685  1.32704  1.32682   1.32686  1.32684  1.32702  1.32679   1.32683     246
2017-09-13 11:05:00 2017-09-13 11:05:00  1.32685  1.32704  1.32682   1.32686  1.32684  1.32702  1.32679   1.32683     222
2017-09-13 11:04:00 2017-09-13 11:04:00  1.32683  1.32686  1.32682   1.32685  1.32682  1.32684  1.32680   1.32684      97
2017-09-13 11:03:00 2017-09-13 11:03:00  1.32664  1.32684  1.32663   1.32683  1.32664  1.32683  1.32661   1.32682     268
2017-09-13 11:02:00 2017-09-13 11:02:00  1.32680  1.32692  1.32660   1.32664  1.32678  1.32689  1.32658   1.32664     299
2017-09-13 11:02:00 2017-09-13 11:02:00  1.32680  1.32692  1.32660   1.32664  1.32678  1.32689  1.32658   1.32664     299
2017-09-13 11:01:00 2017-09-13 11:01:00  1.32648  1.32682  1.32648   1.32680  1.32647  1.32682  1.32647   1.32678     322
2017-09-13 11:00:00 2017-09-13 11:00:00  1.32647  1.32649  1.32628   1.32648  1.32644  1.32651  1.32626   1.32647     285

In [7]: df.reset_index().drop_duplicates(subset='date', keep='last').set_index('date')
Out[7]:
                                  index  askopen  askhigh   asklow  askclose  bidopen  bidhigh   bidlow  bidclose  volume
date
2017-09-13 11:05:00 2017-09-13 11:05:00  1.32685  1.32704  1.32682   1.32686  1.32684  1.32702  1.32679   1.32683     222
2017-09-13 11:04:00 2017-09-13 11:04:00  1.32683  1.32686  1.32682   1.32685  1.32682  1.32684  1.32680   1.32684      97
2017-09-13 11:03:00 2017-09-13 11:03:00  1.32664  1.32684  1.32663   1.32683  1.32664  1.32683  1.32661   1.32682     268
2017-09-13 11:02:00 2017-09-13 11:02:00  1.32680  1.32692  1.32660   1.32664  1.32678  1.32689  1.32658   1.32664     299
2017-09-13 11:01:00 2017-09-13 11:01:00  1.32648  1.32682  1.32648   1.32680  1.32647  1.32682  1.32647   1.32678     322
2017-09-13 11:00:00 2017-09-13 11:00:00  1.32647  1.32649  1.32628   1.32648  1.32644  1.32651  1.32626   1.32647     285

numpy.unique seems to compare each row as a whole tuple, so rows that share a date but differ in another column (e.g. volume) still count as unique, and duplicate dates come back in the result.
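
For example, passing the full structured array to unique() (continuing the session above) leaves 7 rows rather than the desired 6, because row-wise comparison only removes exact duplicates:

In [8]: len(np.unique(arr_dupes))
Out[8]: 7

The two identical 11:05 rows (volume 246) collapse into one, but the 11:05 row with volume 222 survives, so duplicate dates remain.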

The final output should look like this:

array([
      ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 222),
      ('2017-09-13T11:04:00.000000',  1.32683,  1.32686,  1.32682,  1.32685,  1.32682,  1.32684,  1.3268 ,  1.32684,  97),
      ('2017-09-13T11:03:00.000000',  1.32664,  1.32684,  1.32663,  1.32683,  1.32664,  1.32683,  1.32661,  1.32682, 268),
      ('2017-09-13T11:02:00.000000',  1.3268 ,  1.32692,  1.3266 ,  1.32664,  1.32678,  1.32689,  1.32658,  1.32664, 299),
      ('2017-09-13T11:01:00.000000',  1.32648,  1.32682,  1.32648,  1.3268 ,  1.32647,  1.32682,  1.32647,  1.32678, 322),
      ('2017-09-13T11:00:00.000000',  1.32647,  1.32649,  1.32628,  1.32648,  1.32644,  1.32651,  1.32626,  1.32647, 285)],
      dtype=[('date', '<M8[us]'), ('askopen', '<f8'), ('askhigh', '<f8'), ('asklow', '<f8'), ('askclose', '<f8'),
             ('bidopen', '<f8'), ('bidhigh', '<f8'), ('bidlow', '<f8'), ('bidclose', '<f8'), ('volume', '<i8')]
)

Thank you.

  • If it were keep='last', you'd have a different output from what you've shown... Commented Sep 24, 2017 at 13:21
  • @COLDSPEED Are you sure? I have added the pandas version. Commented Sep 24, 2017 at 13:57
  • @James Why can't you use pandas? Commented Oct 1, 2017 at 6:45
  • @ChaosPredictor Pandas is great, but it adds a lot of overhead, and speed is important in this instance. Commented Oct 1, 2017 at 6:47

1 Answer


The solution to your problem doesn't actually have to mimic pandas' drop_duplicates() function, but I'll provide one version that mimics it and one that doesn't.

If you need exactly the same behavior as pandas' drop_duplicates(), then the following code is the way to go:

# initialization of arr_dupes here

# actual algorithm:
# helper2 holds, for each unique date, the index of its first occurrence
# in the reversed array -- which is the last occurrence in the original
helper1, helper2 = np.unique(arr_dupes['date'][::-1], return_index=True)
result = arr_dupes[::-1][helper2][::-1]

When arr_dupes is initialized, you need to pass only the 'date' column to numpy.unique(). And since you are interested in the last of the non-unique elements, you also have to reverse the array you pass to unique() with [::-1]; that way unique() throws out every non-unique element except the last one. unique() then returns the list of unique elements (helper1) as its first return value and the indices of those elements in the reversed array (helper2) as its second. Lastly, a new array is created by picking the elements listed in helper2 from the reversed arr_dupes and reversing the result back into the original (descending) order.
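
On the sample array from the question, the intermediate index array works out like this (shown purely to illustrate the mechanics):

helper1, helper2 = np.unique(arr_dupes['date'][::-1], return_index=True)
print(helper2)   # [0 1 2 4 5 6] -- first hits in the reversed array,
                 # i.e. the last occurrence of each date in the original

arr_dupes[::-1][helper2] therefore yields the six surviving rows in ascending date order, and the final [::-1] restores the descending order of the input.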

This solution is about 9.898 times faster than the pandas version.

Now let me explain what I meant at the beginning of this answer. Your array appears to be sorted by the 'date' column. If that is true, we can assume that duplicates are grouped together, in which case we only need to keep the rows whose next row's 'date' column differs from the current row's. For example, take a look at the following array rows:

...
  ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 246),
  ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 246),
  ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 222),
  ('2017-09-13T11:04:00.000000',  1.32683,  1.32686,  1.32682,  1.32685,  1.32682,  1.32684,  1.3268 ,  1.32684,  97),
...

The third row's 'date' column differs from the fourth's, so we need to keep it; no further checks are required. The first row's 'date' column is the same as the second row's, so we don't need that row, and the same goes for the second row. In code it looks like this:

# initialization of arr_dupes here

# actual algorithm:
# keep a row when the next row has a different date; the last row always stays
mask = np.concatenate((arr_dupes['date'][:-1] != arr_dupes['date'][1:], np.array([True])))
result = arr_dupes[mask]

First, every element of the 'date' column is compared with the next element, which creates an array of trues and falses. If an index in this boolean mask holds True, the arr_dupes element at that index stays; otherwise it goes. concatenate() then appends one final True, since the last element always belongs in the result.
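
For the sample array from the question, the mask evaluates like this (shown purely as an illustration):

mask = np.concatenate((arr_dupes['date'][:-1] != arr_dupes['date'][1:], np.array([True])))
print(mask)   # [False False  True  True  True False  True  True  True]

The two leading False values drop the duplicated 11:05 rows, the False in the middle drops the duplicated 11:02 row, and the trailing True keeps the final 11:00 row.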

This solution is about 17 times faster than the pandas version.
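
If you want to reproduce the timings, a sketch along these lines should work (the dedupe_* function names are mine, and the exact ratios will vary with your machine and the size of the array):

import timeit

import numpy as np
import pandas as pd

# assumes arr_dupes from the question is already defined

def dedupe_pandas():
    df = pd.DataFrame(arr_dupes, index=arr_dupes['date'])
    return df.reset_index().drop_duplicates(subset='date', keep='last').set_index('date')

def dedupe_unique():
    _, helper2 = np.unique(arr_dupes['date'][::-1], return_index=True)
    return arr_dupes[::-1][helper2][::-1]

def dedupe_mask():
    mask = np.concatenate((arr_dupes['date'][:-1] != arr_dupes['date'][1:], np.array([True])))
    return arr_dupes[mask]

for fn in (dedupe_pandas, dedupe_unique, dedupe_mask):
    print(fn.__name__, timeit.timeit(fn, number=1000))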

1 Comment

I just +1'd your answer - thanks for sharing. I will test and come back to you shortly.
