Take the following Array:
import numpy as np
arr_dupes = np.array(
[
('2017-09-13T11:05:00.000000', 1.32685, 1.32704, 1.32682, 1.32686, 1.32684, 1.32702, 1.32679, 1.32683, 246),
('2017-09-13T11:05:00.000000', 1.32685, 1.32704, 1.32682, 1.32686, 1.32684, 1.32702, 1.32679, 1.32683, 246),
('2017-09-13T11:05:00.000000', 1.32685, 1.32704, 1.32682, 1.32686, 1.32684, 1.32702, 1.32679, 1.32683, 222),
('2017-09-13T11:04:00.000000', 1.32683, 1.32686, 1.32682, 1.32685, 1.32682, 1.32684, 1.3268 , 1.32684, 97),
('2017-09-13T11:03:00.000000', 1.32664, 1.32684, 1.32663, 1.32683, 1.32664, 1.32683, 1.32661, 1.32682, 268),
('2017-09-13T11:02:00.000000', 1.3268 , 1.32692, 1.3266 , 1.32664, 1.32678, 1.32689, 1.32658, 1.32664, 299),
('2017-09-13T11:02:00.000000', 1.3268 , 1.32692, 1.3266 , 1.32664, 1.32678, 1.32689, 1.32658, 1.32664, 299),
('2017-09-13T11:01:00.000000', 1.32648, 1.32682, 1.32648, 1.3268 , 1.32647, 1.32682, 1.32647, 1.32678, 322),
('2017-09-13T11:00:00.000000', 1.32647, 1.32649, 1.32628, 1.32648, 1.32644, 1.32651, 1.32626, 1.32647, 285)],
dtype=[('date', '<M8[us]'), ('askopen', '<f8'), ('askhigh', '<f8'), ('asklow', '<f8'), ('askclose', '<f8'),
('bidopen', '<f8'), ('bidhigh', '<f8'), ('bidlow', '<f8'), ('bidclose', '<f8'), ('volume', '<i8')]
)
What is the fastest solution to remove duplicates, using the dates as an index and keeping the last value?
Pandas DataFrame equivalent is
In [5]: df = pd.DataFrame(arr_dupes, index=arr_dupes['date'])
In [6]: df
Out[6]:
date askopen askhigh asklow askclose bidopen bidhigh bidlow bidclose volume
2017-09-13 11:05:00 2017-09-13 11:05:00 1.32685 1.32704 1.32682 1.32686 1.32684 1.32702 1.32679 1.32683 246
2017-09-13 11:05:00 2017-09-13 11:05:00 1.32685 1.32704 1.32682 1.32686 1.32684 1.32702 1.32679 1.32683 246
2017-09-13 11:05:00 2017-09-13 11:05:00 1.32685 1.32704 1.32682 1.32686 1.32684 1.32702 1.32679 1.32683 222
2017-09-13 11:04:00 2017-09-13 11:04:00 1.32683 1.32686 1.32682 1.32685 1.32682 1.32684 1.32680 1.32684 97
2017-09-13 11:03:00 2017-09-13 11:03:00 1.32664 1.32684 1.32663 1.32683 1.32664 1.32683 1.32661 1.32682 268
2017-09-13 11:02:00 2017-09-13 11:02:00 1.32680 1.32692 1.32660 1.32664 1.32678 1.32689 1.32658 1.32664 299
2017-09-13 11:02:00 2017-09-13 11:02:00 1.32680 1.32692 1.32660 1.32664 1.32678 1.32689 1.32658 1.32664 299
2017-09-13 11:01:00 2017-09-13 11:01:00 1.32648 1.32682 1.32648 1.32680 1.32647 1.32682 1.32647 1.32678 322
2017-09-13 11:00:00 2017-09-13 11:00:00 1.32647 1.32649 1.32628 1.32648 1.32644 1.32651 1.32626 1.32647 285
In [7]: df.reset_index().drop_duplicates(subset='date', keep='last').set_index('date')
Out[7]:
index askopen askhigh asklow askclose bidopen bidhigh bidlow bidclose volume
date
2017-09-13 11:05:00 2017-09-13 11:05:00 1.32685 1.32704 1.32682 1.32686 1.32684 1.32702 1.32679 1.32683 222
2017-09-13 11:04:00 2017-09-13 11:04:00 1.32683 1.32686 1.32682 1.32685 1.32682 1.32684 1.32680 1.32684 97
2017-09-13 11:03:00 2017-09-13 11:03:00 1.32664 1.32684 1.32663 1.32683 1.32664 1.32683 1.32661 1.32682 268
2017-09-13 11:02:00 2017-09-13 11:02:00 1.32680 1.32692 1.32660 1.32664 1.32678 1.32689 1.32658 1.32664 299
2017-09-13 11:01:00 2017-09-13 11:01:00 1.32648 1.32682 1.32648 1.32680 1.32647 1.32682 1.32647 1.32678 322
2017-09-13 11:00:00 2017-09-13 11:00:00 1.32647 1.32649 1.32628 1.32648 1.32644 1.32651 1.32626 1.32647 285
numpy.unique seems to compare the entire tuple and will return duplicates.
Final output should look like this.
array([
('2017-09-13T11:05:00.000000', 1.32685, 1.32704, 1.32682, 1.32686, 1.32684, 1.32702, 1.32679, 1.32683, 222),
('2017-09-13T11:04:00.000000', 1.32683, 1.32686, 1.32682, 1.32685, 1.32682, 1.32684, 1.3268 , 1.32684, 97),
('2017-09-13T11:03:00.000000', 1.32664, 1.32684, 1.32663, 1.32683, 1.32664, 1.32683, 1.32661, 1.32682, 268),
('2017-09-13T11:02:00.000000', 1.3268 , 1.32692, 1.3266 , 1.32664, 1.32678, 1.32689, 1.32658, 1.32664, 299),
('2017-09-13T11:01:00.000000', 1.32648, 1.32682, 1.32648, 1.3268 , 1.32647, 1.32682, 1.32647, 1.32678, 322),
('2017-09-13T11:00:00.000000', 1.32647, 1.32649, 1.32628, 1.32648, 1.32644, 1.32651, 1.32626, 1.32647, 285)],
dtype=[('date', '<M8[us]'), ('askopen', '<f8'), ('askhigh', '<f8'), ('asklow', '<f8'), ('askclose', '<f8'),
('bidopen', '<f8'), ('bidhigh', '<f8'), ('bidlow', '<f8'), ('bidclose', '<f8'), ('volume', '<i8')]
)
Thank-you