14

Question

Is there a good way to transform a DataFrame with an n-level index into an n-D NumPy array (a.k.a. an n-tensor)?


Example

Suppose I set up a DataFrame like

from pandas import DataFrame, MultiIndex

index = range(2), range(3)
value = range(2 * 3)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0))
print(frame)

which outputs

     value
0 0      0
  1      1
  2      2
1 1      4
  2      5

The index is a 2-level hierarchical index. I can extract a 2-D Numpy array from the data using

print(frame.unstack().values)

which outputs

[[  0.   1.   2.]
 [ nan   4.   5.]]

How does this generalize to an n-level index?

Playing with unstack(), it seems that it can only be used to massage the 2-D shape of the DataFrame, but not to add an axis.

I cannot use e.g. frame.values.reshape(x, y, z), since this would require that the frame contains exactly x * y * z rows, which cannot be guaranteed. This is what I tried to demonstrate by drop()ing a row in the above example.
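To make the failure concrete, here is a minimal sketch using the 2-D frame from above. The flag reshape_failed is only an illustrative name, not part of the original example.

```python
from pandas import DataFrame, MultiIndex

index = range(2), range(3)
value = list(range(2 * 3))
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0))

# Only 5 rows remain, so they cannot fill a 2 x 3 grid.
try:
    frame.values.reshape(2, 3)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print(err)
```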

Any suggestions are highly appreciated.

2
  • The answer to "how does it generalize" is it doesn't. A pandas DataFrame is fundamentally a two-dimensional object. As your example shows, it doesn't enforce equal sizes across index "dimensions", so if you try to expand it to more dimensions, there may be gaps. I think if you want to get an n-D array you may have to make it yourself by iterating over the index levels and creating a separate "slice" of the result array for each. Pandas just isn't targeted at that sort of structure. Commented Jan 27, 2016 at 21:34
  • Thanks @Bren. I managed to address the problem of missing rows and to use reshape() (see below). This seems to work on my dataset, although I wouldn't be surprised if there are situations where it chokes. Commented Jan 27, 2016 at 23:22

2 Answers

18

Edit. This approach is much more elegant (and two orders of magnitude faster) than the one I gave below.

import numpy as np

# create an empty array of NaN of the right dimensions
shape = tuple(map(len, frame.index.levels))
arr = np.full(shape, np.nan)

# fill it using NumPy's advanced indexing; the codes must be passed as a tuple
arr[tuple(frame.index.codes)] = frame.values.flat
# ...or in Pandas < 0.24.0, use
# arr[tuple(frame.index.labels)] = frame.values.flat
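For reference, the same idea can be wrapped in a small helper and run on the 2-D frame from the question. This is only a sketch: the name frame_to_ndarray and its column parameter are made up for illustration, and it assumes a unique MultiIndex with no missing labels and pandas >= 0.24 (for MultiIndex.codes).

```python
import numpy as np
from pandas import DataFrame, MultiIndex

def frame_to_ndarray(frame, column='value'):
    """Scatter one column of a MultiIndexed frame into an n-D array."""
    # One axis per index level; cells without a row stay NaN.
    arr = np.full(frame.index.levshape, np.nan)
    # codes holds one integer array per level; as a tuple it addresses
    # one cell of arr per row of the frame.
    arr[tuple(frame.index.codes)] = frame[column].to_numpy()
    return arr

index = range(2), range(3)
frame = DataFrame(list(range(2 * 3)), columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0))
arr = frame_to_ndarray(frame)
print(arr)
```

Note that drop() keeps unused entries in frame.index.levels, so a level value whose rows were all removed would still get an all-NaN slice. Calling frame.index.remove_unused_levels() first should avoid that if it is not wanted.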

Original solution. Given a setup similar to above, but in 3-D,

from pandas import DataFrame, MultiIndex
from itertools import product

index = range(2), range(2), range(2)
value = range(2 * 2 * 2)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0, 1))
print(frame)

we have

       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
  1 0      6
    1      7

Now, we proceed using the reshape() route, but with some preprocessing to ensure that the length along each dimension will be consistent.

First, reindex the data frame with the full Cartesian product of all index levels. NaN values will be inserted as needed. This operation can be slow and memory-hungry, depending on the number of dimensions and on the size of the data frame.

levels = map(tuple, frame.index.levels)
index = list(product(*levels))
frame = frame.reindex(index)
print(frame)

which outputs

       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
    1    NaN
  1 0      6
    1      7

Now, reshape() will work as intended.

shape = tuple(map(len, frame.index.levels))
print(frame.values.reshape(shape))

which outputs

[[[  0.   1.]
  [  2.   3.]]

 [[  4.  nan]
  [  6.   7.]]]

The (rather ugly) one-liner is

frame.reindex(list(product(*map(tuple, frame.index.levels)))).values\
     .reshape(tuple(map(len, frame.index.levels)))
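As a sanity check, the reindex route and the advanced-indexing route should give the same array. The sketch below uses MultiIndex.from_product and levshape in place of itertools.product and map(len, ...), which is a substitution on my part. The helper name reindex_to_ndarray is hypothetical, and the code assumes a single-column frame with a unique index.

```python
import numpy as np
from pandas import DataFrame, MultiIndex

def reindex_to_ndarray(frame):
    """Pad the frame to the full product of its levels, then reshape."""
    full = MultiIndex.from_product(frame.index.levels)
    return frame.reindex(full).to_numpy().reshape(frame.index.levshape)

index = range(2), range(2), range(2)
frame = DataFrame(list(range(2 * 2 * 2)), columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0, 1))

via_reindex = reindex_to_ndarray(frame)

# Advanced-indexing route, for comparison.
via_codes = np.full(frame.index.levshape, np.nan)
via_codes[tuple(frame.index.codes)] = frame['value'].to_numpy()

same = np.allclose(via_reindex, via_codes, equal_nan=True)
print(same)
```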

6 Comments

Works nicely! There's a minor typo: frame.reindex(levels) should be frame.reindex(index).
For us newbies: don't forget that in Python 3 you'll need to turn the result of map into a list before any of this will work, i.e. shape = list(map(len, frame.index.levels))
Getting the shape is more straightforward with frame.index.levshape. Neither this nor the given solution seems to work with non-unique indices.
Doing df.index.labels I get AttributeError: 'MultiIndex' object has no attribute 'labels'. What's up with that?
@CrabMan Very late response, but MultiIndex.labels has been deprecated in favor of MultiIndex.codes - using the latter should fix the error. (pandas-docs.github.io/pandas-docs-travis/whatsnew/…)
0

This can be done quite nicely using the Python xarray package which can be found here: http://xarray.pydata.org/en/stable/. It has great integration with Pandas and is quite intuitive once you get to grips with it.

If you have a Series with a MultiIndex, you can call its built-in to_xarray() method, e.g. multiindex_series.to_xarray() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_xarray.html). This generates a DataArray object, which is essentially a NumPy array with named, labelled axes: the index level names become the dimension names and the level values become the coordinates. You can then call .values on the DataArray to get the underlying NumPy array.

If you need your tensor to conform to a set of keys in a specific order, you can also call .reindex(index_name = index_values_in_order) (http://xarray.pydata.org/en/stable/generated/xarray.DataArray.reindex.html) on the DataArray. This can be extremely useful and makes working with the newly generated tensor much easier!
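A minimal sketch of this route, assuming xarray is installed (it is an optional dependency of pandas, hence the ImportError guard). The level names 'a' and 'b' and the variable names are made up for illustration.

```python
import numpy as np
from pandas import Series, MultiIndex

index = MultiIndex.from_product([range(2), range(3)], names=['a', 'b'])
series = Series(np.arange(6.0), index=index, name='value').drop((1, 0))

try:
    data_array = series.to_xarray()
except ImportError:
    data_array = None  # xarray is not installed

if data_array is not None:
    # 2 x 3 array; the dropped (1, 0) entry shows up as NaN
    print(data_array.values)
    # put the 'b' axis into an explicit order
    reordered = data_array.reindex(b=[2, 1, 0])
    print(reordered.values)
```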
