Coming from R, I’m not a big fan of MultiIndex DataFrames in Pandas. I definitely see the merits, but it just doesn’t feel right within a machine learning and feature engineering context. However, sometimes you will end up with a MultiIndex DataFrame, after some ninja line of code. In this blog post I explain how to flatten a MultiIndex DataFrame.
🐱🏍Update (2021-09-03): blog post that uses to_flat_index!
First, some dummy date to demonstrate.
import pandas as pd df = pd.DataFrame(data = {'id': ['aaa', 'aaa', 'bbb', 'bbb', 'ccc'], 'val': [4, 5, 10, 3, 1]})
A typical situation that results in a MultiIndex DataFrame is when you use groupby and apply multiple aggregation functions to a column. Like this:
df.groupby(['id'], as_index = True).agg({'val': ['mean', 'median', 'sum']})
This is the result:
What if I want column names like this: val_mean, val_median and val_sum? Well, this is how you can do it:
df.columns = ['_'.join(col).strip() for col in df.columns.values]
Why does this work?
Well, let’s take a look at df.columns.values. What happens is the following. The column level names get returned as tuples. By looping over these tuples and joining the values within the tuples, tfor each column, the different column level names are concatenated in one column name.
By the way, I didn’t necessarily come up with this solution myself. Although I’m grateful you’ve visited this blog post, you should know I get a lot from websites like StackOverflow and I have a lot of coding books. This one by Matt Harrison (on Pandas 1.x!) has been updated in 2020 and is an absolute primer on Pandas basics. If you want something broad, ranging from data wrangling to machine learning, try “Mastering Pandas” by Stefanie Molin.
Great success!
Your point of view caught my eye and was very interesting. Thanks. I have a question for you. https://www.binance.info/el/join?ref=JHQQKNKN