Another blog post about an absolute beginner topic. Counting the number of rows of a pandas DataFrame is really simple, yet there are many ways to do it. In this blog post, we go through each of them. By looking at the pandas codebase, we try to understand what’s going on behind the scenes and why one solution is faster than another.
Get the length of the DataFrame
The easiest way to get the length of a pandas DataFrame is by requesting its length using len(). In most cases, this is the most concise way to do it.
len(df)
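As a minimal sketch, here is what that looks like on a small DataFrame. The column names and values are made up for illustration only.

```python
import pandas as pd

# Hypothetical example data: three rows, two columns.
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})

# len() on a DataFrame returns the number of rows.
n_rows = len(df)
print(n_rows)  # 3
```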
However, we can speed up this process, which might come in handy if you’re processing huge DataFrames. If we look at the pandas codebase, we get a hint: the len function actually requests the length of the index of the DataFrame.
def __len__(self) -> int:
    """
    Returns length of info axis, but here we use the index.
    """
    return len(self.index)
Get the length of the DataFrame index
If that is the case, we can do that ourselves and skip the intermediate step, can’t we? Let’s request the len() of the DataFrame index ourselves. This improves the speed by 25% to 40%.
len(df.index)
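If you want to check the difference on your own machine, a rough benchmark along these lines might help. The helper name time_call, the DataFrame size, and the repeat counts are assumptions chosen for illustration; the actual numbers will vary with your pandas version and hardware.

```python
import timeit

import pandas as pd

# Hypothetical test data with a default RangeIndex.
df = pd.DataFrame({"a": range(100_000)})


def time_call(fn, number=20_000, repeat=5):
    """Return the best average time per call, in seconds."""
    return min(timeit.repeat(fn, number=number, repeat=repeat)) / number


t_len_df = time_call(lambda: len(df))
t_len_index = time_call(lambda: len(df.index))

print(f"len(df):       {t_len_df * 1e9:.0f} ns per call")
print(f"len(df.index): {t_len_index * 1e9:.0f} ns per call")
```

Both expressions return the same number, so any difference you see is purely call overhead.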
Can we do better? Of course we can. If you didn’t specify any index, the result of df.index will be of the type RangeIndex. The __len__ method of a RangeIndex delegates to an internal range object and calls the len function on that.
def __len__(self) -> int:
    """
    return the length of the RangeIndex
    """
    return len(self._range)
Get the length of the range of the RangeIndex
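That’s another intermediate step we can skip. Going from the DataFrame to its RangeIndex and then to the underlying range ourselves doubles the speed again. Keep in mind that _range is a private attribute (note the leading underscore) and only exists when the index is a RangeIndex.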
len(df.index._range)
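Because _range is private and only present on a RangeIndex, it is probably safer to guard the shortcut. The sketch below uses a hypothetical helper named fast_row_count that falls back to len(df.index) for any other index type; the guard itself adds some overhead, so it may cancel out part of the gain.

```python
import pandas as pd


def fast_row_count(df: pd.DataFrame) -> int:
    """Count rows via the private _range shortcut when it is available."""
    index = df.index
    if isinstance(index, pd.RangeIndex):
        # _range is a private attribute and may change between versions,
        # so we look it up defensively.
        rng = getattr(index, "_range", None)
        if rng is not None:
            return len(rng)
    # Any other index type: use the public API.
    return len(index)


# Default index (RangeIndex).
df_default = pd.DataFrame({"name": ["a", "b", "c"], "value": [1, 2, 3]})
# Custom index (not a RangeIndex).
df_custom = df_default.set_index("name")

print(fast_row_count(df_default))  # 3
print(fast_row_count(df_custom))   # 3
```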
What about the shape or the axes?
This is something you might run into a lot: the shape property returns a tuple that contains the number of rows and the number of columns. It is 50% slower than what we started with, len(df).
df.shape[0]
Why is that? If we look at the codebase, the answer becomes clear immediately. The shape property calls the len function twice: once for the rows and once for the columns.
def shape(self) -> Tuple[int, int]:
    """
    Return a tuple representing the dimensionality of the DataFrame.
    """
    return len(self.index), len(self.columns)
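A small sketch of what shape gives you, using made-up data. It is handy when you need both dimensions anyway, since you pay for both len calls either way.

```python
import pandas as pd

# Hypothetical example data: two rows, three columns.
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]})

# shape is a property (no parentheses) that returns (rows, columns).
n_rows, n_cols = df.shape
print(n_rows, n_cols)  # 2 3
```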
Finally, what about axes?
len(df.axes[0])
This follows the exact same reasoning: the axes property returns the index and the columns, making it slower than our initial solutions.
def axes(self) -> List[Index]:
    """
    Return a list representing the axes of the DataFrame.

    It has the row axis labels and column axis labels as the only members.
    They are returned in that order.

    Examples
    --------
    >>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
    >>> df.axes
    [RangeIndex(start=0, stop=2, step=1), Index(['col1', 'col2'],
    dtype='object')]
    """
    return [self.index, self.columns]
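To make that concrete, here is a short sketch with the same two-column DataFrame as in the docstring. The variable names rows_axis and cols_axis are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# axes is a list holding the row index first and the column index second.
rows_axis, cols_axis = df.axes

n_rows = len(df.axes[0])
print(n_rows)           # 2
print(list(cols_axis))  # ['col1', 'col2']
```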
Concluding remarks
How relevant is this? Honestly, not very. The time it takes to count the rows of a pandas DataFrame barely depends on the size of the DataFrame. It doesn’t get slower exponentially. It slows down roughly linearly, and the effect is almost negligible.
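If you would like to see how the timing behaves as the DataFrame grows, a rough sketch like the following could be a starting point. The sizes and repeat counts are arbitrary assumptions, and the results will differ per machine and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical DataFrame sizes to compare.
sizes = [1_000, 100_000, 1_000_000]
number = 10_000  # calls per timing run

results = {}
for n in sizes:
    df = pd.DataFrame({"a": range(n)})
    # Best of three runs, averaged per call, in seconds.
    best = min(timeit.repeat(lambda: len(df), number=number, repeat=3))
    results[n] = best / number
    print(f"{n:>9} rows: {results[n] * 1e9:.0f} ns per len(df) call")
```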
Great success!