Let’s get your code 200x faster!
There are many ways to iterate through a Pandas DataFrame, according to your needs.
A commonly used function is iterrows().
However, it is not recommended outside of a few very specific scenarios.
Alternative? itertuples().
Pandas’ itertuples() function is way faster than the iterrows() function.
Take a look at the code in the image.
I define a DataFrame with a million rows and one column, Number, containing numbers from 1 to 1 million.
Iterating through the million rows and calculating the sum of the numbers takes iterrows ~18s.
itertuples, by contrast, does the same task in just 82ms!
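For anyone who cannot see the image, here is a rough sketch of what such a comparison could look like. It is not the exact code from the image. The helper names (sum_with_iterrows, sum_with_itertuples, timed) are made up for illustration, and N_ROWS is set lower than the 1 million rows above so it finishes quickly. Timings will vary by machine and Pandas version.

```python
import time

import pandas as pd


def sum_with_iterrows(df):
    # iterrows yields (index, row) pairs, where row is a Series
    total = 0
    for _, row in df.iterrows():
        total += row["Number"]
    return total


def sum_with_itertuples(df):
    # itertuples yields one named tuple per row; columns are attributes
    total = 0
    for row in df.itertuples():
        total += row.Number
    return total


def timed(func, df):
    # Hypothetical helper: returns the result and the elapsed seconds
    start = time.perf_counter()
    result = func(df)
    return result, time.perf_counter() - start


# Assumption: fewer rows than the 1 million in the post, to keep the run short
N_ROWS = 100_000
df = pd.DataFrame({"Number": range(1, N_ROWS + 1)})

total_rows, secs_rows = timed(sum_with_iterrows, df)
total_tuples, secs_tuples = timed(sum_with_itertuples, df)

print(f"iterrows:   sum={total_rows}, {secs_rows:.3f}s")
print(f"itertuples: sum={total_tuples}, {secs_tuples:.3f}s")
```

Setting N_ROWS to 1_000_000 should match the setup described above, though the exact numbers will differ from mine.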
One reason for this vast improvement is how these 2 functions work under the hood.
iterrows builds a new Series object for every row it yields, which is expensive.
itertuples yields each row as a named tuple instead, which is much faster.
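A quick way to see the difference is to look at what each function hands back for a single row. This is a small sketch on a three-row DataFrame of my own, not the one from the post:

```python
import pandas as pd

df = pd.DataFrame({"Number": [1, 2, 3]})

# iterrows: each item is an (index, Series) pair
index, series_row = next(df.iterrows())
print(type(series_row))        # a pandas Series
print(series_row["Number"])    # access by column label

# itertuples: each item is a named tuple (named "Pandas" by default)
tuple_row = next(df.itertuples())
print(type(tuple_row).__name__)
print(tuple_row._fields)       # includes the index as the first field
print(tuple_row.Number)        # access by attribute
```

Note that itertuples includes the index as the first field by default; passing index=False leaves it out.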
So, the next time, use itertuples!
NOTE: This is by no means the best way to calculate the sum of the numbers. There are many other much simpler and faster ways.
This post only demonstrates the difference between these two functions when iterating through DataFrames. I used the sum just to keep the example simple.
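For reference, one of those simpler ways is to skip the loop entirely and let Pandas sum the column. A minimal sketch, assuming the same single-column Number DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Number": range(1, 1_000_001)})

# Vectorized sum over the whole column, no Python-level loop
total = df["Number"].sum()
print(total)  # 500000500000
```

When a vectorized operation like this exists, it is usually the first thing to reach for. Save itertuples for the cases where you really do need to go row by row.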
#python #datascience #machinelearning