Iterating over rows in a Pandas DataFrame means accessing each row one by one to perform operations or calculations. For example, if you have a DataFrame of employees' salaries and bonuses and want to calculate the total compensation for each employee, efficient row-wise operations are essential.
Let’s consider this DataFrame:
import pandas as pd
import numpy as np
data = {'A': np.random.randint(1, 20, 10),
        'B': np.random.randint(10, 30, 10),
        'C': np.random.choice(['X', 'Y', 'Z'], 10)}
df = pd.DataFrame(data)
print(df)
Output
    A   B  C
0   2  21  X
1   7  21  X
2  14  27  X
3   2  29  X
4  16  21  Z
5  18  10  Y
6   7  28  Z
7  12  21  Z
8  15  11  X
9  13  24  Z
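Now, let’s explore the most efficient methods one by one. Because the sample data is random and no seed is set, the values in each output below come from a separate run and will differ on your machine.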
Using itertuples()
itertuples() returns each row as a lightweight named tuple. Unlike iterrows(), it does not build a Series for every row, so it preserves each column's data type and carries less per-row overhead. It is a good choice for large datasets when you need structured row-wise access.
Example: In this example, we compute a new column Result based on column C. If C is 'X', we multiply A and B; otherwise, we add them.
results = []
for row in df.itertuples(index=False):
    results.append(row.A * row.B if row.C == 'X' else row.A + row.B)
df['Result'] = results
print(df)
Output
    A   B  C  Result
0  11  12  X     132
1  10  24  Y      34
2  11  28  Y      39
3  17  22  Z      39
4   9  20  Z      29
5  13  15  Z      28
6   2  27  Y      29
7  10  18  Z      28
8   5  14  Y      19
9  17  25  X     425
Explanation:
- df.itertuples(index=False) iterates over each row as a named tuple; index=False leaves the index out of the tuple.
- For rows where C is 'X', A * B is calculated; otherwise A + B.
- Results are stored in a list and assigned to the new column Result.
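The data-type point can be illustrated with a minimal sketch. The two-column frame `demo` and its column names are hypothetical, and `iterrows()` is brought in only as a point of comparison: it packs each row into a single Series, so an integer column is upcast to float when it sits next to a float column, whereas `itertuples()` keeps the integer.

```python
import pandas as pd

# Hypothetical frame with one integer column and one float column
demo = pd.DataFrame({'qty': [1, 2], 'price': [1.5, 2.5]})

# First row as a named tuple: each field keeps its own column's type
tuple_row = next(demo.itertuples(index=False))

# First row as a Series: all values share one dtype, so qty is upcast
_, series_row = next(demo.iterrows())

print(tuple_row)             # fields are accessed as attributes, e.g. tuple_row.qty
print(tuple_row.qty)         # still an integer
print(series_row['qty'])     # now a float
print(series_row.dtype)      # dtype shared by the whole row
```

If the frame also contained a string column, the row Series from `iterrows()` would instead fall back to the `object` dtype.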
Using apply()
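.apply() applies a custom function along an axis of the DataFrame: to each column by default, or to each row with axis=1. It is flexible for complex logic that depends on multiple columns, but row-wise apply() is typically slower than itertuples() because it builds a Series for every row.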
Example: In this example, a custom function calculates Result based on column C.
def custom_func(row):
    return row['A'] * 2 if row['C'] == 'X' else row['B'] * 3
df['Result'] = df.apply(custom_func, axis=1)
print(df)
Output
    A   B  C  Result
0  16  23  X      32
1   2  26  X       4
2   5  24  Z      72
3  16  22  X      32
4  16  28  X      32
5   9  10  Z      30
6  17  16  Z      48
7  15  11  Y      33
8   3  27  Z      81
9   9  10  Y      30
Explanation:
- custom_func is applied to each row of the DataFrame (axis=1).
- If C is 'X', the function returns A * 2; otherwise, it returns B * 3.
- The results are directly assigned to a new column Result.
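Since apply() forwards extra keyword arguments to the function it calls, the multipliers in the row logic do not have to be hard-coded. The following is a minimal sketch under stated assumptions: the small hand-written frame `df_small`, the helper `scaled_value`, and its parameters `x_factor` and `other_factor` are all hypothetical names chosen for illustration.

```python
import pandas as pd

# Small hand-written frame (hypothetical values) so the result is easy to check
df_small = pd.DataFrame({'A': [2, 7, 14],
                         'B': [21, 10, 27],
                         'C': ['X', 'Y', 'Z']})

def scaled_value(row, x_factor, other_factor):
    """Return A * x_factor when C is 'X', otherwise B * other_factor."""
    return row['A'] * x_factor if row['C'] == 'X' else row['B'] * other_factor

# Keyword arguments that apply() does not use itself are passed on to scaled_value
df_small['Result'] = df_small.apply(scaled_value, axis=1, x_factor=2, other_factor=3)
print(df_small)
```

With x_factor=2 and other_factor=3 this reproduces the logic of custom_func, so the same function can be reused with other factors without editing its body.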
Vectorization
Vectorized operations perform calculations on entire columns at once without explicit iteration. They are usually the fastest option for large datasets and should be preferred whenever the logic can be expressed that way.
Example: In this example, Result is computed using np.where for conditional vectorized operations.
df['Result'] = np.where(df['C'] == 'X', df['A'] * df['B'], df['A'] + df['B'])
print(df)
Output
    A   B  C  Result
0   5  28  X     140
1  18  29  X     522
2   6  17  X     102
3  11  19  Y      30
4  15  10  X     150
5   9  20  Y      29
6  14  17  X     238
7   8  16  X     128
8   2  22  Y      24
9  11  27  X     297
Explanation:
- np.where applies a vectorized conditional operation across the entire DataFrame column.
- For rows where C is 'X', A * B is calculated; otherwise, A + B.
- The entire Result column is computed in a single operation without an explicit Python loop.
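As a sanity check, the same rule (A * B when C is 'X', otherwise A + B) can be computed with all three approaches and compared. This is a minimal sketch, not a benchmark; the fixed seed and the variable names `via_tuples`, `via_apply` and `via_where` are assumptions added here so that reruns use the same data.

```python
import numpy as np
import pandas as pd

# Fixed seed (an assumption, not in the original data setup) for reproducible data
rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.integers(1, 20, 10),
                   'B': rng.integers(10, 30, 10),
                   'C': rng.choice(['X', 'Y', 'Z'], 10)})

# 1. Row-by-row with itertuples()
via_tuples = [r.A * r.B if r.C == 'X' else r.A + r.B
              for r in df.itertuples(index=False)]

# 2. Row-wise apply()
via_apply = df.apply(
    lambda r: r['A'] * r['B'] if r['C'] == 'X' else r['A'] + r['B'],
    axis=1)

# 3. Vectorized np.where
via_where = np.where(df['C'] == 'X', df['A'] * df['B'], df['A'] + df['B'])

# All three should agree element by element
same = list(via_where) == via_tuples and list(via_apply) == via_tuples
print(same)
```

Once the results are known to match, the choice between the methods comes down to speed and readability, which is where the vectorized form generally wins.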