Compare DataFrame merge using merge method vs. nested for loop in Pandas

Last update on December 21 2024 09:14:19 (UTC/GMT +8 hours)

Pandas: Performance Optimization Exercise-7 with Solution

Write a Pandas program that performs a merge operation on two large DataFrames using the "merge" method. It compares the performance with a nested for loop.

Sample Solution :

Python Code :

import pandas as pd  # Import the Pandas library
import numpy as np  # Import the NumPy library
import time  # Import the time module to measure execution time

# Create two large DataFrames
np.random.seed(0)  # Set seed for reproducibility
data1 = {
    'Key': np.random.randint(1, 1000, size=1000),
    'Value1': np.random.randint(1, 100, size=1000)
}
data2 = {
    'Key': np.random.randint(1, 1000, size=1000),
    'Value2': np.random.randint(1, 100, size=1000)
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Perform merge using the merge method
start_time = time.time()  # Record the start time
merged_df = pd.merge(df1, df2, on='Key')
time_merge = time.time() - start_time  # Calculate the time taken

# Perform merge using a nested for loop
start_time = time.time()  # Record the start time
merged_data = []
for index1, row1 in df1.iterrows():
    for index2, row2 in df2.iterrows():
        if row1['Key'] == row2['Key']:
            merged_data.append({**row1, **row2})
merged_df_loop = pd.DataFrame(merged_data)
time_nested_loop = time.time() - start_time  # Calculate the time taken

# Print the time taken for both methods
print("Time taken using merge method:", time_merge, "seconds")
print("Time taken using nested for loop:", time_nested_loop, "seconds")

Output:

Time taken using merge method: 0.0009975433349609375 seconds
Time taken using nested for loop: 31.308058738708496 seconds

Explanation:

Import Libraries:

Import the Pandas library for data manipulation.
Import the NumPy library for generating random data.
Import the time module to measure execution time.

Create Two Large DataFrames:

Set a seed for reproducibility using np.random.seed(0).
Create two dictionaries data1 and data2 with a 'Key' column containing random integers and a 'Value1' or 'Value2' column containing random integers.
Generate DataFrames 'df1' and 'df2' using the dictionaries.

Perform Merge Using "merge" Method:

Record the start time using time.time().
Use the "pd.merge" method to merge 'df1' and 'df2' on the 'Key' column.
Calculate the time taken by subtracting the start time from the current time.

Perform Merge Using Nested For Loop:

Record the start time using time.time().
Initialize an empty list merged_data to store the merged rows.
Use nested for loops to iterate through each row in 'df1' and 'df2'. If the 'Key' values match, combine the rows and append to merged_data.
Convert 'merged_data' to a DataFrame 'merged_df_loop'.
Calculate the time taken by subtracting the start time from the current time.

Print Results:

Display the time taken for both the merge method and the nested for loop method.

Python-Pandas Code Editor:

Have another way to solve this solution? Contribute your code (and comments) through Disqus.

Previous: Compare data aggregation using groupby vs. manual iteration in Pandas.
Next: Optimize memory usage with Categorical data type in Pandas DataFrame.