w3resource

Compare data aggregation using groupby vs. manual iteration in Pandas

Pandas: Performance Optimization Exercise-6 with Solution

Write a Pandas program that uses the groupby method to aggregate data and compares performance with manually iterating through the DataFrame.

Sample Solution :

Python Code :

import pandas as pd  # Import the Pandas library
import numpy as np  # Import the NumPy library
import time  # Import the time module to measure execution time

# Create a sample DataFrame
np.random.seed(0)  # Set seed for reproducibility
data = {
    'Category': np.random.choice(['A', 'B', 'C', 'D'], size=1000000),
    'Values': np.random.randint(1, 100, size=1000000)
}
df = pd.DataFrame(data)

# Define a custom aggregation function
def custom_aggregation(data):
    result = {}
    for category in data['Category'].unique():
        result[category] = data[data['Category'] == category]['Values'].sum()
    return result

# Aggregate data using the groupby method
start_time = time.time()  # Record the start time
groupby_result = df.groupby('Category')['Values'].sum()
time_groupby = time.time() - start_time  # Calculate the time taken

# Aggregate data using manual iteration
start_time = time.time()  # Record the start time
manual_result = custom_aggregation(df)
time_manual = time.time() - start_time  # Calculate the time taken

# Print the results
print("Aggregation result using groupby:")
print(groupby_result)
print("\nTime taken using groupby:", time_groupby, "seconds")

print("\nAggregation result using manual iteration:")
print(manual_result)
print("\nTime taken using manual iteration:", time_manual, "seconds")

Output:

Aggregation result using groupby:
Category
A    12541392
B    12440541
C    12477135
D    12502875
Name: Values, dtype: int32

Time taken using groupby: 0.0588076114654541 seconds

Aggregation result using manual iteration:
{'A': 12541392, 'D': 12502875, 'B': 12440541, 'C': 12477135}

Time taken using manual iteration: 0.2567422389984131 seconds

Explanation:

  • Import Libraries:
    • Import the Pandas library for data manipulation.
    • Import the NumPy library for generating random data.
    • Import the time module to measure execution time.
  • Create a Sample DataFrame:
    • Set a seed for reproducibility using np.random.seed(0).
    • Create a dictionary data with a 'Category' column containing random category labels and a 'Values' column containing random integers.
    • Generate a DataFrame df using the dictionary.
  • Define Custom Aggregation Function:
    • Create a function custom_aggregation(data) to manually iterate through the DataFrame and aggregate the 'Values' column based on the 'Category' column.
  • Aggregate Data Using groupby:
    • Record the start time using time.time().
    • Use the groupby method to aggregate the 'Values' column by 'Category' and calculate the sum.
    • Calculate the time taken by subtracting the start time from the current time.
  • Aggregate Data Using Manual Iteration:
    • Record the start time using time.time().
    • Use the custom_aggregation function to manually iterate through the DataFrame and aggregate the 'Values' column based on the 'Category' column.
    • Calculate the time taken by subtracting the start time from the current time.
  • Finally display the aggregation results and the time taken for both the groupby method and the manual iteration method.

Python-Pandas Code Editor:

Have another way to solve this solution? Contribute your code (and comments) through Disqus.

Previous: Compare DataFrame row filtering using for loop vs. Boolean indexing.
Next: Compare DataFrame merge using merge method vs. nested for loop in Pandas.

What is the difficulty level of this exercise?

Test your Programming skills with w3resource's quiz.



Follow us on Facebook and Twitter for latest update.