w3resource

Compare performance of apply vs. vectorized operations in Pandas

Pandas: Performance Optimization Exercise-2 with Solution

Write a Pandas program to compare the performance of applying a custom function to a column using apply vs. using vectorized operations.

Sample Solution:

Python Code:

import pandas as pd  # Import the Pandas library
import numpy as np  # Import the NumPy library
import time  # Import the time module to measure execution time

# Create a large DataFrame with random integers
np.random.seed(0)  # Set seed for reproducibility
data = np.random.randint(1, 100, size=(1000000, 1))  # Generate random data
df = pd.DataFrame(data, columns=['Values'])  # Create a DataFrame

# Define a custom function to apply
def custom_function(x):
    return x * 2 + 3

# Time Series.apply, which calls custom_function once per element in a Python-level loop
start_time = time.time()  # Record the start time
df['Apply_Result'] = df['Values'].apply(custom_function)
time_apply = time.time() - start_time  # Elapsed time in seconds

# Time the vectorized version: passing the whole Series lets x * 2 + 3 run element-wise in NumPy
start_time = time.time()  # Record the start time
df['Vectorized_Result'] = custom_function(df['Values'])
time_vectorized = time.time() - start_time  # Elapsed time in seconds

# Print the time taken for both methods
print("Time taken using apply:", time_apply, "seconds")
print("Time taken using vectorized operations:", time_vectorized, "seconds")

Output (exact timings vary by machine and library version):

Time taken using apply: 0.25844264030456543 seconds
Time taken using vectorized operations: 0.0029630661010742188 seconds
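
To put the two numbers above in perspective, the short sketch below computes their ratio. The values are copied from the sample output and are specific to that one run, so the resulting factor (roughly 87x here) is illustrative rather than a fixed property of Pandas.

```python
# Timings copied from the sample output above (seconds); they come from a single run.
time_apply = 0.25844264030456543
time_vectorized = 0.0029630661010742188

# How many times faster the vectorized version was in that run
speedup = time_apply / time_vectorized
print(f"Vectorized version was about {speedup:.0f}x faster in this run")
```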

Explanation:

  • Import Libraries:
    • Import the Pandas library for data manipulation.
    • Import the NumPy library for generating random data.
    • Import the time module to measure execution time.
  • Create a Large DataFrame:
    • Set a seed for reproducibility using np.random.seed(0).
    • Generate random integers from 1 to 99 with np.random.randint and create a large DataFrame with 1,000,000 rows and one column named 'Values'.
  • Define a Custom Function:
    • Create a custom function custom_function(x) that returns x * 2 + 3. Because it uses only arithmetic operators, it works on a single number and on a whole Pandas Series alike.
  • Measure Time Using apply:
    • Record the start time using time.time().
    • Apply the custom function to the 'Values' column using the Pandas apply method, which calls the function once for every element, and store the result in a new column 'Apply_Result'.
    • Calculate the time taken by subtracting the start time from the current time.
  • Measure Time Using Vectorized Operations:
    • Record the start time using time.time().
    • Pass the whole 'Values' column to the custom function in a single call, so the multiplication and addition run element-wise on the entire column at once, and store the result in a new column 'Vectorized_Result'.
    • Calculate the time taken by subtracting the start time from the current time.
  • Finally, print the time taken by the apply method and by the vectorized operation.
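
A single time.time() measurement can be noisy, and the solution does not check that the two approaches agree. As a minimal sketch (not part of the original solution), the comparison could be repeated with the standard-library timeit module, keeping the best of several runs, and the two results compared element by element. The 100,000-row size and the repeat count of 5 are arbitrary choices made here so the example finishes quickly.

```python
import timeit

import numpy as np
import pandas as pd

# Same setup as the solution, but with fewer rows (an assumption for a quick run)
np.random.seed(0)
df = pd.DataFrame({"Values": np.random.randint(1, 100, size=100_000)})


def custom_function(x):
    return x * 2 + 3


# apply() calls custom_function once per element
apply_result = df["Values"].apply(custom_function)

# Passing the whole Series performs the arithmetic on the full column at once
vectorized_result = custom_function(df["Values"])

# Both approaches should produce the same values
results_match = bool((apply_result == vectorized_result).all())

# Run each statement 5 times and keep the fastest run as the estimate
time_apply = min(
    timeit.repeat(lambda: df["Values"].apply(custom_function), number=1, repeat=5)
)
time_vectorized = min(
    timeit.repeat(lambda: custom_function(df["Values"]), number=1, repeat=5)
)

print("Results match:", results_match)
print(f"Best time using apply: {time_apply:.6f} seconds")
print(f"Best time using vectorized operations: {time_vectorized:.6f} seconds")
```

Taking the minimum of several runs reduces the influence of one-off delays such as other processes competing for the CPU; the absolute numbers will still differ from machine to machine.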
