w3resource

Optimize memory usage with Categorical data type in Pandas DataFrame

Pandas: Performance Optimization Exercise-8 with Solution

Write a Pandas program to create a DataFrame with categorical data and use the category data type to optimize memory usage. Measure the resulting difference in memory usage.

Sample Solution:

Python Code:

import pandas as pd  # Import the Pandas library
import numpy as np  # Import the NumPy library

# Create a sample DataFrame with categorical data
np.random.seed(0)  # Set seed for reproducibility
data = {
    'Category': np.random.choice(['A', 'B', 'C', 'D'], size=1000000),
    'Values': np.random.randint(1, 100, size=1000000)
}
df = pd.DataFrame(data)

# Print memory usage before optimization
# (df.info() prints its report directly and returns None, so it is not wrapped in print())
print("Memory usage before optimization:")
df.info(memory_usage='deep')

# Convert the 'Category' column to the category data type
df['Category'] = df['Category'].astype('category')

# Print memory usage after optimization
print("\nMemory usage after optimization:")
df.info(memory_usage='deep')

Output:

Memory usage before optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
 #   Column    Non-Null Count    Dtype 
---  ------    --------------    ----- 
 0   Category  1000000 non-null  object
 1   Values    1000000 non-null  int32 
dtypes: int32(1), object(1)
memory usage: 59.1 MB

Memory usage after optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
 #   Column    Non-Null Count    Dtype   
---  ------    --------------    -----   
 0   Category  1000000 non-null  category
 1   Values    1000000 non-null  int32   
dtypes: category(1), int32(1)
memory usage: 4.8 MB
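
To see which column accounts for the saving, the per-column figures can be compared with df.memory_usage(deep=True). The following is a minimal sketch rather than part of the original solution. It assumes a smaller frame of 100,000 rows to keep it quick, forces the labels to the object dtype so the "before" case matches the output shown here, and uses df_opt and comparison as illustrative names.

```python
import numpy as np
import pandas as pd

# Smaller stand-in for the sample DataFrame (row count is an assumption).
rng = np.random.default_rng(0)
n_rows = 100_000
df = pd.DataFrame({
    # dtype=object keeps the labels as plain Python strings
    'Category': pd.Series(rng.choice(['A', 'B', 'C', 'D'], size=n_rows), dtype=object),
    'Values': rng.integers(1, 100, size=n_rows),
})

# Per-column memory in bytes; deep=True counts the string objects themselves.
before = df.memory_usage(deep=True)

# Convert a copy so the original frame stays available for comparison.
df_opt = df.copy()
df_opt['Category'] = df_opt['Category'].astype('category')
after = df_opt.memory_usage(deep=True)

# Side-by-side comparison of both measurements.
comparison = pd.DataFrame({'before_bytes': before, 'after_bytes': after})
comparison['saved_bytes'] = comparison['before_bytes'] - comparison['after_bytes']
print(comparison)
print(f"Total: {before.sum() / 1e6:.1f} MB -> {after.sum() / 1e6:.1f} MB")
```

Only the 'Category' row should show a saving; the integer 'Values' column is untouched by the conversion.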

Explanation:

  • Import Libraries:
    • Import the Pandas library for data manipulation.
    • Import the NumPy library for generating random data.
  • Create a Sample DataFrame with Categorical Data:
    • Set a seed for reproducibility using np.random.seed(0).
    • Create a dictionary, data, with a 'Category' key holding one million random labels drawn from 'A', 'B', 'C' and 'D', and a 'Values' key holding one million random integers from 1 to 99.
    • Generate a DataFrame df using the dictionary.
  • Print Memory Usage Before Optimization:
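    • Use df.info(memory_usage='deep') to display the memory usage of the DataFrame before optimization. The 'deep' option makes Pandas count the actual size of the Python strings in object columns, not just the pointers to them.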
  • Convert Column to Category Data Type:
    • Use the astype method to convert the 'Category' column to the category data type, which stores each distinct label once and represents every row by a small integer code.
  • Print Memory Usage After Optimization:
    • Use df.info(memory_usage='deep') to display the memory usage of the DataFrame after optimization.
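
The reason the category data type is so much smaller can be seen by inspecting what a categorical column stores. The sketch below is an illustration, not part of the original solution. It uses a six-row Series named labels (a hypothetical example) and the .cat accessor to show the two pieces: the distinct labels and the per-row integer codes.

```python
import pandas as pd

# A tiny categorical Series; the values are made up for illustration.
labels = pd.Series(['A', 'B', 'A', 'D', 'C', 'A'], dtype='category')

# The distinct labels are stored once (sorted when inferred from the data).
print(list(labels.cat.categories))   # ['A', 'B', 'C', 'D']

# Each row holds only a small integer pointing into that list.
print(labels.cat.codes.tolist())     # [0, 1, 0, 3, 2, 0]
print(labels.cat.codes.dtype)        # int8

# The original labels can be rebuilt by looking up each code.
rebuilt = labels.cat.categories[labels.cat.codes.to_numpy()]
print(list(rebuilt))                 # ['A', 'B', 'A', 'D', 'C', 'A']
```

With only four distinct labels, each row needs one byte for its code, which is why the saving is large here. A column in which almost every value is unique would gain little from the conversion.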
