w3resource

Optimize memory usage with Categorical data type in Pandas DataFrame


8. Optimize Memory with Categorical Data

Write a Pandas program to create a DataFrame with categorical data and use the category data type to optimize memory usage. Measure the difference in memory usage before and after the conversion.

Sample Solution:

Python Code:

import pandas as pd  # Import the Pandas library
import numpy as np  # Import the NumPy library

# Create a sample DataFrame with categorical data
np.random.seed(0)  # Set seed for reproducibility
data = {
    'Category': np.random.choice(['A', 'B', 'C', 'D'], size=1000000),
    'Values': np.random.randint(1, 100, size=1000000)
}
df = pd.DataFrame(data)

# Print memory usage before optimization
# df.info() prints its report directly and returns None, so it is not wrapped in print()
print("Memory usage before optimization:")
df.info(memory_usage='deep')

# Convert the 'Category' column to the category data type
df['Category'] = df['Category'].astype('category')

# Print memory usage after optimization
print("\nMemory usage after optimization:")
df.info(memory_usage='deep')

Output:

Memory usage before optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   Category  1000000 non-null  object
 1   Values    1000000 non-null  int32
dtypes: int32(1), object(1)
memory usage: 59.1 MB

Memory usage after optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   Category  1000000 non-null  category
 1   Values    1000000 non-null  int32
dtypes: category(1), int32(1)
memory usage: 4.8 MB

Explanation:

  • Import Libraries:
    • Import the Pandas library for data manipulation.
    • Import the NumPy library for generating random data.
  • Create a Sample DataFrame with Categorical Data:
    • Set a seed for reproducibility using np.random.seed(0).
    • Create a dictionary data with a 'Category' column containing one million labels drawn at random from 'A', 'B', 'C' and 'D', and a 'Values' column containing one million random integers from 1 to 99.
    • Generate a DataFrame df using the dictionary.
  • Print Memory Usage Before Optimization:
    • Use df.info(memory_usage='deep') to display the memory usage of the DataFrame before optimization.
  • Convert Column to Category Data Type:
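    • Use the astype method to convert the 'Category' column to the category data type. Pandas then stores each distinct label only once and represents the column as small integer codes, which is why the reported memory usage drops from 59.1 MB to 4.8 MB.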
  • Print Memory Usage After Optimization:
    • Use df.info(memory_usage='deep') to display the memory usage of the DataFrame after optimization.
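To make the effect of the conversion more concrete, the following is a minimal sketch (not part of the original solution) that inspects what a categorical column stores and reports the per-column byte counts with memory_usage(deep=True). The variable names and the smaller sample size of 100,000 rows are assumptions chosen for illustration, and the column is forced to object dtype to match the output shown above.

```python
import numpy as np
import pandas as pd

# Illustrative data: 100,000 labels drawn from four values (sizes are an assumption)
rng = np.random.default_rng(0)
labels = pd.Series(rng.choice(['A', 'B', 'C', 'D'], size=100_000), dtype=object)

# Convert to the category dtype
as_category = labels.astype('category')

# Each distinct label is stored once ...
print(as_category.cat.categories.tolist())  # ['A', 'B', 'C', 'D']

# ... and every row holds only a small integer code pointing at a label
print(as_category.cat.codes.dtype)  # int8

# Compare the deep memory footprint of both representations, in bytes
bytes_before = labels.memory_usage(deep=True)
bytes_after = as_category.memory_usage(deep=True)
print(f"object:   {bytes_before:,} bytes")
print(f"category: {bytes_after:,} bytes")
print(f"reduction: {1 - bytes_after / bytes_before:.1%}")
```

Unlike df.info(), memory_usage(deep=True) returns numbers, so the saving can be computed or logged rather than only read off the printed report.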

For more Practice: Solve these Related Problems:

  • Write a Pandas program to convert string columns of a DataFrame into categorical data types and measure memory reduction.
  • Write a Pandas program to create a DataFrame with categorical columns and compare the performance of operations before and after conversion.
  • Write a Pandas program to optimize memory usage by converting a high-cardinality column to a category and evaluate the effect on processing speed.
  • Write a Pandas program to load a dataset, convert appropriate columns to 'category' dtype, and then compare memory_usage() with the original DataFrame.
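As a starting point for these problems, here is a hedged sketch of one possible approach. The helper convert_low_cardinality and its max_unique_ratio threshold are hypothetical names introduced for this example, not part of Pandas. It converts only string columns whose share of distinct values is low, because a column where nearly every value is unique usually gains little or nothing from the category dtype.

```python
import numpy as np
import pandas as pd


def convert_low_cardinality(frame, max_unique_ratio=0.5):
    """Hypothetical helper: return a copy of `frame` in which string columns
    with few distinct values (relative to their length) become categorical."""
    result = frame.copy()
    for name in result.columns:
        column = result[name]
        is_text = (pd.api.types.is_object_dtype(column)
                   or pd.api.types.is_string_dtype(column))
        if not is_text:
            continue  # leave numeric and other columns untouched
        unique_ratio = column.nunique() / max(len(column), 1)
        if unique_ratio <= max_unique_ratio:
            result[name] = column.astype('category')
    return result


# Illustrative data: one low-cardinality and one high-cardinality string column
rng = np.random.default_rng(0)
n = 100_000
sample = pd.DataFrame({
    'Category': pd.Series(rng.choice(['A', 'B', 'C', 'D'], size=n), dtype=object),
    'UserId': pd.Series([f'user_{i}' for i in range(n)], dtype=object),
    'Values': rng.integers(1, 100, size=n),
})

optimized = convert_low_cardinality(sample)

# Side-by-side byte counts per column
before = sample.memory_usage(deep=True)
after = optimized.memory_usage(deep=True)
print(pd.DataFrame({'before': before, 'after': after}))
print(optimized.dtypes)
```

In this sketch 'Category' is converted while 'UserId' stays as it is, since every one of its values is distinct. The 0.5 threshold is an arbitrary assumption and should be tuned against real data.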
