w3resource

Optimize memory usage with Categorical data type in Pandas DataFrame

Pandas: Performance Optimization Exercise-8 with Solution

Write a Pandas program to create a DataFrame with categorical data and use the category data type to optimize memory usage. Measure the performance difference.

Sample Solution:

Python Code:

import pandas as pd  # Import the Pandas library
import numpy as np  # Import the NumPy library

# Create a sample DataFrame with categorical data
np.random.seed(0)  # Set seed for reproducibility
data = {
    'Category': np.random.choice(['A', 'B', 'C', 'D'], size=1000000),
    'Values': np.random.randint(1, 100, size=1000000)
}
df = pd.DataFrame(data)

# Print memory usage before optimization.
# df.info() prints its report itself and returns None, so it is not wrapped in print().
print("Memory usage before optimization:")
df.info(memory_usage='deep')

# Convert the 'Category' column to the category data type
df['Category'] = df['Category'].astype('category')

# Print memory usage after optimization
print("\nMemory usage after optimization:")
df.info(memory_usage='deep')

Output:

Memory usage before optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
 #   Column    Non-Null Count    Dtype 
---  ------    --------------    ----- 
 0   Category  1000000 non-null  object
 1   Values    1000000 non-null  int32 
dtypes: int32(1), object(1)
memory usage: 59.1 MB

Memory usage after optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
 #   Column    Non-Null Count    Dtype   
---  ------    --------------    -----   
 0   Category  1000000 non-null  category
 1   Values    1000000 non-null  int32   
dtypes: category(1), int32(1)
memory usage: 4.8 MB
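
The two reports above can also be compared numerically. The following is a minimal sketch, not part of the original solution, that assumes the same sample data and uses DataFrame.memory_usage(deep=True) to read the per-column byte counts before and after the conversion. The names n_rows, before, after, saved and ratio are illustrative. The exact figures may vary with the platform and the pandas version, so no expected output is shown.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # same seed as the sample solution
n_rows = 1_000_000  # illustrative name; same size as the sample solution

df = pd.DataFrame({
    'Category': np.random.choice(['A', 'B', 'C', 'D'], size=n_rows),
    'Values': np.random.randint(1, 100, size=n_rows),
})

# Bytes used by each column (and the index); deep=True inspects the
# stored string objects instead of counting only the references to them
before = df.memory_usage(deep=True)

# Convert the low-cardinality text column to the category data type
df['Category'] = df['Category'].astype('category')
after = df.memory_usage(deep=True)

# Compare the 'Category' column before and after the conversion
saved = before['Category'] - after['Category']
ratio = before['Category'] / after['Category']

print(f"'Category' before: {before['Category'] / 1024**2:.1f} MiB")
print(f"'Category' after:  {after['Category'] / 1024**2:.1f} MiB")
print(f"Saved: {saved / 1024**2:.1f} MiB (about {ratio:.0f}x smaller)")
```

Only the 'Category' column changes here. The 'Values' column is untouched, so its byte count is the same in both measurements.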

Explanation:

  • Import Libraries:
    • Import the Pandas library for data manipulation.
    • Import the NumPy library for generating random data.
  • Create a Sample DataFrame with Categorical Data:
    • Set a seed for reproducibility using np.random.seed(0).
    • Create a dictionary data with a 'Category' column containing random category labels and a 'Values' column containing random integers.
    • Generate a DataFrame df using the dictionary.
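  • Print Memory Usage Before Optimization:
    • Use df.info(memory_usage='deep') to display the memory usage of the DataFrame before optimization. The 'deep' option counts the memory held by the string objects in the 'Category' column rather than estimating it.
  • Convert Column to Category Data Type:
    • Use the astype method to convert the 'Category' column to the category data type, which stores each distinct label once and represents every row as a small integer code.
  • Print Memory Usage After Optimization:
    • Use df.info(memory_usage='deep') again to display the memory usage after optimization. In the output above, it drops from 59.1 MB to 4.8 MB.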
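
To illustrate why the conversion step saves memory, the sketch below inspects how a categorical column is stored. It is an illustration under stated assumptions rather than part of the original solution: the names labels, cat, decoded and roundtrip_ok, and the smaller size of 1,000 rows, are chosen here for brevity.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# A small text column with only four distinct labels (illustrative size)
labels = pd.Series(np.random.choice(['A', 'B', 'C', 'D'], size=1000))
cat = labels.astype('category')

# Each distinct label is stored once in the list of categories...
print(cat.cat.categories)

# ...and every row holds only a small integer code that points into that list
print(cat.cat.codes.dtype)
print(cat.cat.codes.head())

# Looking up each code in the category list reproduces the original labels
decoded = np.asarray(cat.cat.categories)[cat.cat.codes.to_numpy()]
roundtrip_ok = bool((decoded == labels.to_numpy()).all())
print(roundtrip_ok)
```

The saving depends on the column having few distinct values relative to its length. A column of mostly unique strings gains little from the category data type and may even use more memory.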
