w3resource

Reduce memory usage in Pandas DataFrame using astype method

Pandas: Performance Optimization Exercise-4 with Solution

Write a Pandas program that uses the "astype" method to convert the data types of a DataFrame and measures the reduction in memory usage.

Sample Solution:

Python Code:

import pandas as pd  # Import the Pandas library
import numpy as np  # Import the NumPy library

# Create a sample DataFrame with mixed data types
np.random.seed(0)  # Set seed for reproducibility
data = {
    'int_col': np.random.randint(0, 100, size=100000),
    'float_col': np.random.random(size=100000) * 100,
    'category_col': np.random.choice(['A', 'B', 'C'], size=100000),
    'object_col': np.random.choice(['foo', 'bar', 'baz'], size=100000)
}
df = pd.DataFrame(data)

# Print memory usage before optimization
print("Memory usage before optimization:")
df.info(memory_usage='deep')  # info() prints the report itself and returns None

# Convert data types using astype method
df['int_col'] = df['int_col'].astype('int16')
df['float_col'] = df['float_col'].astype('float32')
df['category_col'] = df['category_col'].astype('category')
df['object_col'] = df['object_col'].astype('category')

# Print memory usage after optimization
print("\nMemory usage after optimization:")
df.info(memory_usage='deep')

Output:

Memory usage before optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   int_col       100000 non-null  int32  
 1   float_col     100000 non-null  float64
 2   category_col  100000 non-null  object 
 3   object_col    100000 non-null  object 
dtypes: float64(1), int32(1), object(2)
memory usage: 12.4 MB

Memory usage after optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype   
---  ------        --------------   -----   
 0   int_col       100000 non-null  int16   
 1   float_col     100000 non-null  float32 
 2   category_col  100000 non-null  category
 3   object_col    100000 non-null  category
dtypes: category(2), float32(1), int16(1)
memory usage: 781.9 KB

Explanation:

  • Import libraries:
    • Import the Pandas library for data manipulation.
    • Import the NumPy library for generating random data.
  • Create a sample DataFrame:
    • Set a seed for reproducibility using np.random.seed(0).
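    • Create a dictionary data with four columns: random integers from 0 to 99, random floats from 0 to 100, and two string columns ('category_col' and 'object_col') that each hold only three distinct values. Both string columns start out with the object dtype.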
    • Generate a DataFrame df using the dictionary.
  • Print memory usage before optimization:
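    • Use df.info(memory_usage='deep') to display the memory usage of the DataFrame before optimization. With 'deep', pandas also counts the memory held by the Python string objects in object columns, not just the pointers to them.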
  • Convert data types using astype method:
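    • Convert 'int_col' to 'int16'. The values lie between 0 and 99, so they fit comfortably in the int16 range (-32768 to 32767). The starting dtype (int32 in the output above) depends on the platform and NumPy version; it is often int64.
    • Convert 'float_col' from 'float64' to 'float32'. This halves the storage per value at the cost of precision (about 7 significant decimal digits instead of about 15).
    • Convert 'category_col' and 'object_col' to 'category'. Each distinct string is then stored once, and the column itself holds small integer codes, which pays off when a column has few unique values.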
  • Print memory usage after optimization:
    • Use df.info(memory_usage='deep') to display the memory usage of the DataFrame after optimization. In the run shown above, usage drops from 12.4 MB to 781.9 KB, a reduction of roughly 94%.
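
Measuring the reduction:

The two df.info() reports leave the actual comparison to the reader. As a minimal sketch of how the saving could be quantified, the snippet below rebuilds the same sample data and compares df.memory_usage(deep=True) before and after the conversion. The names optimized, report, and reduction_pct are illustrative choices rather than part of the original solution, and the range check before the int16 cast is an added precaution. Exact byte counts depend on the platform and on the pandas and NumPy versions, so no expected output is shown.

```python
import numpy as np
import pandas as pd

# Rebuild the same sample data as in the solution
np.random.seed(0)
n = 100000
df = pd.DataFrame({
    'int_col': np.random.randint(0, 100, size=n),
    'float_col': np.random.random(size=n) * 100,
    'category_col': np.random.choice(['A', 'B', 'C'], size=n),
    'object_col': np.random.choice(['foo', 'bar', 'baz'], size=n),
})

# Precaution (not in the original solution): confirm every value fits in int16,
# because astype does not guarantee an error on out-of-range integers
limits = np.iinfo(np.int16)
assert df['int_col'].between(limits.min, limits.max).all()

# Per-column memory in bytes; deep=True also counts the string objects
before = df.memory_usage(deep=True)

# astype accepts a column -> dtype mapping and returns a new DataFrame
optimized = df.astype({
    'int_col': 'int16',
    'float_col': 'float32',
    'category_col': 'category',
    'object_col': 'category',
})
after = optimized.memory_usage(deep=True)

# Side-by-side report per column (the 'Index' row covers the RangeIndex)
report = pd.DataFrame({'before_bytes': before, 'after_bytes': after})
report['saved_pct'] = (1 - report['after_bytes'] / report['before_bytes']) * 100
print(report.round(1))

# Overall reduction
total_before = before.sum()
total_after = after.sum()
reduction_pct = (1 - total_after / total_before) * 100
print(f"Total: {total_before / 1024**2:.2f} MB -> {total_after / 1024**2:.2f} MB "
      f"({reduction_pct:.1f}% smaller)")
```

Passing a dictionary to df.astype converts all four columns in one call and leaves the original df untouched, which makes it easy to keep both versions around for the comparison.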
