Optimize Memory usage when loading large CSV into Pandas DataFrame


3. Optimize Memory Usage When Loading CSV

Write a Pandas program that loads a large CSV file into a DataFrame and optimizes memory usage by specifying appropriate data types.

Sample Solution:

Python Code:

import pandas as pd  # Import the Pandas library

# Define the CSV file path
csv_file_path = 'large_csv_file.csv'

# Load a small chunk of the CSV file to inspect its columns and inferred data types
chunk = pd.read_csv(csv_file_path, nrows=100)
print(chunk.dtypes)  # Use the inferred dtypes to decide on smaller replacements

# Specify the data types for the columns based on the initial chunk.
# Note: the keys must match the CSV's actual column names; read_csv
# silently ignores dtype entries for columns that are not present.
dtype_dict = {
    'column1': 'int32',
    'column2': 'float32',
    'column3': 'category',
    'column4': 'category',
    # Add more columns with appropriate data types
}

# Load the full CSV file with specified data types to optimize memory usage
df = pd.read_csv(csv_file_path, dtype=dtype_dict)

# Print memory usage after optimization.
# info() prints its report directly and returns None, so call it without print().
print("Memory usage after optimization:")
df.info(memory_usage='deep')

Output:

Memory usage after optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2715 entries, 0 to 2714
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   1         2715 non-null   int64  
 1   1.02      2715 non-null   float64
 2   Folder    2715 non-null   object 
 3   Folder.1  2715 non-null   object 
dtypes: float64(1), int64(1), object(2)
memory usage: 629.5 KB
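For a CSV that is too large to hold in memory even with smaller dtypes, the `chunksize` parameter of `pd.read_csv` returns an iterator of DataFrames, so the file can be processed one piece at a time. A minimal sketch, using an in-memory string in place of a real file (the column names and values are illustrative):

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a large file on disk.
csv_data = "column1,column2\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk is held in memory at a time.
total = 0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=4):
    total += chunk["column2"].sum()

print(total)  # aggregate computed without loading the whole file at once
```

The same pattern combines with `dtype=` to keep each chunk small as well.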

Explanation:

  • Import Pandas Library:
    • Import the Pandas library for data manipulation.
  • Define CSV File Path:
    • Specify the path to the large CSV file with csv_file_path.
  • Load Initial Chunk:
    • Load a small chunk of the CSV file (e.g., 100 rows) using pd.read_csv(csv_file_path, nrows=100) to infer data types.
  • Specify Data Types:
    • Based on the initial chunk, create a dictionary 'dtype_dict' that maps column names to appropriate data types (e.g., 'int32', 'float32', 'category'). The keys must match the column names in the CSV file.
  • Load Full CSV with Specified Data Types:
    • Use pd.read_csv(csv_file_path, dtype=dtype_dict) to load the full CSV file while specifying the data types to optimize memory usage.
  • Print Memory Usage:
    • Call df.info(memory_usage='deep') to print a summary of the DataFrame, including its true memory footprint (deep accounting includes object/string data).

For more Practice: Solve these Related Problems:

  • Write a Pandas program to load a large CSV file by explicitly specifying data types for each column and measure memory usage.
  • Write a Pandas program to compare memory consumption when reading a CSV with default settings versus with optimized data types.
  • Write a Pandas program to load a CSV file and use the memory_usage() method to quantify the benefits of specifying data types.
  • Write a Pandas program to implement dtype specifications during CSV import and evaluate the impact on processing speed and memory.



