Pandas - Detecting and removing outliers in a DataFrame using Z-score
Pandas: Data Cleaning and Preprocessing Exercise-5 with Solution
Write a Pandas program to handle outliers in a DataFrame using the Z-score method.
This exercise demonstrates how to identify and remove outliers from a DataFrame using the Z-score method.
Sample Solution:
Code:
import pandas as pd
# Create a sample DataFrame with outliers
df = pd.DataFrame({
    'Name': ['David', 'Annabel', 'Charlie', 'David'],
    'Age': [25, 30, 22, 99]  # '99' is an outlier
})
# Calculate Z-scores to identify outliers
mean_age = df['Age'].mean()
std_age = df['Age'].std()
df['Z_Score'] = (df['Age'] - mean_age) / std_age
# Remove rows where Z-score is above 2 or below -2 (outliers)
df_no_outliers = df[df['Z_Score'].abs() <= 2]
# Drop the Z_Score column
df_no_outliers = df_no_outliers.drop(columns='Z_Score')
# Output the result
print(df_no_outliers)
Output:
      Name  Age
0    David   25
1  Annabel   30
2  Charlie   22
3    David   99
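The row with Age 99 is still present in the output. A quick way to see why is to print the Z-scores before filtering. The snippet below is a minimal sketch that reuses the same sample data; the variable name `z_scores` is introduced here for illustration only. With these four rows, the largest absolute Z-score is about 1.49, so no row crosses the threshold of 2.

```python
import pandas as pd

# Same sample data as in the solution above
df = pd.DataFrame({
    'Name': ['David', 'Annabel', 'Charlie', 'David'],
    'Age': [25, 30, 22, 99]
})

# Series.std() defaults to the sample standard deviation (ddof=1)
z_scores = (df['Age'] - df['Age'].mean()) / df['Age'].std()

# Show each age next to its Z-score, rounded for readability
print(df.assign(Z_Score=z_scores.round(2)))

# True only if every row would survive the |Z| <= 2 filter
print((z_scores.abs() <= 2).all())
```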
Explanation:
- Created a DataFrame with an intended outlier in the 'Age' column (99).
- Calculated a Z-score for each age as (value - mean) / standard deviation. Series.std() uses the sample standard deviation (ddof=1) by default.
- Kept only the rows whose absolute Z-score is at most 2. In this four-row sample the Z-score of 99 is only about 1.49, so no row exceeds the threshold and the output still contains all four rows.
- Dropped the helper Z_Score column before printing the result.
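The filter has no effect on the sample above because of its size. When the sample standard deviation is used, the largest absolute Z-score that n values can produce is (n - 1) / sqrt(n), which is 1.5 for n = 4, so a threshold of 2 can never be reached. The sketch below wraps the same steps in a helper and applies it to a larger sample, where 99 is flagged. The function name `drop_zscore_outliers` and the extra ages are hypothetical and are not part of the original exercise.

```python
import pandas as pd

def drop_zscore_outliers(df, column, threshold=2.0):
    """Return the rows of df whose Z-score in `column` is within +/- threshold."""
    std = df[column].std()  # sample standard deviation (ddof=1)
    if std == 0 or pd.isna(std):
        # Identical values or too few rows: nothing can be flagged
        return df.copy()
    z = (df[column] - df[column].mean()) / std
    return df[z.abs() <= threshold]

# Hypothetical larger sample: ten typical ages plus one extreme value
ages = pd.DataFrame({'Age': [25, 30, 22, 27, 31, 24, 29, 26, 28, 23, 99]})
cleaned = drop_zscore_outliers(ages, 'Age')
print(cleaned['Age'].tolist())
# prints [25, 30, 22, 27, 31, 24, 29, 26, 28, 23]
```

With eleven rows, 99 has a Z-score of roughly 2.99 and is dropped. For very small datasets, a lower threshold or a method based on the interquartile range may be a better fit.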