Imputing missing values using KNN imputation in Pandas
Pandas: Machine Learning Integration Exercise-11 with Solution
Write a Pandas program that imputes missing values using K-Nearest Neighbors (KNN).
The following exercise demonstrates how to impute missing values using the K-Nearest Neighbors (KNN) algorithm.
Sample Solution:
Code:
import pandas as pd
from sklearn.impute import KNNImputer
# Load the dataset
df = pd.read_csv('data.csv')
# Separate the columns to impute (Age and Salary) from the columns carried through unchanged (ID, Name, Gender, Target)
numeric_cols = ['Age', 'Salary']
non_numeric_cols = ['ID', 'Name', 'Gender', 'Target']
# Apply KNN imputation only to the numeric columns
imputer = KNNImputer(n_neighbors=3)
df_numeric_imputed = pd.DataFrame(imputer.fit_transform(df[numeric_cols]), columns=numeric_cols)
# Combine the non-numeric columns with the imputed numeric data
df_imputed = pd.concat([df[non_numeric_cols].reset_index(drop=True), df_numeric_imputed], axis=1)
# Output the dataset with imputed values
print(df_imputed)
Output:
   ID      Name  Gender  Target        Age        Salary
0   1      Sara  Female       0  25.000000  50000.000000
1   2    Ophrah    Male       1  30.000000  60000.000000
2   3    Torben    Male       0  22.000000  70000.000000
3   4  Masaharu    Male       1  35.000000  80000.000000
4   5      Kaya  Female       0  25.666667  55000.000000
5   6   Abaddon    Male       1  29.000000  63333.333333
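The exercise does not include data.csv, so the solution cannot be run as given. The sketch below replaces the file with an inline DataFrame. The sample values are an assumption reconstructed from the printed output: every number is copied from it, and NaN is placed in the two cells that show non-round values (Kaya's Age and Abaddon's Salary). The real file may differ.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical stand-in for data.csv, reconstructed from the printed output.
# NaN marks the two cells assumed to be missing in the original file.
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Name': ['Sara', 'Ophrah', 'Torben', 'Masaharu', 'Kaya', 'Abaddon'],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
    'Target': [0, 1, 0, 1, 0, 1],
    'Age': [25, 30, 22, 35, np.nan, 29],
    'Salary': [50000, 60000, 70000, 80000, 55000, np.nan],
})

numeric_cols = ['Age', 'Salary']
non_numeric_cols = ['ID', 'Name', 'Gender', 'Target']

# Same steps as the sample solution: impute Age and Salary, then reattach the other columns
imputer = KNNImputer(n_neighbors=3)
df_numeric_imputed = pd.DataFrame(
    imputer.fit_transform(df[numeric_cols]), columns=numeric_cols
)
df_imputed = pd.concat(
    [df[non_numeric_cols].reset_index(drop=True), df_numeric_imputed], axis=1
)
print(df_imputed)
```

With this sample, the two filled cells should come out as the mean of three neighbors each: (25 + 30 + 22) / 3 for Kaya's Age and (60000 + 50000 + 80000) / 3 for Abaddon's Salary.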
Explanation:
- Import Libraries:
- pandas is imported for handling data in DataFrame format.
- KNNImputer from sklearn is imported for imputing missing values using K-Nearest Neighbors (KNN).
- Load Dataset:
- The data.csv file is read using pd.read_csv() and stored in a DataFrame df.
- Separate Numeric and Non-Numeric Columns:
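- Two lists are created: numeric_cols, containing the columns to impute ('Age', 'Salary'), and non_numeric_cols, containing the columns left untouched ('ID', 'Name', 'Gender', 'Target'). Note that 'ID' and 'Target' hold numbers, but they are an identifier and a label rather than features, so they are kept out of the imputation.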
- Initialize and Apply KNN Imputer:
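- KNNImputer is initialized with n_neighbors=3, meaning that each missing value is replaced by the mean of that column over the 3 nearest rows that have a value for it. By default, distances are computed with a NaN-aware Euclidean metric that uses only the features present in both rows, and the neighbors are weighted equally.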
- The fit_transform() method is applied to the numeric_cols ('Age' and 'Salary') to fill in the missing values, creating a DataFrame df_numeric_imputed with the imputed data.
- Combine Imputed Data with Non-Numeric Columns:
- The imputed numeric data (df_numeric_imputed) is combined with the original non-numeric columns (df[non_numeric_cols]) using pd.concat().
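- fit_transform() returns a NumPy array, so df_numeric_imputed gets a fresh 0 to n-1 index. Because pd.concat() with axis=1 aligns rows by index label, reset_index(drop=True) gives the non-numeric columns the same 0 to n-1 index so that each row is matched with its own imputed values.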
- Output the Final Dataset:
- The fully imputed dataset (df_imputed) is printed, containing both the non-numeric and imputed numeric data.
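To make the neighbor averaging concrete, the two imputed values in the output can be checked by hand. This is a minimal sketch, not KNNImputer's actual implementation. It assumes the same reconstructed sample (NaN for Kaya's Age and Abaddon's Salary), and it relies on there being only two features: with a single shared feature, ranking rows by the NaN-aware Euclidean distance is the same as ranking them by the absolute difference in that feature. The helper impute_from_neighbors is a hypothetical name introduced for illustration.

```python
import numpy as np

# Hypothetical sample reconstructed from the printed output (not the real data.csv)
age = np.array([25, 30, 22, 35, np.nan, 29], dtype=float)
salary = np.array([50000, 60000, 70000, 80000, 55000, np.nan], dtype=float)

def impute_from_neighbors(target, other, row, k=3):
    """Average `target` over the k rows closest to `row`, measured on `other`."""
    # Simplification: a donor must have both the value to borrow and the
    # feature used for the distance; rows lacking either are skipped.
    donors = np.flatnonzero(~np.isnan(target) & ~np.isnan(other))
    distance = np.abs(other[donors] - other[row])
    nearest = donors[np.argsort(distance, kind='stable')[:k]]
    return target[nearest].mean()

# Kaya (row 4) has no Age: neighbors are chosen by Salary
kaya_age = impute_from_neighbors(age, salary, row=4)
# Abaddon (row 5) has no Salary: neighbors are chosen by Age
abaddon_salary = impute_from_neighbors(salary, age, row=5)

print(round(kaya_age, 6), round(abaddon_salary, 6))
```

Under these assumptions, Kaya's nearest rows by Salary are Sara, Ophrah and Torben (ages 25, 30, 22), and Abaddon's nearest rows by Age are Ophrah, Sara and Masaharu (salaries 60000, 50000, 80000), which gives the 25.666667 and 63333.333333 shown in the output.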