w3resource

Imputing missing values using KNN imputation in Pandas


Pandas: Machine Learning Integration Exercise-11 with Solution


Write a Pandas program that imputes missing values using K-Nearest Neighbors (KNN) imputation.

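The following exercise demonstrates how to impute missing values using the K-Nearest Neighbors (KNN) algorithm. Each missing value is replaced with the mean of that column across the k rows that are closest to the incomplete row, where closeness is measured on the features both rows have observed.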

Sample Solution:

Code:

import pandas as pd
from sklearn.impute import KNNImputer

# Load the dataset
df = pd.read_csv('data.csv')

# Columns to impute (Age and Salary) and columns to pass through unchanged.
# Note: ID and Target hold numbers but are an identifier and a label, so they are not imputed.
numeric_cols = ['Age', 'Salary']
non_numeric_cols = ['ID', 'Name', 'Gender', 'Target']

# Apply KNN imputation only to the numeric columns
imputer = KNNImputer(n_neighbors=3)
df_numeric_imputed = pd.DataFrame(imputer.fit_transform(df[numeric_cols]), columns=numeric_cols)

# Combine the non-numeric columns with the imputed numeric data
df_imputed = pd.concat([df[non_numeric_cols].reset_index(drop=True), df_numeric_imputed], axis=1)

# Output the dataset with imputed values
print(df_imputed)

Output:

   ID      Name  Gender  Target        Age        Salary
0   1      Sara  Female       0  25.000000  50000.000000
1   2    Ophrah    Male       1  30.000000  60000.000000
2   3    Torben    Male       0  22.000000  70000.000000
3   4  Masaharu    Male       1  35.000000  80000.000000
4   5      Kaya  Female       0  25.666667  55000.000000
5   6   Abaddon    Male       1  29.000000  63333.333333
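
The contents of data.csv are not shown in this exercise. The sketch below is a self-contained variant that builds the DataFrame inline; the input values, including the positions of the two missing cells, are an assumption inferred from the output above. It also writes the imputed values back into a copy of the DataFrame instead of using pd.concat(), which keeps the original index and column order.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical stand-in for data.csv, inferred from the printed output.
# Assumed gaps: Kaya's Age and Abaddon's Salary.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Name": ["Sara", "Ophrah", "Torben", "Masaharu", "Kaya", "Abaddon"],
    "Gender": ["Female", "Male", "Male", "Male", "Female", "Male"],
    "Age": [25, 30, 22, 35, np.nan, 29],
    "Salary": [50000, 60000, 70000, 80000, 55000, np.nan],
    "Target": [0, 1, 0, 1, 0, 1],
})

numeric_cols = ["Age", "Salary"]
imputer = KNNImputer(n_neighbors=3)

# Work on a copy and assign the imputed array back to the same columns,
# so the index and the column order of the input are preserved.
df_imputed = df.copy()
df_imputed[numeric_cols] = imputer.fit_transform(df[numeric_cols])

print(df_imputed)
```

With this assumed input, the columns appear in their original order (Age and Salary before Target) rather than being moved to the end as in the concatenated version.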

Explanation:

  • Import Libraries:
    • pandas is imported for handling data in DataFrame format.
    • KNNImputer from sklearn is imported for imputing missing values using K-Nearest Neighbors (KNN).
  • Load Dataset:
    • The data.csv file is read using pd.read_csv() and stored in a DataFrame df.
  • Separate Numeric and Non-Numeric Columns:
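    • Two lists are created: numeric_cols containing the columns to impute ('Age', 'Salary') and non_numeric_cols containing the columns that are left unchanged ('ID', 'Name', 'Gender', 'Target'). Despite the list name, ID and Target hold numbers; they are excluded because an identifier and a target label should not be imputed or used to measure distance.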
  • Initialize and Apply KNN Imputer:
    • KNNImputer is initialized with n_neighbors=3, meaning that each missing value is replaced by the mean of that column across the 3 nearest rows that have a value for it.
    • The fit_transform() method is applied to df[numeric_cols] ('Age' and 'Salary') to fill in the missing values. It returns a NumPy array, which is wrapped in a new DataFrame, df_numeric_imputed, with the original column names.
  • Combine Imputed Data with Non-Numeric Columns:
    • The imputed numeric data (df_numeric_imputed) is combined with the original non-numeric columns (df[non_numeric_cols]) using pd.concat().
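    • pd.concat() with axis=1 aligns rows by index label. Because df_numeric_imputed is built from a NumPy array, it has a fresh 0..n-1 index, so reset_index(drop=True) gives the non-numeric columns the same index and the rows line up.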
  • Output the Final Dataset:
    • The fully imputed dataset (df_imputed) is printed, containing both the non-numeric and imputed numeric data.
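
To make the n_neighbors=3 step concrete, the following sketch recomputes one imputed value by hand: the Age in row 4 (25.666667 in the output). The input array is an assumption inferred from the output, and the code is only an illustration of the idea, not scikit-learn's internal implementation. It uses nan_euclidean_distances from sklearn.metrics.pairwise, which skips missing coordinates when measuring distance.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Assumed [Age, Salary] values, inferred from the printed output
X = np.array([
    [25, 50000],
    [30, 60000],
    [22, 70000],
    [35, 80000],
    [np.nan, 55000],   # row 4: Age is missing
    [29, np.nan],      # row 5: Salary is missing
], dtype=float)

target_row = 4
age_col = 0

# Only rows with a known Age can donate a value
donors = np.flatnonzero(~np.isnan(X[:, age_col]))

# Distance from the target row to each donor, ignoring missing coordinates
dist = nan_euclidean_distances(X[[target_row]], X[donors])[0]

# A donor that shares no observed feature with the target has a NaN distance;
# treat it as infinitely far away so it is chosen last
dist = np.where(np.isnan(dist), np.inf, dist)

# Take the 3 closest donors and average their Age
nearest = donors[np.argsort(dist)[:3]]
manual_age = X[nearest, age_col].mean()

print(sorted(nearest.tolist()), manual_age)
```

Under these assumed values, the three nearest donors are the rows with salaries 50000, 60000 and 70000, so the imputed Age is (25 + 30 + 22) / 3. Because Salary is on a much larger scale than Age, it would dominate the distance whenever both features are observed; scaling the features before imputation is a common way to address that.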
