
Imputing missing values using KNN imputation in Pandas

Pandas: Machine Learning Integration Exercise-11 with Solution

Write a Pandas program that imputes missing values using K-Nearest Neighbors (KNN).

The following exercise demonstrates how to fill in missing numeric values with scikit-learn's KNNImputer, which replaces each missing value with the average of that column over the most similar rows.

Sample Solution:

Code:

import pandas as pd
from sklearn.impute import KNNImputer

# Load the dataset
df = pd.read_csv('data.csv')

# Separate the columns to impute (Age and Salary) from the columns left unchanged (ID, Name, Gender, Target)
numeric_cols = ['Age', 'Salary']
non_numeric_cols = ['ID', 'Name', 'Gender', 'Target']

# Apply KNN imputation only to the numeric columns
imputer = KNNImputer(n_neighbors=3)
df_numeric_imputed = pd.DataFrame(imputer.fit_transform(df[numeric_cols]), columns=numeric_cols)

# Combine the non-numeric columns with the imputed numeric data
df_imputed = pd.concat([df[non_numeric_cols].reset_index(drop=True), df_numeric_imputed], axis=1)

# Output the dataset with imputed values
print(df_imputed)

Output:

   ID      Name  Gender  Target        Age        Salary
0   1      Sara  Female       0  25.000000  50000.000000
1   2    Ophrah    Male       1  30.000000  60000.000000
2   3    Torben    Male       0  22.000000  70000.000000
3   4  Masaharu    Male       1  35.000000  80000.000000
4   5      Kaya  Female       0  25.666667  55000.000000
5   6   Abaddon    Male       1  29.000000  63333.333333
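
The exercise does not show the contents of data.csv. As a hedged reconstruction, the output above is consistent with an input in which Kaya's Age and Abaddon's Salary are missing. The sketch below builds such a frame in memory and runs the same steps. The cell values and the choice of missing cells are assumptions inferred from the printed output, not the original file, and the name other_cols is introduced here only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical stand-in for data.csv, inferred from the printed output.
# Assumption: Kaya's Age and Abaddon's Salary were the missing cells.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Name": ["Sara", "Ophrah", "Torben", "Masaharu", "Kaya", "Abaddon"],
    "Gender": ["Female", "Male", "Male", "Male", "Female", "Male"],
    "Target": [0, 1, 0, 1, 0, 1],
    "Age": [25, 30, 22, 35, np.nan, 29],
    "Salary": [50000, 60000, 70000, 80000, 55000, np.nan],
})

numeric_cols = ["Age", "Salary"]
other_cols = ["ID", "Name", "Gender", "Target"]  # left untouched by the imputer

# Impute only the numeric feature columns, keeping the original row index
imputer = KNNImputer(n_neighbors=3)
imputed = pd.DataFrame(
    imputer.fit_transform(df[numeric_cols]),
    columns=numeric_cols,
    index=df.index,
)

# Both frames share df's index, so concat pairs up the correct rows
df_imputed = pd.concat([df[other_cols], imputed], axis=1)
print(df_imputed)
```

Passing index=df.index when wrapping the array is an alternative to calling reset_index(drop=True) on the other columns; either way the two frames end up with matching row labels before they are concatenated.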


Explanation:

  • Import Libraries:
    • pandas is imported for handling data in DataFrame format.
    • KNNImputer is imported from sklearn.impute for imputing missing values using K-Nearest Neighbors (KNN).
  • Load Dataset:
    • The data.csv file is read using pd.read_csv() and stored in a DataFrame df.
  • Separate Numeric and Non-Numeric Columns:
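    • Two lists are created: numeric_cols containing the columns to impute ('Age', 'Salary') and non_numeric_cols containing the columns that are left unchanged ('ID', 'Name', 'Gender', 'Target'). Despite its name, non_numeric_cols also holds the numeric ID and Target columns, presumably because they are an identifier and a label rather than features.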
  • Initialize and Apply KNN Imputer:
    • KNNImputer is initialized with n_neighbors=3, meaning that each missing value is replaced by the mean of that column over the 3 nearest rows that have the value, with distance measured only on the values both rows have observed.
    • The fit_transform() method is applied to df[numeric_cols] ('Age' and 'Salary') to fill in the missing values. It returns a NumPy array, which is wrapped in a new DataFrame df_numeric_imputed with the original column names.
  • Combine Imputed Data with Non-Numeric Columns:
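    • The imputed numeric data (df_numeric_imputed) is placed side by side with the original non-numeric columns (df[non_numeric_cols]) using pd.concat() with axis=1.
    • pd.concat() aligns rows by index label, and df_numeric_imputed is built from an array, so it has a fresh 0 to n-1 index. Calling reset_index(drop=True) on the non-numeric columns gives them the same index, so the rows are matched correctly during concatenation.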
  • Output the Final Dataset:
    • The fully imputed dataset (df_imputed) is printed, containing both the non-numeric and imputed numeric data.
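
To make the neighbor selection described above concrete, here is a minimal sketch that computes the two imputed values by hand. It assumes KNNImputer's default settings (the nan_euclidean metric and uniform weights) and uses hypothetical Age/Salary values chosen to match the sample output. The helper knn_fill is an illustrative name, not part of scikit-learn, and it simplifies what the library does internally.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Hypothetical Age/Salary values (assumed, chosen to match the sample output)
X = np.array([
    [25, 50000],
    [30, 60000],
    [22, 70000],
    [35, 80000],
    [np.nan, 55000],  # Age missing
    [29, np.nan],     # Salary missing
], dtype=float)


def knn_fill(X, row, col, k=3):
    """Average column `col` over the k nearest rows that have it observed."""
    # Distances from `row` to every row, ignoring coordinates missing in either
    dist = nan_euclidean_distances(X[[row]], X)[0]
    # Donors: other rows with an observed value in `col` and a defined distance
    donors = [
        i for i in range(len(X))
        if i != row and not np.isnan(X[i, col]) and not np.isnan(dist[i])
    ]
    nearest = sorted(donors, key=lambda i: dist[i])[:k]
    return X[nearest, col].mean()


age_row4 = knn_fill(X, row=4, col=0)     # neighbors chosen by Salary
salary_row5 = knn_fill(X, row=5, col=1)  # neighbors chosen by Age
print(age_row4, salary_row5)
```

Under these assumptions, the row with the missing Age is closest in Salary to the rows with ages 25, 30 and 22, and the row with the missing Salary is closest in Age to the rows with salaries 60000, 50000 and 80000, so the averages agree with the 25.666667 and 63333.333333 shown in the output.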

