Imputing missing values using KNN imputation in Pandas

Last update on May 06 2025 13:19:31 (UTC/GMT +8 hours)

11. Imputing Missing Values Using K-Nearest Neighbours

Write a Pandas program that imputes missing values using K-Nearest Neighbours.

The following exercise demonstrates how to impute missing values using the K-Nearest Neighbors (KNN) algorithm.

Sample Solution:

Code:

import pandas as pd
from sklearn.impute import KNNImputer

# Load the dataset
df = pd.read_csv('data.csv')

# Columns to impute (Age and Salary) and columns to carry over unchanged (ID, Name, Gender, Target)
numeric_cols = ['Age', 'Salary']
non_numeric_cols = ['ID', 'Name', 'Gender', 'Target']

# Apply KNN imputation only to the numeric columns
imputer = KNNImputer(n_neighbors=3)
df_numeric_imputed = pd.DataFrame(imputer.fit_transform(df[numeric_cols]), columns=numeric_cols)

# Combine the non-numeric columns with the imputed numeric data
df_imputed = pd.concat([df[non_numeric_cols].reset_index(drop=True), df_numeric_imputed], axis=1)

# Output the dataset with imputed values
print(df_imputed)

Output:

   ID      Name  Gender  Target        Age        Salary
0   1      Sara  Female       0  25.000000  50000.000000
1   2    Ophrah    Male       1  30.000000  60000.000000
2   3    Torben    Male       0  22.000000  70000.000000
3   4  Masaharu    Male       1  35.000000  80000.000000
4   5      Kaya  Female       0  25.666667  55000.000000
5   6   Abaddon    Male       1  29.000000  63333.333333

Explanation:

Import Libraries:

pandas is imported for handling data in DataFrame format.
KNNImputer from sklearn is imported for imputing missing values using K-Nearest Neighbors (KNN).

Load Dataset:

The data.csv file is read using pd.read_csv() and stored in a DataFrame df.
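
The contents of data.csv are not shown in this exercise. As a minimal sketch, assuming the file holds the six rows seen in the output with Kaya's Age and Abaddon's Salary left empty (an inference from the non-integer values in the output, not something the exercise states), an equivalent DataFrame can be built in memory:

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv, reconstructed from the printed output.
# The two empty fields (Kaya's Age, Abaddon's Salary) are an assumption.
csv_text = """ID,Name,Gender,Target,Age,Salary
1,Sara,Female,0,25,50000
2,Ophrah,Male,1,30,60000
3,Torben,Male,0,22,70000
4,Masaharu,Male,1,35,80000
5,Kaya,Female,0,,55000
6,Abaddon,Male,1,29,
"""

# read_csv accepts any file-like object, so StringIO replaces the file path
df = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column before imputing
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Checking isna().sum() first shows which columns actually need imputation.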

Separate Numeric and Non-Numeric Columns:

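Two lists are created: numeric_cols containing the columns to impute ('Age', 'Salary') and non_numeric_cols containing the columns that are carried over unchanged ('ID', 'Name', 'Gender', 'Target'). Despite the list's name, ID and Target hold numeric values; they are kept out of the imputer, presumably because ID is an identifier and Target is a label rather than a feature.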
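
The two lists can also be derived instead of typed by hand. The sketch below is one possible approach, not part of the original solution: it uses select_dtypes() to find numeric columns and then drops ID and Target by name. The excluded set and the small sample frame are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Small sample frame with the same column names as the exercise (values assumed)
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["Sara", "Ophrah", "Kaya"],
    "Gender": ["Female", "Male", "Female"],
    "Target": [0, 1, 0],
    "Age": [25.0, 30.0, np.nan],
    "Salary": [50000.0, 60000.0, 55000.0],
})

# Every column with a numeric dtype; this also picks up ID and Target
numeric_dtype_cols = df.select_dtypes(include="number").columns.tolist()

# Hypothetical rule: leave identifier and label columns out of imputation
excluded = {"ID", "Target"}
numeric_cols = [c for c in numeric_dtype_cols if c not in excluded]
non_numeric_cols = [c for c in df.columns if c not in numeric_cols]

print(numeric_cols)
print(non_numeric_cols)
```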

Initialize and Apply KNN Imputer:

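KNNImputer is initialized with n_neighbors=3, so each missing value is replaced by the mean of that column over the 3 nearest rows that have a value for it. With the default settings, distance is Euclidean computed only over the coordinates both rows have (the nan_euclidean metric), and the neighbors are weighted equally.
The fit_transform() method is applied to the numeric_cols ('Age' and 'Salary') to fill in the missing values. It returns a NumPy array, which is wrapped in pd.DataFrame() to create df_numeric_imputed with the imputed data.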
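
The two imputed values in the output can be reproduced by hand. The sketch below assumes the missing cells are Kaya's Age and Abaddon's Salary (inferred from the output) and compares KNNImputer's result with a manual average over the three nearest rows:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Age and Salary for the six sample rows; np.nan marks the assumed missing cells
X = np.array([
    [25.0, 50000.0],
    [30.0, 60000.0],
    [22.0, 70000.0],
    [35.0, 80000.0],
    [np.nan, 55000.0],   # Kaya: Age missing
    [29.0, np.nan],      # Abaddon: Salary missing
])

imputed = KNNImputer(n_neighbors=3).fit_transform(X)

# Kaya's Age: the three rows closest in Salary to 55000 have salaries
# 50000, 60000 and 70000, and their ages are 25, 30 and 22.
manual_age = np.mean([25.0, 30.0, 22.0])

# Abaddon's Salary: the three rows closest in Age to 29 have ages
# 30, 25 and 35, and their salaries are 60000, 50000 and 80000.
manual_salary = np.mean([60000.0, 50000.0, 80000.0])

print(imputed[4, 0], manual_age)
print(imputed[5, 1], manual_salary)
```

Because each incomplete row has only one observed value, its neighbors are effectively ranked by that single column.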

Combine Imputed Data with Non-Numeric Columns:

The imputed numeric data (df_numeric_imputed) is combined with the original non-numeric columns (df[non_numeric_cols]) using pd.concat().
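Because df_numeric_imputed is built from a NumPy array, it gets a fresh 0-based index. Calling reset_index(drop=True) on the non-numeric columns gives them the same 0-based index, so pd.concat() pairs the rows by position even if df had a different index.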
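
An alternative to resetting the index is to re-attach the original one when wrapping the imputed array. The sketch below is a variation on the solution, not the original code, and uses a hypothetical frame whose index starts at 10 to show that the rows still line up:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical frame whose index is not 0..n-1 (for example, after filtering rows)
df = pd.DataFrame(
    {
        "Name": ["Sara", "Ophrah", "Torben", "Kaya"],
        "Age": [25.0, 30.0, 22.0, np.nan],
        "Salary": [50000.0, 60000.0, 70000.0, 55000.0],
    },
    index=[10, 11, 12, 13],
)
numeric_cols = ["Age", "Salary"]

# fit_transform returns a NumPy array, so the original index is passed back in
imputed_values = KNNImputer(n_neighbors=3).fit_transform(df[numeric_cols])
df_numeric_imputed = pd.DataFrame(imputed_values, columns=numeric_cols, index=df.index)

# Both frames share the same index, so concat pairs each row with itself
df_imputed = pd.concat([df[["Name"]], df_numeric_imputed], axis=1)
print(df_imputed)
```

If neither the index argument nor reset_index(drop=True) were used here, pd.concat() would align on the mismatched labels and produce extra rows filled with NaN.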

Output the Final Dataset:

The fully imputed dataset (df_imputed) is printed, containing both the non-numeric and imputed numeric data.

For more practice, solve these related problems:

Write a Pandas program to impute missing values in a DataFrame using K-Nearest Neighbours based on similar rows.
Write a Pandas program to perform KNN imputation on a dataset and compare the imputed values with the original distribution.
Write a Pandas program to use KNN imputation for a dataset with both numeric and categorical variables.
Write a Pandas program to impute missing values using KNN and then evaluate the impact on a predictive model’s accuracy.

