Impact of Missing Values on Machine Learning Classification Using a Mean Imputation Strategy
Keywords:
Missing data, mean imputation, Random Forest, classification accuracy, Titanic dataset

Abstract
Missing data frequently lowers classification accuracy and introduces bias into the learning process. Although several methods exist for imputing missing values, their impact at different levels of missingness remains underexplored, particularly for Random Forest. This study is designed to (1) quantify how missing data affects the Random Forest algorithm's performance on the Titanic dataset and (2) determine whether its accuracy can be preserved with mean imputation. We applied mean imputation to fill in missing values at three levels of missingness (20%, 40%, and 60%). A Random Forest classifier with 100 decision trees was used for classification. Accuracy, precision, recall, and the F1-score were used to assess all models, with an 80/20 train-test split (random_state=42) to ensure reproducible results. The results show that accuracy decreases as missingness increases: 87.00% (20%), 85.33% (40%), and 79.29% (60%). Precision changed little (78%–86%), but recall fell to 67.23% at 60% missingness, reflecting reduced sensitivity to minority-class outcomes. Even with this basic imputation, Random Forest outperformed K-Nearest Neighbors imputation using the Instance-Based Learning algorithm IB3 (73% at 20%). These findings indicate that mean imputation works well at up to 20% missingness but weakens at higher levels. They suggest that Random Forest handles missing data robustly and that advanced imputation methods, such as MICE or K-Nearest Neighbors, should be preferred when a large share of the data is missing.
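The pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' actual code: it assumes scikit-learn and uses a synthetic stand-in for the Titanic features, since the real preprocessing steps are not specified here. The imputation strategy, tree count, split ratio, and random seed match those stated above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(42)
X = rng.normal(size=(891, 5))                      # hypothetical stand-in for Titanic features
y = (X[:, 0] + rng.normal(size=891) > 0).astype(int)

# Inject 20% missingness completely at random (the study's lowest level)
mask = rng.random(X.shape) < 0.20
X_missing = X.copy()
X_missing[mask] = np.nan

# Mean imputation: replace each NaN with its column mean
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)

# 80/20 train-test split with random_state=42, as in the study
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imputed, y, test_size=0.20, random_state=42
)

# Random Forest with 100 decision trees
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# The four evaluation metrics used in the study
print("accuracy :", accuracy_score(y_te, pred))
print("precision:", precision_score(y_te, pred))
print("recall   :", recall_score(y_te, pred))
print("f1       :", f1_score(y_te, pred))
```

Repeating the missingness injection at 40% and 60% and re-running the same pipeline would reproduce the comparison across the three levels reported above.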