Impact of Missing Values on Machine Learning Classification Using a Mean Imputation Strategy

Authors

  • Fazal Malik, Department of Computer Science, Iqra National University Peshawar, Khyber Pakhtunkhwa (KPK), Pakistan
  • Muhammad Suliman
  • Atiq Ur Rahman
  • Rahmat Hussain
  • Muhammad Javed
  • Ashraf Ullah
  • Afsheen Khalid

Keywords:

Missing data, mean imputation, Random Forest, classification accuracy, Titanic dataset

Abstract

Missing data frequently lowers classification accuracy and introduces bias into the learning process. Although several imputation methods exist, their impact at different levels of missingness has not been explored enough, especially for Random Forest. This study was designed to (1) measure how much missing data affects the Random Forest algorithm's performance on the Titanic dataset and (2) check whether the algorithm's accuracy can be preserved with mean imputation. We evaluated three levels of missingness (20%, 40%, and 60%), filling in the missing values with mean imputation in each case. A Random Forest classifier with 100 decision trees was used for classification. All models were assessed with accuracy, precision, recall, and F1-score, using an 80%/20% train-test split (random_state=42) for reproducibility. Accuracy decreased as missingness increased: 87.00% (20%), 85.33% (40%), and 79.29% (60%). Precision changed little (78%–86%), whereas recall fell to 67.23% at 60% missingness, reflecting reduced sensitivity to minority-class outcomes. Even with basic imputation, Random Forest outperformed other approaches such as K-Nearest Neighbors imputation with the instance-based learning algorithm IB3 (73% at 20%). According to these findings, mean imputation works well with missingness up to 20% but weakens at higher levels. The results show that Random Forest handles missing data well and suggest using more advanced imputation methods, such as MICE or K-Nearest Neighbors, when a large share of the data is missing.
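
The evaluation setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn, substitutes a synthetic dataset from make_classification for the Titanic data, assumes values go missing completely at random, and fits the imputer on the training split only. The function name evaluate_at_missing_rate and all variable names are hypothetical. Only the 100 trees, the 80%/20% split, random_state=42, the three missingness levels, and the four metrics come from the abstract.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def evaluate_at_missing_rate(X, y, missing_rate, seed=42):
    """Mask a fraction of feature values, mean-impute them, and score a Random Forest."""
    rng = np.random.default_rng(seed)
    X_missing = X.astype(float).copy()

    # Assumption: each value is dropped independently (missing completely at random).
    mask = rng.random(X_missing.shape) < missing_rate
    X_missing[mask] = np.nan

    # 80%/20% train-test split with a fixed seed, as stated in the abstract.
    X_train, X_test, y_train, y_test = train_test_split(
        X_missing, y, test_size=0.2, random_state=42
    )

    # Assumption: column means are learned from the training split only,
    # so no information from the test split leaks into the imputed values.
    imputer = SimpleImputer(strategy="mean")
    X_train = imputer.fit_transform(X_train)
    X_test = imputer.transform(X_test)

    # Random Forest with 100 decision trees, as stated in the abstract.
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }


# Synthetic stand-in for the Titanic features and survival labels (assumption).
X, y = make_classification(
    n_samples=800, n_features=8, n_informative=5, random_state=42
)

results = {rate: evaluate_at_missing_rate(X, y, rate) for rate in (0.2, 0.4, 0.6)}

for rate, scores in results.items():
    summary = ", ".join(f"{name}={value:.4f}" for name, value in scores.items())
    print(f"missing={rate:.0%}: {summary}")
```

Because the data here is synthetic, the printed scores will not match the figures reported above; the sketch only shows how the three missingness levels, the imputation step, and the four metrics fit together.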

Published

2025-06-05

How to Cite

Malik, F., Muhammad Suliman, Atiq Ur Rahman, Rahmat Hussain, Muhammad Javed, Ashraf Ullah, & Afsheen Khalid. (2025). Impact of Missing Values on Machine Learning Classification Using a Mean Imputation Strategy. Dialogue Social Science Review (DSSR), 3(6), 221–242. Retrieved from https://dialoguessr.com/index.php/2/article/view/580

Section

Articles
