Impact of Missing Values on Machine Learning Classification Using a Mean Imputation Strategy
Keywords:
Missing data, mean imputation, Random Forest, classification accuracy, Titanic dataset

Abstract
Missing data frequently lowers classification accuracy and introduces bias into the learning process. Although several methods exist for imputing missing values, their impact at different levels of missingness remains underexplored, particularly for Random Forest. This study is designed to (1) quantify how missing data affects the Random Forest algorithm's performance on the Titanic dataset and (2) determine whether its accuracy can be preserved with mean imputation. We applied mean imputation to fill in missing values at three levels of missingness (20%, 40%, and 60%). A Random Forest classifier with 100 decision trees was used for classification. Accuracy, precision, recall, and the F1-score were used to assess all models, with an 80/20 train-test split (random_state=42) to ensure reproducible results. The results show that accuracy decreases as missingness increases: 87.00% (20%), 85.33% (40%), and 79.29% (60%). Precision changed little (78%–86%), but recall fell to 67.23% at 60% missingness, reflecting reduced sensitivity to minority-class outcomes. Even with this basic imputation, Random Forest outperformed K-Nearest Neighbors imputation using the Instance-Based Learning algorithm IB3 (73% at 20%). These findings indicate that mean imputation works well at up to 20% missingness but weakens at higher levels. They suggest that Random Forest handles missing data robustly and that advanced imputation methods, such as MICE or K-Nearest Neighbors, should be preferred when a large share of the data is missing.
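The pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' actual code: it assumes scikit-learn and uses a synthetic stand-in for the Titanic features, since the real preprocessing steps are not specified here. The imputation strategy, tree count, split ratio, and random seed match those stated above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(42)
X = rng.normal(size=(891, 5))                      # hypothetical stand-in for Titanic features
y = (X[:, 0] + rng.normal(size=891) > 0).astype(int)

# Inject 20% missingness completely at random (the study's lowest level)
mask = rng.random(X.shape) < 0.20
X_missing = X.copy()
X_missing[mask] = np.nan

# Mean imputation: replace each NaN with its column mean
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)

# 80/20 train-test split with random_state=42, as in the study
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imputed, y, test_size=0.20, random_state=42
)

# Random Forest with 100 decision trees
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# The four evaluation metrics used in the study
print("accuracy :", accuracy_score(y_te, pred))
print("precision:", precision_score(y_te, pred))
print("recall   :", recall_score(y_te, pred))
print("f1       :", f1_score(y_te, pred))
```

Repeating the missingness injection at 40% and 60% and re-running the same pipeline would reproduce the comparison across the three levels reported above.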