Implementation of Random Forest Algorithm in Handling Imbalanced Data: A Study on Default Models and Hyperparameter Tuning
DOI: https://doi.org/10.21512/ijcshai.v2i2.14417
Keywords: Imbalanced Data, Random Forest, Gradient Boosting, Hyperparameter Tuning, Diabetes Prediction
Abstract
The healthcare industry has benefited greatly from the rapid development of artificial intelligence, especially machine learning (ML). Imbalanced data is a significant problem in medical classification because it can impair model performance, particularly in identifying important minority classes such as patients with a specific disease. This research evaluates how well two ensemble-based algorithms, Random Forest and Gradient Boosting, handle data imbalance in diabetes prediction. The dataset, acquired from Kaggle, includes demographic and medical variables such as age, body mass index, smoking history, HbA1c level, and blood glucose level. The methodology comprises data preprocessing, train-test splitting, model implementation with default parameters, and hyperparameter tuning with Grid Search and Cross Validation. Model performance was assessed with accuracy, precision, recall, F1-score, and AUC-ROC. Both models achieved accuracy above 97%. After tuning, Random Forest reached 97.06% accuracy, 0.974 AUC, and 0.99 positive-class precision, although its recall declined slightly, which could lead to underdiagnosis. Gradient Boosting, by contrast, performed consistently, with an AUC of 0.9791 and an F1-score of 0.81. These results demonstrate that hyperparameter tuning can enhance model performance; however, algorithm selection should be driven by the needs of the application, especially in medical settings where balancing sensitivity and diagnostic precision is crucial.
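The methodology described above maps onto a short scikit-learn pipeline. The sketch below is an illustration of that workflow, not the authors' code: the CSV file name, the categorical column names (gender, smoking_history), the 80/20 split, and the search grid are all assumptions, since the abstract does not specify them.

```python
# Minimal sketch of the pipeline in the abstract (assumptions noted inline).
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score

# File and column names are assumptions based on the Kaggle diabetes dataset.
df = pd.read_csv("diabetes_prediction_dataset.csv")
df = pd.get_dummies(df, columns=["gender", "smoking_history"])  # simple preprocessing

X = df.drop(columns="diabetes")
y = df["diabetes"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # split ratio is an assumption
)

# Step 1: baselines with default parameters.
for name, model in [("Random Forest", RandomForestClassifier(random_state=42)),
                    ("Gradient Boosting", GradientBoostingClassifier(random_state=42))]:
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    print(name)
    print(classification_report(y_test, model.predict(X_test), digits=3))
    print("AUC-ROC:", roc_auc_score(y_test, proba))

# Step 2: hyperparameter tuning with Grid Search + 5-fold Cross Validation.
# Grid values are illustrative, not the paper's search space.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="f1", n_jobs=-1)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Tuned AUC-ROC:",
      roc_auc_score(y_test, search.best_estimator_.predict_proba(X_test)[:, 1]))
```

Scoring the grid search with F1 rather than accuracy is one reasonable choice for this setting, since accuracy alone is misleading when the positive (diabetic) class is a small minority.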
License
Copyright (c) 2025 Ivan William Lianata, Kang Nicholas Darren Nugroho, Yosua Nathanael, Neilson Christopher, Edy Irwansyah

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.