Implementation of Random Forest Algorithm in Handling Imbalanced Data: A Study on Default Models and Hyperparameter Tuning
DOI: https://doi.org/10.21512/ijcshai.v2i2.14417
Keywords: Imbalanced Data, Random Forest, Gradient Boosting, Hyperparameter Tuning, Diabetes Prediction
Abstract
The healthcare industry has benefited greatly from the rapid development of artificial intelligence, especially machine learning (ML). Imbalanced data is a significant problem in medical classification because it can impair model performance, particularly in identifying important minority classes such as patients with a specific disease. This research evaluates how well two ensemble-based algorithms, Random Forest and Gradient Boosting, handle data imbalance in diabetes prediction. The dataset, acquired from Kaggle, includes demographic and medical variables such as age, body mass index, smoking history, HbA1c level, and blood glucose level. The methodology comprises data preprocessing, train-test splitting, model implementation with default parameters, and hyperparameter tuning with Grid Search and Cross Validation. Model performance was assessed with accuracy, precision, recall, F1-score, and AUC-ROC. Both models achieved accuracy above 97%. After tuning, Random Forest reached 97.06% accuracy, 0.974 AUC, and 0.99 positive-class precision, although its recall declined slightly, which could lead to underdiagnosis. Gradient Boosting, by contrast, performed consistently, with an AUC of 0.9791 and an F1-score of 0.81. These results demonstrate that hyperparameter tuning can enhance model performance; however, algorithm selection should be driven by the needs of the application, especially in medical settings where balancing sensitivity and diagnostic precision is crucial.
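The methodology described above maps onto a short scikit-learn pipeline. The sketch below is an illustration of that workflow, not the authors' code: the CSV file name, the categorical column names (gender, smoking_history), the 80/20 split, and the search grid are all assumptions, since the abstract does not specify them.

```python
# Minimal sketch of the pipeline in the abstract (assumptions noted inline).
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score

# File and column names are assumptions based on the Kaggle diabetes dataset.
df = pd.read_csv("diabetes_prediction_dataset.csv")
df = pd.get_dummies(df, columns=["gender", "smoking_history"])  # simple preprocessing

X = df.drop(columns="diabetes")
y = df["diabetes"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # split ratio is an assumption
)

# Step 1: baselines with default parameters.
for name, model in [("Random Forest", RandomForestClassifier(random_state=42)),
                    ("Gradient Boosting", GradientBoostingClassifier(random_state=42))]:
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    print(name)
    print(classification_report(y_test, model.predict(X_test), digits=3))
    print("AUC-ROC:", roc_auc_score(y_test, proba))

# Step 2: hyperparameter tuning with Grid Search + 5-fold Cross Validation.
# Grid values are illustrative, not the paper's search space.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="f1", n_jobs=-1)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Tuned AUC-ROC:",
      roc_auc_score(y_test, search.best_estimator_.predict_proba(X_test)[:, 1]))
```

Scoring the grid search with F1 rather than accuracy is one reasonable choice for this setting, since accuracy alone is misleading when the positive (diabetic) class is a small minority.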
License
Copyright (c) 2025 Ivan William Lianata, Kang Nicholas Darren Nugroho, Yosua Nathanael, Neilson Christopher, Edy Irwansyah

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.