A Comparative Study of Machine Learning and Stacking Ensemble Models for Diabetes Prediction

Authors

  • Bakti Amirul Jabar, Bina Nusantara University
  • Albertus Januario, Bina Nusantara University
  • Davin Miguel Sanjaya, Bina Nusantara University
  • James Tanuwijaya, Bina Nusantara University

DOI:

https://doi.org/10.21512/emacsjournal.v8i1.15332

Keywords:

Diabetes Prediction, Machine Learning, Pima Indians Dataset, Ensemble Model

Abstract

Diabetes is a chronic metabolic disease that is increasingly widespread around the world, and early diagnosis is crucial. This study evaluates three machine learning models (Logistic Regression, K-Nearest Neighbors (KNN), and Naive Bayes) on the task of diabetes classification using the Pima Indians Diabetes Dataset. Preprocessing comprised imputation of missing values, SMOTE to address class imbalance, and min-max scaling to improve prediction performance. We further applied stacking ensemble learning, with each of the three models also serving as a meta-classifier. The results indicate that KNN was the best individual model (accuracy 77.27%, AUC 0.8444), while the stacking ensemble with Logistic Regression as the meta-model outperformed every individual model (accuracy 80.52%, AUC 0.8604). This suggests that ensemble learning can improve the accuracy of diabetes diagnosis. These findings indicate that combining multiple classification approaches may provide more stable predictions across different patient conditions and clinical attributes. In addition, the preprocessing stages contributed to reducing noise and improving data consistency before model training. The study also highlights the potential of ensemble-based systems to support healthcare professionals during preliminary diabetes screening, particularly in settings with limited medical resources and a growing number of diabetes cases that require rapid assessment.
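
The workflow summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the data are a synthetic stand-in for the Pima Indians Diabetes Dataset, the 80/20 split, median imputation, hyperparameters, and the order of the preprocessing steps are all assumptions, and `smote_oversample` is a hypothetical, simplified hand-written version of SMOTE rather than a library implementation. The scores it prints will not match the figures reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors
from sklearn.preprocessing import MinMaxScaler


def smote_oversample(X, y, k=5, seed=0):
    """Simplified SMOTE (hypothetical helper): create synthetic minority
    samples by interpolating between a minority point and one of its
    k nearest minority neighbours until both classes have equal size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_new = counts.max() - counts.min()
    if n_new == 0:
        return X, y
    X_min = X[y == minority]
    # Ask for k + 1 neighbours because each point is normally its own nearest one.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    return np.vstack([X, synthetic]), np.concatenate([y, np.full(n_new, minority)])


# Synthetic stand-in data (assumption): 768 rows, 8 features, imbalanced classes.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           n_redundant=1, weights=[0.65, 0.35], random_state=42)
mask_rng = np.random.default_rng(42)
X[mask_rng.random(X.shape) < 0.05] = np.nan  # inject missing values to impute

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit imputation and min-max scaling on the training split only.
imputer = SimpleImputer(strategy="median")
scaler = MinMaxScaler()
X_train_p = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_p = scaler.transform(imputer.transform(X_test))

# Oversample the training split only, so the test split keeps its real class ratio.
X_bal, y_bal = smote_oversample(X_train_p, y_train)

# Stacking: three base learners, Logistic Regression as the meta-model.
stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_bal, y_bal)

accuracy = accuracy_score(y_test, stack.predict(X_test_p))
auc = roc_auc_score(y_test, stack.predict_proba(X_test_p)[:, 1])
print(f"accuracy={accuracy:.4f}  AUC={auc:.4f}")
```

Swapping `final_estimator` for `KNeighborsClassifier()` or `GaussianNB()` would give the other two meta-classifier variants. Scaling before oversampling is a design choice made here so that the neighbour distances used for interpolation are not dominated by features with large ranges.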

Author Biographies

Albertus Januario, Bina Nusantara University

Computer Science Department, School of Computer Science

Davin Miguel Sanjaya, Bina Nusantara University

Computer Science Department, School of Computer Science

James Tanuwijaya, Bina Nusantara University

Computer Science Department, School of Computer Science

Published

2026-05-15

How to Cite

Jabar, B. A., Januario, A., Sanjaya, D. M., & Tanuwijaya, J. (2026). A Comparative Study of Machine Learning and Stacking Ensemble Models for Diabetes Prediction. Engineering, MAthematics and Computer Science Journal (EMACS), 8(1), 65–71. https://doi.org/10.21512/emacsjournal.v8i1.15332

Section

Articles