A Comparative Study of Machine Learning and Stacking Ensemble Models for Diabetes Prediction
DOI:
https://doi.org/10.21512/emacsjournal.v8i1.15332
Keywords:
Diabetes Prediction, Machine Learning, Pima Indians Dataset, Ensemble Model
Abstract
Diabetes is a chronic metabolic disease whose prevalence is rising worldwide, which makes early diagnosis crucial. This study evaluates three machine learning models, Logistic Regression, K-Nearest Neighbors (KNN), and Naive Bayes, on diabetes classification using the Pima Indians Diabetes Dataset. Preprocessing consisted of imputation of missing values, SMOTE to address class imbalance, and min-max scaling. We then applied stacking ensemble learning, in which each of the three models was used as a meta-classifier. KNN was the best individual model (accuracy 77.27%, AUC 0.8444), but the stacking ensemble with Logistic Regression as the meta-model outperformed every individual model (accuracy 80.52%, AUC 0.8604). This suggests that ensemble learning can improve the accuracy of diabetes diagnosis and that combining multiple classification approaches may provide more stable predictions across different patient conditions and clinical attributes. The preprocessing stages also contributed to reducing noise and improving data consistency before model training. The study highlights the potential of ensemble-based systems to support healthcare professionals during preliminary diabetes screening, particularly in settings with limited medical resources and a growing number of diabetes cases that require rapid assessment.
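The pipeline summarized in the abstract (imputation, SMOTE, min-max scaling, then a stacking ensemble of Logistic Regression, KNN, and Naive Bayes with a Logistic Regression meta-model) can be illustrated with a minimal scikit-learn sketch. This is not the authors' code. Everything below is an assumption made for illustration: the data is synthetic (shaped like the 768-row, 8-feature Pima dataset), missing values are injected at random and filled with the median, the 80/20 split and all hyperparameters are library defaults or arbitrary choices, and `simple_smote` is a hypothetical, simplified stand-in for a full SMOTE implementation such as the one in the imbalanced-learn package.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors
from sklearn.preprocessing import MinMaxScaler


def simple_smote(X, y, k=5, seed=0):
    """Simplified SMOTE (hypothetical helper): create synthetic minority
    samples by interpolating between a minority sample and one of its k
    nearest minority neighbours, until both classes have equal size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_new = counts.max() - counts.min()
    X_min = X[y == minority]
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    return np.vstack([X, synthetic]), np.concatenate([y, np.full(n_new, minority)])


# Synthetic stand-in for the dataset (assumption: not the real Pima data).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.05] = np.nan  # fake missing values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 1) Imputation, fitted on the training split only.
imputer = SimpleImputer(strategy="median").fit(X_train)
X_train, X_test = imputer.transform(X_train), imputer.transform(X_test)

# 2) Oversample the minority class in the training split only.
X_train_bal, y_train_bal = simple_smote(X_train, y_train)

# 3) Min-max scaling to the [0, 1] range, fitted on the training data.
scaler = MinMaxScaler().fit(X_train_bal)
X_train_bal, X_test = scaler.transform(X_train_bal), scaler.transform(X_test)

# 4) Stacking: three base learners, Logistic Regression as the meta-model.
stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("knn", KNeighborsClassifier()),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train_bal, y_train_bal)

accuracy = accuracy_score(y_test, stack.predict(X_test))
auc = roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1])
print(f"accuracy={accuracy:.4f}  AUC={auc:.4f}")
```

Because the data here is synthetic, the printed scores are not comparable to the figures reported in the abstract. The sketch fits the imputer, oversampler, and scaler on the training split only, so that no information from the held-out test samples reaches the model before evaluation.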
Copyright (c) 2026 Bakti Jabar, Albertus Januario, Davin Miguel Sanjaya, James Tanuwijaya

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.