The Effect of Combining Datasets in Diabetes Prediction Using Ensemble Learning Techniques

Emad Majeed Hameed; Hardik Joshi; Ahmed Abdul Azeez Ismael

doi:10.21512/commit.v19i1.12064

Authors

Emad Majeed Hameed, Gujarat University and Middle Technical University
Hardik Joshi, Gujarat University
Ahmed Abdul Azeez Ismael, Middle Technical University

DOI:

https://doi.org/10.21512/commit.v19i1.12064

Keywords:

Combined Dataset, Diabetes Prediction, Ensemble Learning

Abstract

Diabetes prediction models often generalize poorly because they rely on single-population datasets, which fail to capture the diversity of real-world patient demographics. This limitation reduces their clinical applicability across ethnic groups and geographic regions. The research aims to improve the accuracy and generalizability of diabetes prediction by combining multiple datasets and applying ensemble learning techniques, addressing the challenges of imbalanced data and population diversity. It combines two publicly available datasets (Pima Indians: 768 samples; German Society: 2,000 samples) and applies preprocessing procedures to them. Comparing performance on the individual datasets (Pima Indians and German Society) with performance on the combined dataset shows that models trained on the combined data improve on all metrics. On the Pima Indians dataset, the Random Forest model outperforms the other ensemble models, achieving an accuracy of 0.817. On the German Society dataset, Gradient Boosting and Random Forest achieve the highest accuracies, 0.996 and 0.994, respectively. On the combined dataset, Gradient Boosting and Random Forest again yield the best accuracies, 0.991 and 0.988, respectively. This improvement reflects the ability of models trained on combined data to better accommodate diversity in the data, allowing them to generalize more effectively when applied to different populations. Future research should explore deep learning techniques and additional diverse datasets to further enhance model performance.
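
The comparison described in the abstract (models trained on each dataset alone versus on the pooled data) can be sketched roughly as follows. This is a minimal illustration only, not the authors' pipeline: the two cohorts are synthetic, the feature names (`glucose`, `bmi`), the 0.35 positive rate, and the 30% test fraction are assumptions, and a small bagged ensemble of one-split trees stands in for the Random Forest and Gradient Boosting models named in the abstract. All function names are hypothetical.

```python
import random

def make_cohort(n, glucose_shift, seed):
    """Synthetic stand-in for one dataset: rows of (glucose, bmi, label)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = 1 if rng.random() < 0.35 else 0
        glucose = rng.gauss(100 + 40 * label + glucose_shift, 15)
        bmi = rng.gauss(27 + 5 * label, 4)
        rows.append((glucose, bmi, label))
    return rows

def split(rows, test_fraction, seed):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]

def fit_stump(rows):
    """One-split tree: pick the feature and threshold with the fewest training errors.
    Assumes (as in the synthetic data) that higher values indicate the positive class."""
    best = None
    for feature in (0, 1):
        values = sorted(r[feature] for r in rows)
        step = max(1, len(values) // 20)  # try about 20 candidate thresholds
        for threshold in values[::step]:
            errors = sum((r[feature] > threshold) != (r[2] == 1) for r in rows)
            if best is None or errors < best[0]:
                best = (errors, feature, threshold)
    return best[1], best[2]

def fit_bagged_stumps(rows, n_estimators=15, seed=0):
    """Bagging: fit each stump on a bootstrap resample of the training rows."""
    rng = random.Random(seed)
    return [fit_stump(rng.choices(rows, k=len(rows))) for _ in range(n_estimators)]

def predict(ensemble, row):
    """Majority vote; a stump votes 1 when the feature value exceeds its threshold."""
    votes = sum(row[feature] > threshold for feature, threshold in ensemble)
    return 1 if 2 * votes > len(ensemble) else 0

def accuracy(ensemble, rows):
    return sum(predict(ensemble, r) == r[2] for r in rows) / len(rows)

# Two hypothetical cohorts with slightly different glucose distributions.
cohort_a = make_cohort(200, glucose_shift=0, seed=1)
cohort_b = make_cohort(400, glucose_shift=10, seed=2)
train_a, test_a = split(cohort_a, 0.3, seed=3)
train_b, test_b = split(cohort_b, 0.3, seed=4)

# Combining datasets here simply means pooling rows that share one schema.
train_combined = train_a + train_b
test_combined = test_a + test_b

# Score every model on the pooled test set so the comparison covers both cohorts.
results = {}
for name, train in [("A", train_a), ("B", train_b), ("combined", train_combined)]:
    results[name] = accuracy(fit_bagged_stumps(train), test_combined)

for name, score in results.items():
    print(f"trained on {name}: accuracy on pooled test set = {score:.3f}")
```

Evaluating all three models on the same pooled test set is one way to probe cross-population generalization; scoring each model only on its own dataset's held-out split would be another reasonable design. On real data, the preprocessing and class-imbalance handling mentioned in the abstract would come before the split and fit steps.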

Author Biographies

Emad Majeed Hameed, Gujarat University and Middle Technical University

Department of Computer Science, Gujarat University

Baquba Technical Institute, Middle Technical University

Hardik Joshi, Gujarat University

Department of Computer Science

Ahmed Abdul Azeez Ismael, Middle Technical University

Baquba Technical Institute
