Prediction of Undergraduate Student's Study Completion Status Using MissForest Imputation in Random Forest and XGBoost Models

Intan Nirmala; Hari Wijayanto; Khairil Anwar Notodiputro

doi:10.21512/comtech.v13i1.7388

Authors

Intan Nirmala, Institut Pertanian Bogor
Hari Wijayanto, Institut Pertanian Bogor
Khairil Anwar Notodiputro, Institut Pertanian Bogor


Keywords:

study completion status, MissForest imputation, Random-Forest model, XGBoost model

Abstract

The number of higher education graduates in Indonesia is calculated from students' completion status. However, many undergraduate students have reached the maximum length of study while their completion status remains unknown. This is a problem because the actual number of graduates serves as an indicator in higher education evaluation and as a reference for other policies. Therefore, the unknown completion status of students who have reached the maximum length of study must be predicted. The research compared the performance of Random Forest and Extreme Gradient Boosting (XGBoost) classification models in predicting the unknown completion status. It used a dataset containing 13,377 undergraduate students' profiles from the Higher Education Database (PDDikti) of the Ministry of Education, Culture, Research, and Technology. The dataset was incomplete: 20.9% of the total data was missing. Because missing data might lead to prediction bias, the research also used MissForest imputation to handle the missing data in the classification modelling and compared it with Mean/Mode and Median/Mode imputation. The results show that MissForest outperforms the other two imputation methods with both classifiers but requires the longest computation time. Furthermore, the XGBoost model with MissForest is significantly superior to the Random Forest model with MissForest. Hence, the model chosen to predict the completion status is XGBoost with MissForest imputation.
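
As a rough illustration of the two baseline methods named in the abstract, the following minimal sketch fills numeric fields with the mean or median and categorical fields with the mode, and also computes the proportion of missing cells. It is not the authors' implementation: the function names, the column names (`gpa`, `credits`, `gender`), and the toy records are hypothetical, and missing values are assumed to be encoded as `None`.

```python
from statistics import mean, median, mode


def missing_proportion(rows):
    """Share of cells that are missing (None) across all records."""
    cells = [value for row in rows for value in row.values()]
    return sum(value is None for value in cells) / len(cells)


def impute_simple(rows, numeric_cols, categorical_cols, numeric_stat="mean"):
    """Fill None values: numeric columns with the mean or median,
    categorical columns with the mode of the observed values."""
    stat = mean if numeric_stat == "mean" else median
    fill = {}
    for col in numeric_cols:
        fill[col] = stat([r[col] for r in rows if r[col] is not None])
    for col in categorical_cols:
        fill[col] = mode([r[col] for r in rows if r[col] is not None])
    # Return new records; the input rows are left unchanged.
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]


# Hypothetical toy records, not taken from the PDDikti dataset.
students = [
    {"gpa": 3.0, "credits": 120, "gender": "F"},
    {"gpa": None, "credits": 144, "gender": "M"},
    {"gpa": 2.0, "credits": None, "gender": "M"},
    {"gpa": 3.7, "credits": 150, "gender": None},
]

mean_filled = impute_simple(students, ["gpa", "credits"], ["gender"], "mean")
median_filled = impute_simple(students, ["gpa", "credits"], ["gender"], "median")
```

Both baselines fill each column independently of the others. MissForest instead predicts each missing value from the remaining variables with a random forest, which is consistent with the longer computation time the abstract reports.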


Author Biographies

Intan Nirmala, Institut Pertanian Bogor

Department of Statistics

Hari Wijayanto, Institut Pertanian Bogor

Department of Statistics

Khairil Anwar Notodiputro, Institut Pertanian Bogor

Department of Statistics
