Prediction of Undergraduate Student’s Study Completion Status Using MissForest Imputation in Random Forest and XGBoost Models
Keywords: study completion status, MissForest imputation, Random Forest model, XGBoost model
The number of higher education graduates in Indonesia is calculated from students' completion status. However, many undergraduate students have reached the maximum length of study while their completion status remains unknown. This is a problem because the actual number of graduates serves as an indicator for evaluating higher education and as a reference for other policies. Therefore, the unknown completion status of students who have reached the maximum length of study must be predicted. This research compared the performance of Random Forest and Extreme Gradient Boosting (XGBoost) classification models in predicting the unknown completion status. The dataset contained 13,377 undergraduate student profiles from the Higher Education Database (PDDikti), Ministry of Education, Culture, Research, and Technology. The dataset was incomplete: 20.9% of the total data was missing. Because missing data can bias predictions, the research used MissForest imputation to handle the missing values before classification modelling and compared it with Mean/Mode and Median/Mode imputation. The results show that MissForest outperforms the other two imputation methods for both classifiers but requires the longest computation time. Furthermore, the XGBoost model with MissForest is significantly superior to the Random Forest model with MissForest. Hence, XGBoost with MissForest imputation was chosen as the best model for predicting completion status.
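As a rough illustration of the two baseline methods named in the abstract, the sketch below fills missing numeric values with the column mean or median and missing categorical values with the column mode. It is not the authors' implementation: the function `impute_simple`, the column names, and the sample records are all hypothetical, and it uses only the Python standard library. MissForest itself, which iteratively predicts each missing value with a random forest, is not shown here.

```python
from statistics import mean, median, mode


def impute_simple(rows, numeric_cols, categorical_cols, numeric_stat=mean):
    """Return a copy of rows with None replaced by a per-column fill value.

    Numeric columns use numeric_stat (mean or median) of the observed values;
    categorical columns use the most frequent observed value (mode).
    """
    fill = {}
    for col in numeric_cols:
        observed = [r[col] for r in rows if r[col] is not None]
        fill[col] = numeric_stat(observed)
    for col in categorical_cols:
        observed = [r[col] for r in rows if r[col] is not None]
        fill[col] = mode(observed)
    # Build new dicts so the original records are left untouched.
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]


# Hypothetical student records; these column names are invented for illustration
# and do not come from the PDDikti dataset.
students = [
    {"gpa": 3.0, "credits": 140, "status": "active"},
    {"gpa": None, "credits": 120, "status": "leave"},
    {"gpa": 3.5, "credits": None, "status": "active"},
    {"gpa": 2.0, "credits": 100, "status": None},
]

# Mean/Mode imputation versus Median/Mode imputation on the same records.
mean_mode = impute_simple(students, ["gpa", "credits"], ["status"], numeric_stat=mean)
median_mode = impute_simple(students, ["gpa", "credits"], ["status"], numeric_stat=median)
```

The two baselines differ only in the statistic used for numeric columns, which is why a single function with a `numeric_stat` parameter covers both. Each fill value depends on one column alone, whereas MissForest also draws on the other columns, which is consistent with the longer computation time the abstract reports.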
Copyright (c) 2022 Intan Nirmala, Prof. Dr. Ir. Hari Wijayanto, M. Si., Prof. Dr. Ir. Khairil Anwar Notodiputro, M. Si.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.