Cross-Prompt Based Automatic Short Answer Grading System

Authors

  • Lucia Dwi Krisnawati, Universitas Kristen Duta Wacana
  • Aditya Wikan Mahastama, Universitas Kristen Duta Wacana
  • Su Cheng Haw, Multimedia University

DOI:

https://doi.org/10.21512/commit.v19i2.13423

Keywords:

Cross-Prompt, Automatic Short Answer Grading (ASAG), Prompt-Specific

Abstract

Research on Automatic Short Answer Grading (ASAG) has shown promising results in recent years. However, several important research gaps remain. Based on the literature review, the researchers identify two critical issues. First, the majority of ASAG models are trained and tested on responses to the same prompt, which raises concerns about their robustness across different prompts. Second, many existing approaches treat the grading task as a binary classification problem. The research aims to bridge these gaps by developing an ASAG system that closely reflects real-world assessment scenarios through a multiclass classification approach and cross-prompt evaluation. It is implemented by training the proposed models on 1,505 responses across 9 prompts and testing them on 175 responses from 3 distinct prompts. The grading task is addressed using regression and classification techniques, including Linear Regression, Logistic Regression, Extreme Gradient Boosting (XGBoost), Adaptive Boosting (AdaBoost), and K-Nearest Neighbors (as a baseline). The grades are categorized into five classes, represented by grades A to E. Both manual and algorithmic data augmentation techniques, including the Synthetic Minority Oversampling Technique (SMOTE), are employed to address class imbalance in the sample data. Across multiple testing scenarios, all five models demonstrate consistent performance, with Linear Regression outperforming the others. During the validation process, it achieves a high accuracy of 0.93, indicating its ability to classify the responses correctly. In the testing phase, it achieves a weighted F1-Score of 0.79, a macro-averaged F1-Score of 0.75, and an RMSE of 0.45. The results suggest a relatively low prediction error.
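The following Python sketch illustrates the kind of cross-prompt evaluation pipeline the abstract describes: training on responses to 9 prompts, testing on 3 unseen prompts, oversampling minority grade classes with SMOTE, and comparing the five models on weighted F1, macro F1, and RMSE. The file name, column names (prompt_id, response, grade), prompt IDs, and TF-IDF features are illustrative assumptions and are not taken from the paper.

# Minimal sketch, assuming a CSV with columns prompt_id, response, grade (A-E).
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import f1_score, mean_squared_error
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

GRADES = ["E", "D", "C", "B", "A"]                      # five classes, encoded 0..4

df = pd.read_csv("responses.csv")                       # hypothetical dataset
test_prompts = [10, 11, 12]                             # 3 held-out prompts (IDs assumed)
train = df[~df["prompt_id"].isin(test_prompts)]         # responses to the 9 training prompts
test = df[df["prompt_id"].isin(test_prompts)]           # responses to the 3 unseen prompts

vec = TfidfVectorizer(max_features=5000)
X_train, X_test = vec.fit_transform(train["response"]), vec.transform(test["response"])
y_train, y_test = train["grade"].map(GRADES.index), test["grade"].map(GRADES.index)

# Oversample minority grade classes in the training split only,
# so the held-out prompts remain untouched.
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

models = {
    "Linear Regression": LinearRegression(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "XGBoost": XGBClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "KNN (baseline)": KNeighborsClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    pred = np.clip(np.rint(pred), 0, 4).astype(int)     # map regression output onto the A-E scale
    print(f"{name}: "
          f"weighted F1 = {f1_score(y_test, pred, average='weighted'):.2f}, "
          f"macro F1 = {f1_score(y_test, pred, average='macro'):.2f}, "
          f"RMSE = {mean_squared_error(y_test, pred) ** 0.5:.2f}")

Encoding the grades ordinally (0 to 4) lets the regression and classification models share the same F1 and RMSE evaluation, which mirrors the abstract's mixed use of regression and classification techniques for the same five-class grading task.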

Author Biographies

Lucia Dwi Krisnawati, Universitas Kristen Duta Wacana

Informatics Department, Faculty of Information Technology

Aditya Wikan Mahastama, Universitas Kristen Duta Wacana

Informatics Department, Faculty of Information Technology

Su Cheng Haw, Multimedia University

Faculty of Computing and Informatics

Published

2025-10-13

How to Cite

[1] L. D. Krisnawati, A. W. Mahastama, and S. C. Haw, “Cross-Prompt Based Automatic Short Answer Grading System,” CommIT (Communication and Information Technology) Journal, vol. 19, no. 2, pp. 281–291, Oct. 2025.