The Impact of Text Preprocessing in Sarcasm Detection on Indonesian Social Media Contents

Nicholaus Hendrik Jeremy

doi:10.21512/emacsjournal.v7i2.13503

Authors

Nicholaus Hendrik Jeremy, Bina Nusantara University

Keywords:

Sarcasm Detection, IdSarcasm, Social Media, Preprocessing, Natural Language Processing

Abstract

Sarcasm conveys a message by stating its opposite. It is common on social media, where examples are plentiful. In natural language processing, sarcasm detection is difficult on its own, primarily because of the lack of context. Adding to the difficulty, communication on social media is colloquial. One sarcasm benchmark, IdSarcasm, has alleviated one key issue in the development of sarcasm detection. However, no attempt has been made to further preprocess the input before feeding it into the model. Pre-trained language models always use a preprocessed corpus to ensure that the model is built on a quality dataset. Given the current condition of IdSarcasm, further preprocessing steps are necessary to ensure better quality. Specifically, the additional steps needed are handling HTML code, code-mixing, and colloquial writing, which consists of shortened forms, extended forms, spelling variations, and reduplication. Several scenarios are created to observe the effect of the additional preprocessing steps. Each additional step is also tested independently to observe its individual effect. We show that preprocessing is still relevant for data sourced from social media, and we recommend IndoNLU’s IndoBERT or a large multilingual model for sarcasm classification.
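
To make the preprocessing steps named in the abstract more concrete, the following is a minimal Python sketch of four of them: HTML handling, reduplication, extended forms, and shortened forms. It is an illustration under stated assumptions, not the pipeline used in the paper. The function names, the regular expressions, and the three-entry `SHORT_FORMS` table are all hypothetical. Code-mixing and spelling variation are left out because they would need language identification and a curated lexicon.

```python
import html
import re

# Hypothetical lookup table for shortened forms. A real system would need a
# curated colloquial lexicon; the abstract does not say which one was used.
SHORT_FORMS = {"yg": "yang", "gk": "tidak", "bgt": "banget"}


def unescape_html(text: str) -> str:
    """Decode HTML entities (e.g. &amp;) and replace leftover tags with a space."""
    text = html.unescape(text)
    return re.sub(r"<[^>]+>", " ", text)


def expand_reduplication(text: str) -> str:
    """Rewrite the colloquial 'word2' shorthand as 'word-word'."""
    return re.sub(r"\b([A-Za-z]+)2\b", r"\1-\1", text)


def squeeze_extended(text: str) -> str:
    """Collapse a run of three or more identical letters into one letter."""
    return re.sub(r"([A-Za-z])\1{2,}", r"\1", text)


def expand_short_forms(text: str) -> str:
    """Replace whitespace-delimited tokens found in SHORT_FORMS.

    Tokens with attached punctuation are left unchanged in this sketch.
    """
    return " ".join(SHORT_FORMS.get(tok.lower(), tok) for tok in text.split())


def preprocess(text: str) -> str:
    """Apply the four illustrative steps in a fixed order."""
    text = unescape_html(text)
    text = expand_reduplication(text)
    text = squeeze_extended(text)
    return expand_short_forms(text)


if __name__ == "__main__":
    print(preprocess("anak2 yg lucuuu bgt &amp; gemes"))
```

Keeping each step as a separate function mirrors the abstract's design of testing every preprocessing step independently: any single function can be switched on or off to build a scenario.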

Author Biography

Nicholaus Hendrik Jeremy, Bina Nusantara University

Computer Science Department, School of Computer Science

References

Al Shlowiy, A. (2014). Texting abbreviations and language learning. International Journal of Arts & Sciences, 7(3), 455.

Alita, D., Priyanta, S., & Rokhman, N. (2019). Analysis of Emoticon and Sarcasm Effect on Sentiment Analysis of Indonesian Language on Twitter. Journal of Information Systems Engineering and Business Intelligence, 5, 100. https://doi.org/10.20473/jisebi.5.2.100-109

Caucci, G. M., & Kreuz, R. J. (2012). Social and paralinguistic cues to sarcasm. Humor, 25, 1–22. https://doi.org/10.1515/humor-2012-0001

Devianty, R. (2021). Penggunaan Kata Baku Dan Tidak Baku Dalam Bahasa Indonesia. EUNOIA (Jurnal Pendidikan Bahasa Indonesia), 1, 121. https://doi.org/10.30821/eunoia.v1i2.1136

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North, 4171–4186. https://doi.org/10.18653/v1/N19-1423

Dolamic, L., & Savoy, J. (2010). Brief communication: When stopword lists make the difference. Journal of the American Society for Information Science and Technology, 61(1), 200–203. https://doi.org/10.1002/asi.21186

Eke, C. I., Norman, A. A., & Shuib, L. (2021). Context-Based Feature Technique for Sarcasm Identification in Benchmark Datasets Using Deep Learning and BERT Model. IEEE Access, 9, 48501–48518. https://doi.org/10.1109/ACCESS.2021.3068323

Glenwright, M., & Pexman, P. M. (2010). Development of children’s ability to distinguish sarcasm and verbal irony. Journal of Child Language, 37, 429–451. https://doi.org/10.1017/S0305000909009520

Kemp, N. (2010). Texting versus txtng: Reading and writing text messages, and links with other linguistic skills. Writing Systems Research, 2, 53–71. https://doi.org/10.1093/wsr/wsq002

Khotijah, S., Tirtawangsa, J., & Suryani, A. A. (2020). Using LSTM for Context Based Approach of Sarcasm Detection in Twitter. ACM International Conference Proceeding Series. https://doi.org/10.1145/3406601.3406624

Matarazzo, A., & Torlone, R. (2025). A Survey on Large Language Models with some Insights on their Capabilities and Limitations.

Misra, R., & Arora, P. (2023). Sarcasm detection using news headlines dataset. AI Open, 4, 13–18. https://doi.org/10.1016/j.aiopen.2023.01.001

Rahayu, D. A. P., Kuntur, S., & Hayatin, N. (2018). Sarcasm detection on Indonesian twitter feeds. International Conference on Electrical Engineering, Computer Science and Informatics (EECSI), 2018-October, 137–141. https://doi.org/10.1109/EECSI.2018.8752913

Rosid, M. A., Siahaan, D., & Saikhu, A. (2024). Sarcasm Detection in Indonesian-English Code-Mixed Text Using Multihead Attention-Based Convolutional and Bi-Directional GRU. IEEE Access. https://doi.org/10.1109/ACCESS.2024.3436107

Sarica, S., & Luo, J. (2021). Stopwords in technical language processing. PLoS ONE, 16. https://doi.org/10.1371/journal.pone.0254937

Suárez, P. J. O., Romary, L., & Sagot, B. (2020). A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages. Proceedings of the Annual Meeting of the Association for Computational Linguistics, 1703–1714. https://doi.org/10.18653/v1/2020.acl-main.156

Suhartono, D., Wongso, W., & Tri Handoyo, A. (2024). IdSarcasm: Benchmarking and Evaluating Language Models for Indonesian Sarcasm Detection. IEEE Access, 12, 87323–87332. https://doi.org/10.1109/ACCESS.2024.3416955

Toplak, M., & Katz, A. N. (2000). On the uses of sarcastic irony. Journal of Pragmatics, 32, 1467–1488. https://doi.org/10.1016/S0378-2166(99)00101-0

Wang, Z., Wu, Z., Wang, R., & Ren, Y. (2015). Twitter sarcasm detection exploiting a context-based model. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9418, 77–91. https://doi.org/10.1007/978-3-319-26190-4_6

Wilie, B., Vincentio, K., Winata, G. I., Cahyawijaya, S., Li, X., Lim, Z. Y., Soleman, S., Mahendra, R., Fung, P., Bahar, S., & Purwarianti, A. (2020). IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding. Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, 843–857. https://doi.org/10.18653/v1/2020.aacl-main.85

Zhu, N., & Wang, Z. (2020). The paradox of sarcasm: Theory of mind and sarcasm use in adults. Personality and Individual Differences, 163. https://doi.org/10.1016/j.paid.2020.110035