The Impact of Text Preprocessing in Sarcasm Detection on Indonesian Social Media Contents
DOI:
https://doi.org/10.21512/emacsjournal.v7i2.13503
Keywords:
Sarcasm Detection, IdSarcasm, Social Media, Preprocessing, Natural Language Processing
Abstract
Sarcasm is a way of conveying a message by stating its opposite. This behavior is common on social media, where examples are plentiful. In natural language processing, sarcasm detection is difficult on its own, primarily due to the lack of context. Adding another layer of difficulty, communication on social media is colloquial. One sarcasm benchmark, IdSarcasm, has alleviated one key issue in the development of sarcasm detection. However, there has been no attempt to further preprocess the input before feeding it into the model. Pre-trained language models are generally trained on preprocessed corpora to ensure that the model is built on a quality dataset. Given the current condition of IdSarcasm, further preprocessing steps are necessary to ensure better quality. Specifically, the additional steps needed are handling HTML code, code-mixing, and colloquial writing, which consists of shortened forms, extended forms, spelling variations, and reduplication. Several scenarios are created to observe the effect of the additional preprocessing steps. Each additional step is also tested independently to observe its individual effect. We show that preprocessing is still relevant for data sourced from social media, and we recommend IndoNLU’s IndoBERT or a large multilingual model for sarcasm classification.
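To make the colloquial-writing steps concrete, the following is a minimal sketch of what such normalization could look like. It is not the authors' pipeline: the function name `normalize`, the tiny `SHORT_FORMS` lookup table, the regular expressions, and the lowercasing step are all assumptions made for illustration. It also assumes that "HTML code" refers to escaped entities such as `&amp;`, and it does not attempt code-mixing or spelling-variation handling.

```python
import html
import re

# Hypothetical, illustrative lookup for shortened forms; a real table would be far larger.
SHORT_FORMS = {"yg": "yang", "gk": "tidak", "bgt": "banget"}

# Extended form: the same letter repeated three or more times in a row.
ELONGATION_RE = re.compile(r"([a-z])\1{2,}")
# Reduplication written with a trailing "2", e.g. "anak2".
REDUP_RE = re.compile(r"\b([a-z]{2,})2\b")


def normalize(text: str) -> str:
    """Apply a few illustrative normalization steps to one social media post."""
    text = html.unescape(text)              # HTML entities: "&amp;" -> "&"
    text = text.lower()                     # assumption: casing is not kept as a cue
    text = ELONGATION_RE.sub(r"\1", text)   # extended form: "bangeeet" -> "banget"
    text = REDUP_RE.sub(r"\1-\1", text)     # reduplication: "anak2" -> "anak-anak"
    # Shortened forms are expanded token by token using the lookup table.
    tokens = [SHORT_FORMS.get(token, token) for token in text.split()]
    return " ".join(tokens)


if __name__ == "__main__":
    print(normalize("Bagus bangeeet &amp; anak2 yg hebat"))
```

Each step is a separate line so that it can be switched on or off individually, which mirrors the idea of testing every preprocessing step on its own. Collapsing repeated letters to a single letter is a simplification that would damage words with legitimate triple letters, so a real system would likely check the result against a dictionary.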
License
Copyright (c) 2025 Nicholaus Hendrik Jeremy

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.