Advancing Cross-Cultural Natural Language Processing with a Focus on Sundanese Language and Contextual Nuances

Authors

  • Anggi Muhammad Rifai Universitas Pelita Bangsa https://orcid.org/0000-0002-6199-2029
  • Ema Utami Universitas Amikom Yogyakarta
  • Amali Universitas Pelita Bangsa
  • Muhamad Fatchan Universitas Pelita Bangsa
  • Muhamad Ekhsan Universitas Pelita Bangsa

DOI:

https://doi.org/10.21512/commit.v20i1.13912

Keywords:

Cross-Cultural Natural Language Processing, Seq2Seq Transformer, Sundanese Language Translation, Low-Resource Language Processing

Abstract

The Sundanese language, one of Indonesia’s regional languages, holds deep cultural value but remains underrepresented in computational linguistics. The research addresses this gap by developing a translation model between Sundanese and Indonesian using a transformer-based sequence-to-sequence (Seq2Seq) architecture. The model is fine-tuned on a parallel dataset of 3,616 sentence pairs to capture linguistic and contextual subtleties. The evaluation yields a Bilingual Evaluation Understudy (BLEU) score of 44.12, a Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-1 F1-score of 0.72, and a ROUGE-L F1-score of 0.71. These results indicate high translation quality despite limited data. Unlike earlier Sundanese translation studies that rely on Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), or standard transformer models, this research fine-tunes the multilingual pretrained M2M100 Transformer, enabling transfer learning from high-resource languages to improve low-resource performance. These outcomes highlight the model’s potential for real-world applications, such as translation tools for education and cultural exchange. The research emphasizes the importance of improving access to Sundanese texts and promoting the language’s digital presence to aid in its preservation. Overall, the research not only advances Natural Language Processing (NLP) for low-resource languages but also reinforces the importance of integrating regional languages like Sundanese into modern technology. Building upon prior studies on Indonesian–Sundanese translation, the novelty of the research lies in fine-tuning a multilingual Seq2Seq Transformer that captures both linguistic and contextual nuances, thereby setting a new benchmark for low-resource language processing.
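For readers unfamiliar with the ROUGE figures reported above, the following is a minimal sketch of how sentence-level ROUGE-1 and ROUGE-L F1-scores can be computed. It is an illustration under stated assumptions, not the authors' evaluation code: the abstract does not say which toolkit, tokenization, or aggregation was used, so whitespace tokenization, lowercasing, and the function names `rouge_1_f1`, `lcs_length`, and `rouge_l_f1` are all assumptions made here.

```python
from collections import Counter
from typing import Sequence


def _f1(matches: int, ref_len: int, cand_len: int) -> float:
    """Harmonic mean of precision (matches/cand_len) and recall (matches/ref_len)."""
    if matches == 0 or ref_len == 0 or cand_len == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 (assumes lowercased, whitespace-split tokens)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # Clipped overlap: a token counts only as often as it occurs in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    return _f1(overlap, len(ref_tokens), len(cand_tokens))


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """F1 based on the longest common subsequence, so word order matters."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    lcs = lcs_length(ref_tokens, cand_tokens)
    return _f1(lcs, len(ref_tokens), len(cand_tokens))


if __name__ == "__main__":
    # Hypothetical Indonesian sentences, used only to exercise the functions.
    ref = "saya pergi ke pasar hari ini"
    hyp = "saya pergi ke pasar kemarin"
    print(round(rouge_1_f1(ref, hyp), 3), round(rouge_l_f1(ref, hyp), 3))
```

ROUGE-1 ignores word order, while ROUGE-L rewards only tokens that appear in the same relative order as in the reference, which is why the two scores can differ for the same sentence pair.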

 

Author Biographies

Anggi Muhammad Rifai, Universitas Pelita Bangsa

Department of Informatics Engineering, Faculty of Engineering

Ema Utami, Universitas Amikom Yogyakarta

Doctoral Program in Informatics

Amali, Universitas Pelita Bangsa

Department of Informatics Engineering, Faculty of Engineering

Muhamad Fatchan, Universitas Pelita Bangsa

Department of Informatics Engineering, Faculty of Engineering

Muhamad Ekhsan, Universitas Pelita Bangsa

Department of Management, Faculty of Economics and Business

Published

2026-04-09

How to Cite

[1]
A. M. Rifai, E. Utami, A. Amali, M. Fatchan, and M. Ekhsan, “Advancing Cross-Cultural Natural Language Processing with a Focus on Sundanese Language and Contextual Nuances”, CommIT (Communication and Information Technology) Journal, vol. 20, no. 1, pp. 155–167, Apr. 2026.