Advancing Cross-Cultural Natural Language Processing with a Focus on Sundanese Language and Contextual Nuances
DOI:
https://doi.org/10.21512/commit.v20i1.13912
Keywords:
Cross-Cultural Natural Language Processing, Seq2Seq Transformer, Sundanese Language Translation, Low-Resource Language Processing
Abstract
Sundanese, one of Indonesia’s regional languages, holds deep cultural value but remains underrepresented in computational linguistics. The research addresses this gap by developing a translation model between Sundanese and Indonesian using a transformer-based sequence-to-sequence (Seq2Seq) architecture. The model is fine-tuned on a parallel dataset of 3,616 sentence pairs to capture linguistic and contextual subtleties. The evaluation yields a Bilingual Evaluation Understudy (BLEU) score of 44.12, a Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-1 F1-score of 0.72, and a ROUGE-L F1-score of 0.71. These scores indicate high translation quality despite the limited data. Unlike earlier Sundanese translation studies that rely on Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), or standard transformer models, this research fine-tunes the multilingual pretrained M2M100 Transformer, enabling transfer learning from high-resource languages to improve low-resource performance. The outcomes highlight the model’s potential for real-world applications, such as translation tools for education and cultural exchange. The research also emphasizes the importance of improving access to Sundanese texts and promoting the language's digital presence to aid its preservation. Overall, the research advances Natural Language Processing (NLP) for low-resource languages and reinforces the importance of integrating regional languages like Sundanese into modern technology. Building upon prior studies on Indonesian–Sundanese translation, its novelty lies in fine-tuning a multilingual Seq2Seq Transformer that captures both linguistic and contextual nuances, thereby setting a new benchmark for low-resource language processing.
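To make the reported ROUGE-1 and ROUGE-L F1-scores concrete, the following is a minimal sketch of how such scores can be computed for a single sentence pair. It is not the authors' evaluation code: it assumes lowercase whitespace tokenization with no stemming, and the function names and the Indonesian example sentences are hypothetical. Scores from a standard ROUGE package may differ because of tokenization and aggregation choices.

```python
from collections import Counter


def _f1(matches: int, ref_len: int, cand_len: int) -> float:
    """Harmonic mean of precision and recall for a given match count."""
    if matches == 0 or ref_len == 0 or cand_len == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap, ignoring word order."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(ref) & Counter(cand)).values())
    return _f1(overlap, len(ref), len(cand))


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1: based on the longest common subsequence, so order matters."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    return _f1(lcs_length(ref, cand), len(ref), len(cand))


# Hypothetical reference translation and model output (same words, reordered).
reference = "saya pergi ke pasar pagi ini"
candidate = "pagi ini saya pergi ke pasar"

print(round(rouge_1_f1(reference, candidate), 4))  # 1.0
print(round(rouge_l_f1(reference, candidate), 4))  # 0.6667
```

The reordered candidate keeps every reference word, so ROUGE-1 is 1.0, while ROUGE-L drops to about 0.67 because only four of the six tokens appear in the same order. Reporting both, as the abstract does, separates lexical coverage from word-order fidelity.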
License
Copyright (c) 2026 Anggi Muhammad Rifai, Ema Utami, Amali, Muhamad Fatchan, Muhamad Ekhsan

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
a. Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License - Share Alike that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
b. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
c. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.
USER RIGHTS
All articles published Open Access will be immediately and permanently free for everyone to read and download. We are continuously working with our author communities to select the best choice of license options, currently being defined for this journal as follows: Creative Commons Attribution-Share Alike (CC BY-SA)


















