Advancing Cross-Cultural Natural Language Processing with a Focus on Sundanese Language and Contextual Nuances
Keywords:
Cross-Cultural Natural Language Processing, Seq2Seq Transformer, Sundanese Language Translation, Low-Resource Language Processing
Abstract
Sundanese, one of Indonesia’s regional languages, holds deep cultural value but remains underrepresented in computational linguistics. This research addresses that gap by developing a translation model between Sundanese and Indonesian using a transformer-based sequence-to-sequence (Seq2Seq) architecture. The model is fine-tuned on a parallel dataset of 3,616 sentence pairs to capture linguistic and contextual subtleties. Evaluation yields a Bilingual Evaluation Understudy (BLEU) score of 44.12, a Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-1 F1-score of 0.72, and a ROUGE-L F1-score of 0.71, indicating high translation quality despite the limited data. Unlike earlier Sundanese translation studies that rely on Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), or standard transformer models, this research fine-tunes the multilingual pretrained M2M100 Transformer, enabling transfer learning from high-resource languages to improve low-resource performance. These outcomes highlight the model’s potential for real-world applications, such as translation tools for education and cultural exchange. The research also emphasizes the importance of improving access to Sundanese texts and promoting the language’s digital presence to aid its preservation. Overall, the research advances Natural Language Processing (NLP) for low-resource languages and reinforces the importance of integrating regional languages like Sundanese into modern technology. Building on prior studies of Indonesian–Sundanese translation, its novelty lies in fine-tuning a multilingual Seq2Seq Transformer that captures both linguistic and contextual nuances, thereby setting a new benchmark for low-resource language processing.
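To make the reported ROUGE figures easier to interpret, the following is a minimal sketch of how ROUGE-1 and ROUGE-L F1-scores can be computed for a single sentence pair. It assumes lowercased whitespace tokenization and one reference per candidate; the function names and the example sentences are illustrative only and are not taken from the study, whose exact evaluation tooling the abstract does not specify.

```python
from collections import Counter


def _f1(overlap: int, cand_len: int, ref_len: int) -> float:
    """Harmonic mean of precision and recall derived from an overlap count."""
    if overlap == 0 or cand_len == 0 or ref_len == 0:
        return 0.0
    precision = overlap / cand_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    cand = candidate.lower().split()  # assumption: whitespace tokenization
    ref = reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return _f1(overlap, len(cand), len(ref))


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rougeL_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: based on the longest common subsequence, so word order matters."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return _f1(lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    # Made-up Indonesian sentences, used only to exercise the functions.
    reference = "saya pergi ke pasar"
    print(rouge1_f1("saya pergi ke sekolah", reference))
    print(rougeL_f1("ke pasar saya pergi", reference))
```

ROUGE-1 ignores word order, whereas ROUGE-L rewards tokens that appear in the same order as the reference, so similar values for the two scores suggest that the overlapping words largely appear in the reference order.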
License
Copyright (c) 2026 Anggi Muhammad Rifai, Ema Utami, Amali, Muhamad Fatchan, Muhamad Ekhsan

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.