Advancing Cross-Cultural Natural Language Processing with a Focus on Sundanese Language and Contextual Nuances

Anggi Muhammad Rifai; Ema Utami; Amali Amali; Muhamad Fatchan; Muhamad Ekhsan

Authors

Anggi Muhammad Rifai Universitas Pelita Bangsa https://orcid.org/0000-0002-6199-2029
Ema Utami Universitas Amikom Yogyakarta
Amali Universitas Pelita Bangsa
Muhamad Fatchan Universitas Pelita Bangsa
Muhamad Ekhsan Universitas Pelita Bangsa

Keywords:

Cross-Cultural Natural Language Processing, Seq2Seq Transformer, Sundanese Language Translation, Low-Resource Language Processing

Abstract

Sundanese, one of Indonesia’s regional languages, holds deep cultural value but remains underrepresented in computational linguistics. This research addresses that gap by developing a translation model between Sundanese and Indonesian using a transformer-based sequence-to-sequence (Seq2Seq) architecture. The model is fine-tuned on a parallel dataset of 3,616 sentence pairs to capture linguistic and contextual subtleties. The evaluation yields strong results: a Bilingual Evaluation Understudy (BLEU) score of 44.12, a Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-1 F1-score of 0.72, and a ROUGE-L F1-score of 0.71. These scores indicate high translation quality despite the limited data. Unlike earlier Sundanese translation studies that rely on Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), or standard transformer models, this research leverages the multilingual pretrained M2M100 Transformer, enabling transfer learning from high-resource languages to improve low-resource performance. These outcomes highlight the model’s potential for real-world applications, such as translation tools for education and cultural exchange. The research also emphasizes the importance of improving access to Sundanese texts and promoting the language’s digital presence to aid its preservation. Overall, the research advances Natural Language Processing (NLP) for low-resource languages and reinforces the importance of integrating regional languages like Sundanese into modern technology. Building on prior studies of Indonesian–Sundanese translation, its novelty lies in fine-tuning a multilingual Seq2Seq Transformer that captures both linguistic and contextual nuances, thereby setting a new benchmark for low-resource language processing.
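To make the reported ROUGE-1 and ROUGE-L F1-scores concrete, the following is a minimal sketch of how such scores can be computed for a single reference/candidate pair. It is an illustration only: the abstract does not state which ROUGE implementation, tokenization, or normalization the authors used, so the whitespace tokenization, lowercasing, function names, and example sentences below are all assumptions.

```python
from collections import Counter


def _f1(matches: int, ref_len: int, cand_len: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if matches == 0 or ref_len == 0 or cand_len == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap (assumed whitespace tokens)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    overlap = sum((Counter(ref) & Counter(cand)).values())
    return _f1(overlap, len(ref), len(cand))


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence via row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1: based on the longest common subsequence of tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    return _f1(lcs_length(ref, cand), len(ref), len(cand))


if __name__ == "__main__":
    # Hypothetical sentences, not taken from the paper's dataset.
    reference = "saya pergi ke pasar"
    candidate = "saya pergi ke sekolah"
    print(round(rouge1_f1(reference, candidate), 2))
    print(round(rouge_l_f1(reference, candidate), 2))
```

ROUGE-1 ignores word order, while ROUGE-L rewards tokens that appear in the same order as in the reference, which is why the two scores can diverge for reordered output. Corpus-level figures such as those in the abstract are typically averages over many sentence pairs, and library implementations may apply additional normalization not shown here.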

Author Biographies

Anggi Muhammad Rifai, Universitas Pelita Bangsa

Department of Informatics Engineering, Faculty of Engineering

Ema Utami, Universitas Amikom Yogyakarta

Doctoral Program in Informatics

Amali, Universitas Pelita Bangsa

Department of Informatics Engineering, Faculty of Engineering

Muhamad Fatchan, Universitas Pelita Bangsa

Department of Informatics Engineering, Faculty of Engineering

Muhamad Ekhsan, Universitas Pelita Bangsa

Department of Management, Faculty of Economics and Business
