Balinese Language Classification on Social Media using Multinomial Naive Bayes Method with TF-IDF

Authors

  • Putu Widyantara Artanta Wibawa Informatics Study Program, Faculty of Mathematics and Natural Sciences, Udayana University, Bali, Indonesia 80361 https://orcid.org/0009-0001-7471-6680
  • Cokorda Pramartha Informatics Study Program, Faculty of Mathematics and Natural Sciences, Udayana University, Bali, Indonesia 80361 https://orcid.org/0000-0002-2835-3989
  • I Gusti Ngurah Anom Cahyadi Putra Informatics Study Program, Faculty of Mathematics and Natural Sciences, Udayana University, Bali, Indonesia 80361 https://orcid.org/0000-0001-8408-7091
  • Luh Gede Astuti Informatics Study Program, Faculty of Mathematics and Natural Sciences, Udayana University, Bali, Indonesia 80361 https://orcid.org/0009-0006-9571-199X

DOI:

https://doi.org/10.21512/comtech.v17i1.14132

Keywords:

classification, Balinese language, TF-IDF, Chi-square, SMOTE, Multinomial Naive Bayes

Abstract

The Balinese language is a local language that is widely used by Balinese people, including on social media. It has several speech levels that express varying degrees of politeness and formality. However, the nuances of these politeness levels are often lost in informal digital communication, and there is a significant lack of computational models to classify them automatically, especially for low-resource languages such as Balinese. The primary objective of this study is to evaluate the performance of the Multinomial Naive Bayes method combined with Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction, Chi-square feature selection, and the Synthetic Minority Oversampling Technique (SMOTE) in classifying Balinese language levels. The dataset consists of 1,314 annotated social media posts and comments, primarily sourced from Instagram. A Balinese language expert annotated the texts, assigning each to one of six levels that represent varying degrees of politeness and formality. These levels are alus singgih (polite, used for respecting others), alus sor (polite, used for self-humbling), alus mider (polite, used for both respecting others and self-humbling), alus madia (an intermediate level of politeness), basa andap (casual, commonly used in everyday life), and basa kasar (impolite, often used during arguments or toward animals). The experimental results showed that the model achieved an accuracy of 96.53% on the training data and 61.45% on the test data. Additionally, hyperparameter tuning revealed that the Multinomial Naive Bayes model with 2,720 selected features and SMOTE oversampling achieved an accuracy of 91.78%, substantially outperforming the baseline model without feature selection and oversampling, which obtained only 64.93% accuracy.
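The core of the approach described in the abstract, TF-IDF weighting followed by a Multinomial Naive Bayes classifier, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names, the placeholder tokens (`p1`, `q2`, and so on), the smoothed IDF formula, and the Laplace smoothing value `alpha = 1.0` are all assumptions made for illustration, and the Chi-square feature selection and SMOTE oversampling steps are omitted for brevity. Only the two class labels are taken from the abstract.

```python
import math
from collections import Counter, defaultdict

def fit_idf(docs):
    """Compute a smoothed IDF weight for every term in the tokenized training docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}

def tfidf(doc, idf):
    """Map a tokenized document to {term: tf * idf}; unseen terms are dropped."""
    tf = Counter(t for t in doc if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}

def train_mnb(vectors, labels, vocab, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    classes = sorted(set(labels))
    class_counts = Counter(labels)
    log_prior = {c: math.log(class_counts[c] / len(labels)) for c in classes}
    # Sum the TF-IDF weight of each term over all documents of each class.
    weights = {c: defaultdict(float) for c in classes}
    for vec, label in zip(vectors, labels):
        for term, w in vec.items():
            weights[label][term] += w
    log_like = {}
    for c in classes:
        total = sum(weights[c].values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((weights[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_like

def predict(vec, log_prior, log_like):
    """Return the class with the highest posterior log score for one vector."""
    def score(c):
        return log_prior[c] + sum(w * log_like[c][t] for t, w in vec.items())
    return max(log_prior, key=score)

# Hypothetical toy data: placeholder tokens, not real Balinese text.
train_docs = [
    ["p1", "p2", "p3"],
    ["p1", "p3", "p4"],
    ["q1", "q2", "q3"],
    ["q2", "q3", "q4"],
]
train_labels = ["alus singgih", "alus singgih", "basa kasar", "basa kasar"]

idf = fit_idf(train_docs)
vectors = [tfidf(doc, idf) for doc in train_docs]
log_prior, log_like = train_mnb(vectors, train_labels, idf)

new_doc = ["p1", "p3", "unseen"]
print(predict(tfidf(new_doc, idf), log_prior, log_like))
```

In a fuller pipeline, feature selection would sit between the TF-IDF step and training, and oversampling would be applied to the training vectors only, so that synthetic samples never leak into the test data.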

Published

2026-01-30

How to Cite

Artanta Wibawa, P. W., Pramartha, C., I Gusti Ngurah Anom Cahyadi Putra, & Luh Gede Astuti. (2026). Balinese Language Classification on Social Media using Multinomial Naive Bayes Method with TF-IDF. ComTech: Computer, Mathematics and Engineering Applications, 17(1). https://doi.org/10.21512/comtech.v17i1.14132