Balinese Language Classification on Social Media using Multinomial Naive Bayes Method with TF-IDF

Putu Widyantara Artanta Wibawa; Cokorda Pramartha; I Gusti Ngurah Anom Cahyadi Putra; Luh Gede Astuti

doi:10.21512/comtech.v17i1.14132

Authors

Putu Widyantara Artanta Wibawa, Informatics Study Program, Faculty of Mathematics and Natural Sciences, Udayana University, Bali, Indonesia 80361, https://orcid.org/0009-0001-7471-6680
Cokorda Pramartha, Informatics Study Program, Faculty of Mathematics and Natural Sciences, Udayana University, Bali, Indonesia 80361, https://orcid.org/0000-0002-2835-3989
I Gusti Ngurah Anom Cahyadi Putra, Informatics Study Program, Faculty of Mathematics and Natural Sciences, Udayana University, Bali, Indonesia 80361, https://orcid.org/0000-0001-8408-7091
Luh Gede Astuti, Informatics Study Program, Faculty of Mathematics and Natural Sciences, Udayana University, Bali, Indonesia 80361, https://orcid.org/0009-0006-9571-199X

Keywords:

classification, Balinese language, TF-IDF, Chi-square, SMOTE, Multinomial Naive Bayes

Abstract

Balinese is a local language widely spoken by Balinese people, including on social media platforms. However, the nuances of its politeness levels are often lost in informal digital communication, and computational models that automatically classify these levels are largely lacking, as is common for low-resource languages such as Balinese. The primary objective of this study is to evaluate the performance of the Multinomial Naive Bayes method combined with Term Frequency–Inverse Document Frequency (TF-IDF) feature extraction, Chi-square feature selection, and the Synthetic Minority Oversampling Technique (SMOTE) in classifying Balinese language levels. The dataset consists of 1,314 annotated social media posts and comments, primarily sourced from Instagram. A Balinese language expert performs the annotation, categorizing the texts into six levels that represent varying degrees of politeness and formality. These levels are alus singgih (polite, used for respecting others), alus sor (polite, used for self-humbling), alus mider (polite, used for both respecting others and self-humbling), alus madia (an intermediate level of politeness), basa andap (casual, commonly used in everyday life), and basa kasar (impolite, often used during arguments or toward animals). The experimental results show that the model achieves 96.53% accuracy on the training data and 61.45% accuracy on the test data. In addition, hyperparameter tuning reveals that the Multinomial Naive Bayes model with 2,720 selected features and SMOTE oversampling achieves 91.78% accuracy, significantly outperforming the baseline model without feature selection or oversampling, which achieves only 64.93% accuracy.
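
To make the core of the method concrete, the following is a minimal sketch of TF-IDF weighting followed by Multinomial Naive Bayes, written in plain Python. It is not the authors' implementation. The function names, the smoothed IDF formula, the Laplace smoothing value, and the toy tokens and labels (`level_a`, `level_b`) are all assumptions made for illustration. Chi-square feature selection and SMOTE oversampling are deliberately left out to keep the example short.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Smoothed IDF per term: ln((1 + N) / (1 + df)) + 1 (an assumed, common form)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf(doc, idf):
    """TF-IDF weights for one tokenized document; out-of-vocabulary terms are dropped."""
    term_freq = Counter(t for t in doc if t in idf)
    return {t: count * idf[t] for t, count in term_freq.items()}


def train_mnb(vectors, labels, vocab, alpha=1.0):
    """Fit Multinomial Naive Bayes on TF-IDF weights with additive smoothing alpha."""
    class_counts = Counter(labels)
    weight_sums = {c: defaultdict(float) for c in class_counts}
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            weight_sums[label][term] += weight
    log_prior = {c: math.log(k / len(labels)) for c, k in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(weight_sums[c].values()) + alpha * len(vocab)
        log_like[c] = {
            t: math.log((weight_sums[c].get(t, 0.0) + alpha) / total) for t in vocab
        }
    return log_prior, log_like


def predict(vec, log_prior, log_like):
    """Return the class with the highest log-posterior score."""
    scores = {
        c: log_prior[c] + sum(w * log_like[c][t] for t, w in vec.items())
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy, made-up tokens and labels (NOT data from the study).
train_docs = [
    ["alpha", "beta", "alpha"],
    ["alpha", "gamma"],
    ["delta", "epsilon"],
    ["delta", "delta", "gamma"],
]
train_labels = ["level_a", "level_a", "level_b", "level_b"]

idf = fit_idf(train_docs)
train_vectors = [tfidf(doc, idf) for doc in train_docs]
log_prior, log_like = train_mnb(train_vectors, train_labels, vocab=idf.keys())

pred_a = predict(tfidf(["alpha", "beta"], idf), log_prior, log_like)
pred_b = predict(tfidf(["delta", "epsilon"], idf), log_prior, log_like)
print(pred_a, pred_b)
```

In a fuller pipeline under the same assumptions, feature selection would reduce `idf` to the highest-scoring terms before training, and oversampling would be applied to the training vectors only, so that synthetic samples never reach the test split.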
