Balinese Language Classification on Social Media using Multinomial Naive Bayes Method with TF-IDF
DOI:
https://doi.org/10.21512/comtech.v17i1.14132
Keywords:
classification, Balinese language, TF-IDF, Chi-square, SMOTE, Multinomial Naive Bayes
Abstract
Balinese is a local language that is widely used and spoken by Balinese people, including on social media. The language distinguishes several speech levels that encode politeness, but these nuances are often lost in informal digital communication, and there is a significant lack of computational models that classify them automatically, especially for low-resource languages like Balinese. The primary objective of this study is to evaluate the performance of the Multinomial Naive Bayes method combined with Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction, Chi-square feature selection, and the Synthetic Minority Oversampling Technique (SMOTE) in classifying Balinese language levels. The dataset consists of 1,314 annotated social media posts and comments, primarily sourced from Instagram. A Balinese language expert annotated the texts into six levels that represent varying degrees of politeness and formality. These levels are alus singgih (polite, used for respecting others), alus sor (polite, used for self-humbling), alus mider (polite, used for both respecting others and self-humbling), alus madia (an intermediate level of politeness), basa andap (casual, commonly used in everyday life), and basa kasar (impolite, often used during arguments or toward animals). In the experiments, the model achieved an accuracy of 96.53% on the training data and 61.45% on the test data. Additionally, hyperparameter tuning showed that the Multinomial Naive Bayes model with 2,720 selected features and SMOTE oversampling achieved an accuracy of 91.78%, significantly outperforming the baseline model without feature selection and oversampling, which obtained only 64.93% accuracy.
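The pipeline named in the abstract (TF-IDF, Chi-square selection, SMOTE, Multinomial Naive Bayes) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the texts are placeholder tokens rather than Balinese data, only two of the six level names are used as labels, `k=5` features stands in for the 2,720 reported in the study, and `smote_like_oversample` is a hypothetical, simplified hand-written interpolation rather than a library SMOTE routine. It assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB


def smote_like_oversample(X, y, k=2, seed=0):
    """Balance classes by interpolating between same-class neighbours.

    Simplified SMOTE-style sketch (hypothetical helper, not a library API).
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for label, count in zip(classes, counts):
        if count == target or count < 2:
            continue  # majority class, or too few samples to interpolate
        rows = X[y == label]
        # Pairwise distances within the class; a sample is never its own neighbour.
        dist = np.linalg.norm(rows[:, None, :] - rows[None, :, :], axis=2)
        np.fill_diagonal(dist, np.inf)
        neighbours = np.argsort(dist, axis=1)[:, : min(k, count - 1)]
        base = rng.integers(0, count, size=target - count)
        col = rng.integers(0, neighbours.shape[1], size=base.size)
        picked = neighbours[base, col]
        gap = rng.random((base.size, 1))
        # Each synthetic point lies on the segment between a sample and a neighbour.
        X_parts.append(rows[base] + gap * (rows[picked] - rows[base]))
        y_parts.append(np.full(base.size, label))
    return np.vstack(X_parts), np.concatenate(y_parts)


# Placeholder tokens only: these are NOT real Balinese words or real data.
train_texts = [
    "a1 a2 a3", "a1 a3 a4", "a2 a4 a1",
    "a3 a1 shared", "a2 a3 shared", "a4 a2 a1",
    "s1 s2 shared", "s2 s3 s1",
]
y_train = np.array(["basa andap"] * 6 + ["alus singgih"] * 2)
test_texts = ["s1 s3 s2", "a1 a2 a4"]

# 1) TF-IDF feature extraction, fitted on the training texts only.
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(train_texts)

# 2) Chi-square feature selection (k=5 is arbitrary for this toy vocabulary).
selector = SelectKBest(chi2, k=5)
X_sel = selector.fit_transform(X_tfidf, y_train).toarray()

# 3) Oversample the minority class in the training set only.
X_bal, y_bal = smote_like_oversample(X_sel, y_train)

# 4) Multinomial Naive Bayes on the balanced, non-negative features.
model = MultinomialNB()
model.fit(X_bal, y_bal)

X_test = selector.transform(vectorizer.transform(test_texts)).toarray()
pred = model.predict(X_test)
print(list(zip(test_texts, pred)))
```

Oversampling only the training portion keeps synthetic samples out of the evaluation data. In practice, a maintained SMOTE implementation and cross-validated tuning of the number of selected features would replace the toy helper and the fixed `k`.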
License
Copyright (c) 2026 Putu Widyantara Artanta Wibawa, Cokorda Pramartha, I Gusti Ngurah Anom Cahyadi Putra, Luh Gede Astuti

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
a. Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License - Share Alike that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
b. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
c. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.