Question Categorization using Lexical Feature in

Christian Eka Saputra, Derwin Suhartono, Rini Wongso


This research aimed to categorize questions posted by users on an online media portal. N-gram and Bag of Concept (BOC) were used as the lexical features. These were combined with Naïve Bayes, Support Vector Machine (SVM), and J48 Tree as the classification methods. The experiments used data from the online media portal. The best accuracy was 96.5%, obtained by combining the Bigram Trigram Keyword (BTK) features with J48 Tree as the classifier. Meanwhile, the combination of Unigram Bigram (UB) and Unigram Bigram Keyword (UBK) features with attribute selection in WEKA achieved an accuracy of 95.94% with SVM as the classifier.
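As a rough illustration of the N-gram lexical features named above, the following minimal sketch counts word n-grams of chosen orders. It is not the authors' implementation: the function names `ngrams` and `lexical_features`, the whitespace tokenization, the lowercasing, and the sample question are all assumptions made for the example, and the keyword and Bag of Concept features are not covered.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def lexical_features(text, orders=(1, 2)):
    """Count the n-grams of the requested orders in one question.

    orders=(1, 2) gives a unigram + bigram feature set;
    orders=(2, 3) gives a bigram + trigram feature set.
    """
    # Assumption: lowercase and split on whitespace; a real pipeline
    # would likely add stemming and stop-word handling.
    tokens = text.lower().split()
    counts = Counter()
    for n in orders:
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical sample question, not taken from the paper's data set.
question = "how do I reset my password"
ub_features = lexical_features(question, orders=(1, 2))
bt_features = lexical_features(question, orders=(2, 3))
```

Feature counts of this kind would then be passed to a classifier such as Naïve Bayes, SVM, or a decision tree; that step is left out here because the abstract does not describe the configuration used.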


text classification, Bag of Concept, Naïve Bayes, Support Vector Machine (SVM), J48 Tree








This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.