Application of Ensemble Feature Selection and Feature Clustering to Text Document Classification

Mediana Aryuni

doi:10.21512/comtech.v4i1.2745

Authors

Mediana Aryuni, Bina Nusantara University

Keywords:

feature clustering, classification, ensemble feature selection

Abstract

An ensemble method builds several classifiers from the training data; the combined result is often more accurate than any single classifier, especially when the base classifiers are accurate and differ from one another. Meanwhile, feature clustering reduces the feature space by grouping similar words into one cluster. The objective of this research is to develop a text categorization system that employs feature clustering based on ensemble feature selection. The research methodology consists of preprocessing the text documents, generating feature subspaces using genetic algorithm-based iterative refinement, implementing base classifiers that apply feature clustering, and integrating the classification results of the base classifiers using both the static selection and majority voting methods. Experimental results show that classifying the dataset into 2 and 3 categories with the feature clustering method is 1.18 and 27.04 seconds faster, respectively, than classification that does not employ the feature selection method. Also, with the static selection method, ensemble feature selection with genetic algorithm-based iterative refinement produces 10% and 10.66% better accuracy than the single classifier when classifying the dataset into 2 and 3 categories, respectively. With the majority voting method in the same experiment, the same ensemble method produces 10% and 12% better accuracy than the single classifier, respectively.
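The two integration methods named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names are hypothetical, and static selection is read here as picking the one base classifier with the highest accuracy on held-out validation data, which is a common interpretation but an assumption with respect to this paper. Each base classifier is represented only by its list of predicted labels.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_vote(predictions):
    """Combine base classifiers by majority voting.

    predictions: one list of labels per base classifier, all the same length.
    Ties go to the label that appears first among the classifiers.
    """
    combined = []
    for labels in zip(*predictions):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


def static_selection(val_predictions, val_labels):
    """Return the index of the base classifier with the best validation accuracy.

    Assumed reading of "static selection": the chosen classifier is then used
    for every test document. Ties go to the lowest index.
    """
    scores = [accuracy(p, val_labels) for p in val_predictions]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Hypothetical validation labels and base-classifier outputs.
    val_labels = ["a", "b", "a", "b"]
    preds = [
        ["a", "b", "b", "b"],
        ["a", "a", "a", "b"],
        ["b", "b", "a", "a"],
    ]
    best = static_selection(preds, val_labels)
    print("static selection picks classifier", best)
    print("majority vote:", majority_vote(preds))
```

In this toy data no single classifier is right on every document, yet their errors fall on different documents, so the vote recovers all four labels. This is the situation the abstract describes when it says ensembles help most if the base classifiers are accurate and differ from one another.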
