Application of Ensemble Feature Selection and Feature Clustering to Text Document Classification
Keywords: feature clustering, classification, ensemble feature selection
An ensemble method builds several classifiers from the training data and combines them; the combination is often more accurate than any single classifier, especially when the base classifiers are individually accurate and differ from one another. Meanwhile, feature clustering reduces the feature space by joining similar words into one cluster. The objective of this research is to develop a text categorization system that employs feature clustering based on ensemble feature selection. The research methodology consists of preprocessing the text documents, generating feature subspaces using genetic algorithm-based iterative refinement, implementing base classifiers that apply feature clustering, and integrating the classification results of the base classifiers using both the static selection and majority voting methods. Experimental results show that classifying the dataset into 2 and 3 categories using the feature clustering method is 1.18 and 27.04 seconds faster, respectively, than classification that does not employ the feature selection method. With the static selection method, ensemble feature selection with genetic algorithm-based iterative refinement produces 10% and 10.66% better accuracy than the single classifier when classifying the dataset into 2 and 3 categories, respectively. With the majority voting method in the same experiment, the same ensemble method produces 10% and 12% better accuracy than the single classifier, respectively.
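The two integration methods named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy labels, and the validation accuracies are hypothetical, and static selection is assumed here to mean choosing the single base classifier with the best validation accuracy and using its predictions for every document.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists by taking the most frequent label per document.

    predictions: one list of labels per base classifier, all of equal length.
    Ties go to the label that appears first among the classifiers.
    """
    combined = []
    for labels in zip(*predictions):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


def static_selection(predictions, validation_accuracy):
    """Return the predictions of the base classifier with the highest validation accuracy.

    Assumption: one classifier is selected once and used for all documents.
    """
    best = max(range(len(predictions)), key=lambda i: validation_accuracy[i])
    return predictions[best]


# Hypothetical outputs of three base classifiers, each trained on a different
# feature subspace, for four test documents.
preds = [
    ["sport", "politics", "sport", "tech"],
    ["sport", "sport", "sport", "tech"],
    ["politics", "politics", "sport", "sport"],
]
# Hypothetical validation accuracies of the three base classifiers.
val_acc = [0.70, 0.85, 0.75]

voted = majority_vote(preds)
selected = static_selection(preds, val_acc)
print(voted)
print(selected)
```

Majority voting uses every base classifier for every document, whereas static selection commits to one classifier up front, so the two can disagree on individual documents, as they do for the second document in this toy data.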