Fuzzy C-Means in Content-Based Document Clustering for Grouping General Websites Based on Their Main Page Contents

Sri Probo Aditiyo; Eni Sumarminingsih; Rahma Fitriani

doi:10.21512/comtech.v14i2.9732

Authors

Sri Probo Aditiyo, Brawijaya University
Eni Sumarminingsih, Brawijaya University
Rahma Fitriani, Brawijaya University

Keywords:

Fuzzy C-Means, content-based document clustering, general websites

Abstract

The research aimed to use Fuzzy C-Means clustering in content-based document clustering to group general websites based on their content. The data were a ranking table of the most visited websites in Indonesia, taken from https://dataforseo.com/top-1000-websites/ on September 24, 2022. Two cases were run with Fuzzy C-Means clustering, differing in the maximum iteration parameter, which was set to 100 and 200. The two settings produced different numbers of website names in the first and second clusters only; in all other clusters, the counts were the same. The clustering results were validated using the silhouette coefficient, with values of 0.977783879 for case no. 1 and 0.977788457 for case no. 2. These values indicate that Fuzzy C-Means clustering performs very well in content-based document clustering when it is applied to group general websites based on their content. The approach can therefore be considered for other content-based clustering cases in the future.
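
The procedure summarized in the abstract (Fuzzy C-Means run with a maximum of 100 and 200 iterations, then validated with the silhouette coefficient) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy two-dimensional points, the fuzzifier `m = 2`, the tolerance, the random seed, the number of clusters, and the choice to compute the silhouette on hard labels taken from the largest membership are all assumptions made for illustration; the abstract does not state the document features or these settings.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Basic Fuzzy C-Means; returns (centers, memberships)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(max_iter):
        # Centers are membership-weighted means with weights u^m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            wsum = sum(w)
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / wsum
                            for d in range(dim)])
        # Membership update from inverse distances to every center.
        new_u = []
        for p in points:
            dist = [max(math.dist(p, ctr), 1e-12) for ctr in centers]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dist]
            total = sum(inv)
            new_u.append([v / total for v in inv])
        shift = max(abs(new_u[i][j] - u[i][j])
                    for i in range(n) for j in range(c))
        u = new_u
        if shift < tol:  # converged before reaching max_iter
            break
    return centers, u


def silhouette_coefficient(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            scores.append(0.0)  # convention for singletons / one cluster
            continue
        # a: mean distance to the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (assumption): two well-separated groups of 2-D points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]

for max_iter in (100, 200):
    centers, u = fuzzy_c_means(points, c=2, max_iter=max_iter)
    # Hard label = cluster with the largest membership degree.
    labels = [row.index(max(row)) for row in u]
    print(max_iter, labels, silhouette_coefficient(points, labels))
```

On data this well separated, the loop is likely to stop on the tolerance check long before either iteration cap, so the two settings give the same partition. A difference between caps, as reported in the abstract, would be expected only when the run has not yet converged at the lower cap.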

Author Biographies

Sri Probo Aditiyo, Brawijaya University

Department of Statistics, Mathematics and Natural Science Faculty

Eni Sumarminingsih, Brawijaya University

Department of Statistics, Mathematics and Natural Science Faculty

Rahma Fitriani, Brawijaya University

Department of Statistics, Mathematics and Natural Science Faculty
