Fuzzy C-Means in Content-Based Document Clustering for Grouping General Websites Based on Their Main Page Contents
DOI:
https://doi.org/10.21512/comtech.v14i2.9732
Keywords:
Fuzzy C-Means, content-based document clustering, general websites
Abstract
The research aimed to apply Fuzzy C-Means clustering in content-based document clustering to group general websites based on their main page contents. The data were a ranking table of the most visited websites in Indonesia, taken from https://dataforseo.com/top-1000-websites/ on September 24th, 2022. Two cases were run with Fuzzy C-Means clustering, differing only in the maximum iteration parameter, which was set to 100 and 200. The two settings produce different numbers of website names in the first and second clusters only; in all other clusters, the counts are the same. The clustering results are validated using the silhouette coefficient, with values of 0.977783879 and 0.977788457 for case no. 1 and case no. 2, respectively. Because both values are close to 1, Fuzzy C-Means clustering in content-based document clustering shows excellent performance when applied to grouping general websites based on their content. The results therefore suggest that the approach can be considered for other content-based clustering cases in the future.
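The procedure summarized in the abstract (Fuzzy C-Means run with two maximum iteration settings, then validated with the silhouette coefficient) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fuzzifier m = 2, the tolerance, the random seed, and the tiny two-dimensional `docs` vectors (standing in for numeric features extracted from main page contents) are all assumptions made for illustration.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Return (centers, memberships) for a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(max_iter):
        # Update each center as a membership-weighted mean (weights are u^m).
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            sw = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / sw for d in range(dim)]
            )
        # Update memberships from the distances to the new centers.
        shift = 0.0
        for i in range(n):
            dist = [max(math.dist(points[i], ctr), 1e-12) for ctr in centers]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dist]
            total = sum(inv)
            new_row = [v / total for v in inv]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        if shift < tol:  # stop early once memberships have stabilised
            break
    return centers, u


def silhouette_coefficient(points, labels):
    """Mean silhouette width over all points, using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Hypothetical toy vectors: two well-separated groups of "documents".
docs = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [5.0, 5.1], [5.1, 5.0], [4.95, 5.05]]

results = {}
for max_iter in (100, 200):  # the two maximum iteration settings from the abstract
    centers, u = fuzzy_c_means(docs, c=2, max_iter=max_iter)
    labels = [row.index(max(row)) for row in u]  # harden the fuzzy memberships
    results[max_iter] = (labels, silhouette_coefficient(docs, labels))
    print(max_iter, labels, results[max_iter][1])
```

On data this cleanly separated, the algorithm stabilises long before either iteration limit is reached, so both settings give the same grouping. A silhouette coefficient near 1 indicates compact, well-separated clusters.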
License
Copyright (c) 2023 Sri Probo Aditiyo, Eni Sumarminingsih, Rahma Fitriani
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
a. Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License - Share Alike that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
b. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
c. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.
USER RIGHTS
All articles published Open Access will be immediately and permanently free for everyone to read and download. We are continuously working with our author communities to select the best choice of license options, currently being defined for this journal as follows: