Machine Learning-Based Malicious Website Detection Using Logistic Regression Algorithm

Puan Bening Pastika; Alamsyah Alamsyah

doi:10.21512/emacsjournal.v6i3.11844

Authors

Puan Bening Pastika, Universitas Negeri Semarang
Alamsyah Alamsyah, Universitas Negeri Semarang


Keywords:

Malicious Website, Machine Learning, Logistic Regression

Abstract

Cybercrime is a growing threat to anyone browsing the internet. Cybercriminals exploit vulnerabilities in websites by inserting malicious software that gives them access to the systems of web service users. Because this harms users, detecting malicious websites is necessary to reduce cybercrime. This research aims to improve the effectiveness of malicious website detection by applying the Logistic Regression algorithm. Logistic Regression was selected for its ability to perform binary classification, which is what distinguishing benign from potentially malicious websites requires. The research emphasizes an optimized preprocessing stage in which data cleaning, dataset balancing, and feature mapping are refined to improve detection accuracy. Hybrid sampling addresses class imbalance so that the model is trained on representative data from both classes. In the experiments, the developed model recorded an accuracy of 92.60% without cross-validation, which increased to 92.71% with 5-fold cross-validation. The novelty of this research lies in the significant increase in accuracy compared to previous methods, which demonstrates the potential to improve protection against malicious websites in an increasingly complex and risky digital environment. The research contributes to the development of detection technologies for digital security in response to the growing challenge of cybercrime.
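The pipeline named in the abstract (class balancing by hybrid sampling, Logistic Regression, 5-fold cross-validation) can be illustrated with a minimal sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the function names (hybrid_sample, fit_logreg, cross_val_accuracy, make_toy_data) are hypothetical, "hybrid sampling" is read as random undersampling of the majority class combined with random oversampling of the minority class, the model is fitted with plain batch gradient descent, and the data are two synthetic numeric features rather than the Malicious URLs Dataset. Resampling is applied to the training folds only, which is a design choice of this sketch; the abstract does not say where in the pipeline the balancing happens.

```python
import math
import random


def hybrid_sample(X, y, rng):
    """Balance two classes by meeting in the middle: randomly undersample the
    majority class and randomly oversample (with replacement) the minority."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = sorted((pos, neg), key=len)
    target = (len(minority) + len(majority)) // 2
    keep = rng.sample(majority, target)  # undersample the majority class
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    idx = keep + minority + extra  # minority rows plus duplicated minority rows
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logreg(X, y, lr=0.1, epochs=200):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(model, x):
    w, b = model
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


def cross_val_accuracy(X, y, k=5, seed=0):
    """Mean accuracy over k folds; only the training part is resampled."""
    rng = random.Random(seed)
    order = list(range(len(X)))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in order if i not in held_out]
        X_tr, y_tr = hybrid_sample([X[i] for i in train],
                                   [y[i] for i in train], rng)
        model = fit_logreg(X_tr, y_tr)
        correct = sum(predict(model, X[i]) == y[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / k


def make_toy_data(n_benign=300, n_malicious=60, seed=1):
    """Synthetic, imbalanced two-feature data (label 1 = malicious)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, count, centre in ((0, n_benign, -1.0), (1, n_malicious, 1.0)):
        for _ in range(count):
            X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
            y.append(label)
    return X, y


X, y = make_toy_data()
mean_acc = cross_val_accuracy(X, y)
print(f"mean 5-fold accuracy: {mean_acc:.3f}")
```

The printed figure describes the synthetic data only and is not comparable to the accuracies reported above. Resampling inside each fold keeps duplicated minority rows out of the held-out fold, so the cross-validated estimate is not inflated by copies of training rows.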

Author Biographies

Puan Bening Pastika, Universitas Negeri Semarang

Computer Science Department, Faculty of Mathematics and Natural Sciences

Alamsyah Alamsyah, Universitas Negeri Semarang

Computer Science Department, Faculty of Mathematics and Natural Sciences
