Machine Learning-Based Malicious Website Detection Using Logistic Regression Algorithm
DOI: https://doi.org/10.21512/emacsjournal.v6i3.11844
Keywords: Malicious Website, Machine Learning, Logistic Regression
Abstract
Cybercrime is a growing threat to anyone who browses the internet. Cybercriminals exploit web vulnerabilities by inserting malicious software that gives them access to the systems of web service users. Because this harms users, detecting malicious websites is necessary to reduce cybercrime. This research aims to improve the effectiveness of malicious website detection by applying the Logistic Regression algorithm, selected for its ability to perform binary classification, which is what distinguishing benign from potentially malicious websites requires. The research emphasizes a carefully optimized preprocessing stage: data cleaning, dataset balancing, and feature mapping are enhanced to improve detection accuracy. Hybrid sampling addresses class imbalance so that the model is trained on representative data from both classes. In the experiments, the developed model recorded an accuracy of 92.60% without cross-validation, which increased to 92.71% with 5-fold cross-validation. The novelty of this research lies in the significant increase in accuracy compared to previous methods, demonstrating the potential to improve protection against malicious website threats in an increasingly complex and risky digital environment. This research contributes to the development of digital security detection technologies that address the ever-growing challenges of cybercrime.
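The pipeline summarized in the abstract (hybrid sampling to balance the classes, Logistic Regression for binary classification, and 5-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the synthetic two-feature data, the helper names (`hybrid_sample`, `fit_logistic`, `cross_validate`, `make_synthetic`), the learning rate, the epoch count, the midpoint balancing target, and the choice to resample only the training folds are all hypothetical, and the accuracy it prints says nothing about the results reported in the paper.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hybrid_sample(rows, rng):
    """Balance classes by undersampling the majority and oversampling the
    minority until both reach the midpoint of the two class sizes
    (the midpoint target is an assumption made for this sketch)."""
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    target = (len(minority) + len(majority)) // 2
    kept = rng.sample(majority, target)  # undersampling without replacement
    # Oversampling: duplicate random minority rows up to the target size.
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    balanced = kept + minority + extra
    rng.shuffle(balanced)
    return balanced


def fit_logistic(rows, lr=0.3, epochs=300):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_features = len(rows[0][0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in rows:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(rows)
    return w, b


def predict(model, x):
    # Label 1 (malicious) when the predicted probability is at least 0.5.
    w, b = model
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0


def accuracy(model, rows):
    return sum(predict(model, x) == y for x, y in rows) / len(rows)


def cross_validate(rows, k, rng):
    """k-fold cross-validation. Resampling is applied to the training folds
    only, so duplicated minority rows cannot appear in the held-out fold."""
    rows = rows[:]
    rng.shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = fit_logistic(hybrid_sample(train, rng))
        scores.append(accuracy(model, folds[i]))
    return sum(scores) / k


def make_synthetic(n_benign, n_malicious, rng):
    """Hypothetical stand-in for URL features: two Gaussian clusters."""
    benign = [([rng.gauss(0, 1), rng.gauss(0, 1)], 0) for _ in range(n_benign)]
    malicious = [([rng.gauss(2.5, 1), rng.gauss(2.5, 1)], 1)
                 for _ in range(n_malicious)]
    return benign + malicious


rng = random.Random(0)
data = make_synthetic(400, 100, rng)  # imbalanced: 80% benign, 20% malicious
mean_acc = cross_validate(data, k=5, rng=rng)
print(f"5-fold mean accuracy on synthetic data: {mean_acc:.3f}")
```

Resampling inside each training fold, rather than once before splitting, is a design choice made here to keep the held-out fold at the original class ratio; the abstract does not state which order the authors used.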
License
Copyright (c) 2024 Engineering, MAthematics and Computer Science Journal (EMACS)
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
a. Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License - Share Alike that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
b. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
c. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.
USER RIGHTS
All articles published Open Access will be immediately and permanently free for everyone to read and download. We are continuously working with our author communities to select the best choice of license options, currently defined for this journal as follows: Creative Commons Attribution-Share Alike (CC BY-SA).