Breast Cancer Classification Using Outlier Detection and Variance Inflation Factor
DOI:
https://doi.org/10.21512/emacsjournal.v5i1.9223
Keywords:
Breast Cancer Detection, Logistic Regression, Random Forest, Classification, Variance Inflation Factor
Abstract
Breast cancer is one of the most prevalent malignant tumors. It develops in breast tissue when cells grow uncontrollably and overtake the surrounding healthy tissue. Several features or patient conditions can be used in a machine learning approach to predict breast cancer; here, machine learning is used to determine whether a tumor is malignant or benign. The dataset used in this research is the Wisconsin Breast Cancer (Diagnostic) Data Set, which contains 32 attributes and 569 records. In this study, outliers are eliminated using the upper and lower quartiles of each feature, and feature selection is then carried out by removing features that have a high variance inflation factor. The machine learning methods used are Logistic Regression, Random Forest, KNN, SVC, XGBoost, Gradient Boosting, and Ridge Classifier. These methods were chosen because the target to be predicted has two labels, benign and malignant. The results show that feature selection using the variance inflation factor increases the accuracy of the Logistic Regression and Random Forest methods from 98.25% to 99.12%, which is the highest accuracy among the methods tested. Future research will try other optimization techniques for hyperparameter tuning.
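The two preprocessing steps named in the abstract, quartile-based outlier removal and feature selection by variance inflation factor (VIF), can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 1.5 × IQR fence, the VIF threshold of 10, the one-feature-at-a-time dropping strategy, the function names, and the synthetic data are all assumptions, since the abstract does not state these details. The VIF is computed as the diagonal of the inverse correlation matrix, which is equivalent to 1 / (1 − R²) from regressing each feature on the others.

```python
import numpy as np


def remove_iqr_outliers(X, k=1.5):
    """Keep rows where every feature lies within [Q1 - k*IQR, Q3 + k*IQR].

    The multiplier k=1.5 is an assumed convention, not taken from the paper.
    """
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    mask = np.all((X >= q1 - k * iqr) & (X <= q3 + k * iqr), axis=1)
    return X[mask], mask


def vif(X):
    """VIF of each column, read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


def drop_high_vif(X, names, threshold=10.0):
    """Drop the column with the largest VIF until all VIFs are <= threshold.

    The threshold of 10 is an assumed rule of thumb, not taken from the paper.
    """
    names = list(names)
    while X.shape[1] > 1:
        scores = vif(X)
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names


# Synthetic stand-in data (hypothetical): "c" is almost a copy of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.01 * rng.normal(size=200)
X = np.column_stack([a, b, c])
X[0] = [50.0, 0.0, 50.0]  # plant one obvious outlier row

X_clean, mask = remove_iqr_outliers(X)
X_sel, kept = drop_high_vif(X_clean, ["a", "b", "c"])
print(X_clean.shape, kept)
```

On this toy data, the planted outlier row is filtered out and one of the two nearly collinear columns is dropped, leaving two features with low VIF. The reduced feature matrix would then be passed to the classifiers listed in the abstract.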
License
Copyright (c) 2023 Engineering, MAthematics and Computer Science (EMACS) Journal
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.