Finding Biomarkers from a High-Dimensional Imbalanced Dataset Using the Hybrid Method of Random Undersampling and Lasso

Masithoh Yessi Rochayani; Umu Sa'adah; Ani Budi Astuti

doi:10.21512/comtech.v11i2.6452

Keywords:

biomarkers, high-dimensional imbalanced dataset, Random Undersampling (RUS), Lasso hybrid method

Abstract

This research applied undersampling and gene selection as a first step toward cancer classification on gene expression datasets that are high-dimensional and class-imbalanced. It investigated whether applying undersampling before gene selection gave better results than gene selection alone. The undersampling method was Random Undersampling (RUS), and the gene selection method was Lasso. The selected genes were then validated on the basis of theory. To assess the effectiveness of applying RUS before gene selection, the researchers used two gene expression datasets. Both datasets consisted of two classes, 1,545 observations, and 10,935 genes, but they had different imbalance ratios. The results showed that both gene selection methods, Lasso and RUS + Lasso, identified several important biomarkers and yielded models with high accuracy. However, the models were complicated because they involved too many genes. The research also found that undersampling had no noticeable effect when applied to the less imbalanced dataset, whereas on the highly imbalanced dataset it could remove a lot of information from the majority class. The effectiveness of undersampling therefore remains unclear. Simulation studies in future research could investigate when undersampling should be implemented.
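The two pipelines compared in the abstract, Lasso alone and RUS followed by Lasso, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic (300 observations, 50 "genes", three of which carry signal), the penalty value `lam` is arbitrary, and the function names `random_undersample` and `lasso_logistic` are hypothetical. The Lasso here is an L1-penalized logistic regression fitted with a plain proximal-gradient loop so that the example depends only on NumPy; it is not the authors' implementation or dataset.

```python
import numpy as np

def random_undersample(X, y, rng):
    """Randomly drop rows so every class keeps as many rows as the smallest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()
    return X[keep], y[keep]

def lasso_logistic(X, y, lam, n_iter=500):
    """L1-penalized logistic regression via proximal gradient descent (ISTA)."""
    n, p = X.shape
    beta, intercept = np.zeros(p), 0.0
    # Step size 1/L, with L an upper bound on the curvature of the mean logistic loss.
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2 + n)
    for _ in range(n_iter):
        eta = np.clip(X @ beta + intercept, -30.0, 30.0)
        resid = 1.0 / (1.0 + np.exp(-eta)) - y      # predicted probability minus label
        intercept -= step * resid.mean()            # the intercept is not penalized
        z = beta - step * (X.T @ resid) / n         # gradient step on the coefficients
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return beta, intercept

# Synthetic, imbalanced data (assumed for illustration): 30 positives, 270 negatives.
rng = np.random.default_rng(0)
n, p, n_pos = 300, 50, 30
X = rng.normal(size=(n, p))
y = np.zeros(n)
y[:n_pos] = 1.0
X[:n_pos, :3] += 2.0        # only the first three "genes" differ between classes

lam = 0.1                   # arbitrary penalty; in practice chosen by cross-validation

# Pipeline 1: Lasso on the full imbalanced data.
beta_full, _ = lasso_logistic(X, y, lam)

# Pipeline 2: RUS first, then Lasso on the balanced subset.
X_rus, y_rus = random_undersample(X, y, rng)
beta_rus, _ = lasso_logistic(X_rus, y_rus, lam)

print("Lasso selected genes:      ", np.flatnonzero(beta_full))
print("RUS + Lasso selected genes:", np.flatnonzero(beta_rus))
```

Genes with nonzero coefficients are the ones each pipeline selects. Note that RUS discards 240 of the 270 majority-class rows in this toy setting, which mirrors the information loss the abstract describes for highly imbalanced data.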

Author Biographies

Masithoh Yessi Rochayani, Universitas Brawijaya

Department of Statistics, Faculty of Mathematics and Natural Sciences

Umu Sa'adah, Universitas Brawijaya

Department of Statistics, Faculty of Mathematics and Natural Sciences

Ani Budi Astuti, Universitas Brawijaya

Department of Statistics, Faculty of Mathematics and Natural Sciences
