Finding Biomarkers from a High-Dimensional Imbalanced Dataset Using the Hybrid Method of Random Undersampling and Lasso
Keywords: biomarkers, high-dimensional imbalanced dataset, Random Undersampling (RUS), Lasso hybrid method
This research applied undersampling and gene selection as a first step toward cancer classification in gene expression datasets that are high-dimensional and class-imbalanced. It investigated whether applying undersampling before gene selection gives better results than gene selection alone. The undersampling method was Random Undersampling (RUS), and the gene selection method was Lasso. The selected genes were then validated against theory. To explore the effectiveness of applying RUS before gene selection, the researchers used two gene expression datasets. Both datasets consisted of two classes, 1,545 observations, and 10,935 genes, but they had different imbalance ratios. The results show that both gene selection methods, Lasso and RUS + Lasso, can identify several important biomarkers and that the resulting models have high accuracy. However, the models are complicated because they involve too many genes. The results also show that undersampling has no effect when it is applied to a less imbalanced dataset, whereas in a highly imbalanced dataset it can remove a lot of information from the majority class. Overall, the effectiveness of undersampling remains unclear. Simulation studies could be carried out in future research to investigate when undersampling should be applied.
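The RUS + Lasso pipeline described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `random_undersample` and `lasso_logistic` are hypothetical, the data are synthetic rather than the study's datasets, the penalty value `lam` is arbitrary (in practice it would usually be chosen by cross-validation), and the L1-penalized logistic regression is fitted with a simple proximal gradient loop, which is not necessarily the solver the authors used.

```python
import numpy as np


def random_undersample(X, y, rng):
    """Randomly drop majority-class rows until every class has the minority size."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()
    return X[keep], y[keep]


def lasso_logistic(X, y, lam, n_iter=500):
    """L1-penalized logistic regression via proximal gradient descent (ISTA).

    The intercept is left unpenalized; `lam` is the L1 penalty weight.
    """
    n, p = X.shape
    beta = np.zeros(p)
    intercept = 0.0
    # Step size 1/L, where L is an upper bound on the Lipschitz constant
    # of the gradient of the average logistic loss (including the intercept).
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2 + n)
    for _ in range(n_iter):
        z = np.clip(X @ beta + intercept, -30.0, 30.0)
        prob = 1.0 / (1.0 + np.exp(-z))
        resid = prob - y
        grad = X.T @ resid / n
        intercept -= step * resid.mean()
        b = beta - step * grad
        # Soft-thresholding sets small coefficients exactly to zero.
        beta = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)
    return beta, intercept


# Synthetic, illustrative data: 300 samples, 50 "genes", 30 minority cases.
rng = np.random.default_rng(0)
y = np.array([1] * 30 + [0] * 270)
X = rng.standard_normal((y.size, 50))
X[y == 1, :3] += 2.0  # assume only the first three genes carry signal

# Step 1: RUS balances the classes. Step 2: Lasso selects genes.
X_bal, y_bal = random_undersample(X, y, rng)
beta, intercept = lasso_logistic(X_bal, y_bal, lam=0.05)
selected = np.flatnonzero(beta)
print("class counts after RUS:", np.bincount(y_bal))
print("number of selected genes:", selected.size)
```

Running the same `lasso_logistic` call on `X, y` instead of `X_bal, y_bal` gives the Lasso-only baseline, so the two sets of selected genes can be compared directly. Note that RUS discards 240 of the 270 majority samples here, which is the information loss the abstract warns about for highly imbalanced data.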
Copyright (c) 2020 Masithoh Yessi Rochayani, Umu Sa'adah, Ani Budi Astuti
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.