Finding Biomarkers from a High-Dimensional Imbalanced Dataset Using the Hybrid Method of Random Undersampling and Lasso


  • Masithoh Yessi Rochayani, Universitas Brawijaya
  • Umu Sa'adah, Universitas Brawijaya
  • Ani Budi Astuti, Universitas Brawijaya



biomarkers, high-dimensional imbalanced dataset, Random Undersampling (RUS), Lasso hybrid method


This research applies undersampling and gene selection as a first step toward cancer classification on gene expression datasets that are both high-dimensional and class-imbalanced. It investigates whether undersampling before gene selection gives better results than gene selection alone. Random Undersampling (RUS) is used for undersampling and Lasso for gene selection. The selected genes are then validated against theory. To explore the effectiveness of applying RUS before gene selection, two gene expression datasets are used. Both consist of two classes, 1,545 observations, and 10,935 genes, but they differ in imbalance ratio. The results show that both gene selection approaches, namely Lasso and RUS + Lasso, identify several important biomarkers, and the resulting models have high accuracy. However, the models are complicated because they involve too many genes. The results also show that undersampling has little effect when it is applied to a less imbalanced dataset, whereas on a highly imbalanced dataset it can remove a lot of information from the majority class. The overall effectiveness of undersampling therefore remains unclear. Simulation studies in future research could investigate when undersampling should be applied.
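The two-step pipeline described above (undersample first, then select genes with Lasso) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the synthetic data, undersampling to exactly equal class sizes, the fixed penalty `lam`, and the proximal gradient solver. The helper names `random_undersample` and `fit_lasso_logistic` are hypothetical. Only the Python standard library is used so that the example stays self-contained.

```python
import math
import random


def random_undersample(X, y, rng):
    """Randomly drop majority-class rows until both classes have equal size."""
    idx0 = [i for i, label in enumerate(y) if label == 0]
    idx1 = [i for i, label in enumerate(y) if label == 1]
    minority, majority = sorted((idx0, idx1), key=len)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in kept], [y[i] for i in kept]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(X, y, lam, step=0.1, n_iter=300):
    """L1-penalized logistic regression via proximal gradient descent.

    The intercept is left unpenalized. `lam`, `step`, and `n_iter` are
    illustrative values, not tuned ones (assumption).
    """
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * p
        for row, label in zip(X, y):
            z = intercept + sum(b * x for b, x in zip(beta, row))
            resid = sigmoid(z) - label
            grad0 += resid / n
            for j in range(p):
                grad[j] += resid * row[j] / n
        intercept -= step * grad0
        beta = [soft_threshold(b - step * g, step * lam)
                for b, g in zip(beta, grad)]
    return intercept, beta


# Synthetic stand-in for a gene expression matrix (assumption): 170 majority
# and 30 minority observations, 20 "genes", with only gene 0 informative.
rng = random.Random(0)
n_major, n_minor, n_genes = 170, 30, 20
y = [0] * n_major + [1] * n_minor
X = [[rng.gauss(3.0 if (label == 1 and j == 0) else 0.0, 1.0)
      for j in range(n_genes)]
     for label in y]

# Step 1: RUS. Step 2: Lasso on the balanced subset.
X_bal, y_bal = random_undersample(X, y, rng)
intercept, beta = fit_lasso_logistic(X_bal, y_bal, lam=0.1)

# Genes whose coefficient was not shrunk exactly to zero count as selected.
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("class sizes after RUS:", y_bal.count(0), y_bal.count(1))
print("selected gene indices:", selected)
```

Calling `fit_lasso_logistic(X, y, lam=0.1)` on the full data instead gives the Lasso-only variant, so the two selected gene sets can be compared side by side. In a real analysis the penalty would normally be chosen by cross-validation rather than fixed in advance.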

Author Biographies

Masithoh Yessi Rochayani, Universitas Brawijaya

Department of Statistics, Faculty of Mathematics and Natural Sciences

Umu Sa'adah, Universitas Brawijaya

Department of Statistics, Faculty of Mathematics and Natural Sciences

Ani Budi Astuti, Universitas Brawijaya

Department of Statistics, Faculty of Mathematics and Natural Sciences

