Abstract

Glioblastoma is a highly aggressive primary brain tumor with a low survival rate. One of the main challenges in analyzing glioblastoma gene expression data is the presence of missing values, which can reduce biclustering accuracy and distort biological interpretation. This research compared six imputation methods, namely k-nearest neighbors (KNN), mean imputation, singular value decomposition, nonnegative matrix factorization, soft impute, and autoencoder, on the GSE4290 gene expression dataset with missing-value rates ranging from 5% to 50%. An evaluation using root mean square error (RMSE), mean absolute error (MAE), and structural similarity index measure (SSIM) showed that soft impute performed best at all missing-value rates, with an RMSE of 0.0076, an MAE of 0.0073, and a perfect SSIM of 1.0000 at 50% missing values. Meanwhile, the deep learning-based autoencoder degraded significantly at high missing-value rates. These findings indicate that more complex models are not always superior, and that regularization-based approaches such as soft impute are more effective at preserving the biological structure of the data. The results of this research contribute to the optimization of imputation strategies to improve the accuracy of biclustering analysis in glioblastoma studies.
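
The evaluation protocol described in the abstract (mask a known fraction of entries, impute, then score the imputed entries against the held-out truth) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a synthetic low-rank matrix rather than GSE4290, a bare-bones iterative soft-thresholded SVD as a stand-in for soft impute, and hypothetical helper names (`mask_at_random`, `mean_impute`, `soft_impute`) with an arbitrarily chosen penalty `lam`. SSIM is omitted for brevity.

```python
import numpy as np

def mask_at_random(X, rate, rng):
    """Return a copy of X with roughly `rate` of its entries set to NaN, plus the mask."""
    X_missing = X.copy()
    mask = rng.random(X.shape) < rate
    X_missing[mask] = np.nan
    return X_missing, mask

def mean_impute(X_missing):
    """Fill missing entries with the observed mean of their column."""
    col_means = np.nanmean(X_missing, axis=0)
    return np.where(np.isnan(X_missing), col_means, X_missing)

def soft_impute(X_missing, lam=5.0, n_iter=200, tol=1e-6):
    """Simplified soft impute: repeat SVD with soft-thresholded singular values.

    `lam` is the nuclear-norm penalty (an assumed value, not taken from the paper).
    """
    observed = ~np.isnan(X_missing)
    Z = np.zeros_like(X_missing)
    for _ in range(n_iter):
        # Keep observed entries, use the current low-rank estimate elsewhere.
        filled = np.where(observed, X_missing, Z)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - lam, 0.0)  # soft-threshold the singular values
        Z_new = (U * s) @ Vt
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    return np.where(observed, X_missing, Z)

def rmse(truth, imputed, mask):
    """Root mean square error over the masked (held-out) entries only."""
    return float(np.sqrt(np.mean((truth[mask] - imputed[mask]) ** 2)))

def mae(truth, imputed, mask):
    """Mean absolute error over the masked (held-out) entries only."""
    return float(np.mean(np.abs(truth[mask] - imputed[mask])))

# Synthetic "expression" matrix: rank 3 plus small noise (samples x genes).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 50))
X += 0.1 * rng.normal(size=X.shape)

for rate in (0.05, 0.25, 0.50):
    X_missing, mask = mask_at_random(X, rate, rng)
    for name, impute in (("mean", mean_impute), ("soft impute", soft_impute)):
        X_hat = impute(X_missing)
        print(f"rate={rate:.2f} {name:12s} "
              f"RMSE={rmse(X, X_hat, mask):.4f} MAE={mae(X, X_hat, mask):.4f}")
```

Scoring only the masked entries keeps the comparison fair, since every method reproduces the observed entries exactly. Results on this toy matrix say nothing about the figures reported for GSE4290.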

Keywords

Glioblastoma; Gene Expression; Missing Value Imputation; Soft Impute

Article Details

How to Cite
Silalahi, A., Siswantining, T., & Pramana, S. (2025). Evaluation of Biclustering Imputation Methods for Glioblastoma Gene Expression Data. Enthusiastic: International Journal of Applied Statistics and Data Science, 5(1), 67–77. https://doi.org/10.20885/enthusiastic.vol5.iss1.art7
