Abstract
Glioblastoma is a highly aggressive primary brain tumor with a low survival rate. One of the main challenges in analyzing glioblastoma gene expression data is the presence of missing values, which can reduce biclustering accuracy and affect biological interpretation. This research compared six imputation methods, namely k-nearest neighbors (KNN), mean imputation, singular value decomposition, nonnegative matrix factorization, soft impute, and autoencoder, on the GSE4290 gene expression dataset with missing-value rates ranging from 5% to 50%. An evaluation using root mean square error (RMSE), mean absolute error (MAE), and structural similarity index measure (SSIM) showed that soft impute performed best at all missing-value rates, with an RMSE of 0.0076, an MAE of 0.0073, and a perfect SSIM of 1.0000 at 50% missing values. Meanwhile, the deep learning-based autoencoder showed significant performance degradation at high missing-value rates. These findings indicate that more complex models are not always superior, and that regularization-based approaches such as soft impute are more effective at preserving the biological structure of the data. The results of this research contribute to the optimization of imputation strategies to improve the accuracy of biclustering analysis in glioblastoma studies.
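The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a synthetic low-rank matrix in place of GSE4290, a simple NumPy version of soft impute (iterative soft-thresholded SVD with warm starts, in the spirit of Mazumder et al.), and column-mean imputation as the baseline. The function names, the 30% missing rate, and the `lambdas` path are assumptions chosen for illustration; SSIM is omitted to keep the example dependency-free.

```python
import numpy as np

def soft_impute(X, lambdas, max_iter=100, tol=1e-6):
    """Fill NaNs in X by iterative soft-thresholded SVD (illustrative sketch)."""
    missing = np.isnan(X)
    Z = np.where(missing, 0.0, X)  # start with zeros in the gaps
    for lam in lambdas:  # decreasing path; each solution warm-starts the next
        for _ in range(max_iter):
            U, s, Vt = np.linalg.svd(Z, full_matrices=False)
            # Shrink singular values toward zero (nuclear-norm regularization).
            low_rank = (U * np.maximum(s - lam, 0.0)) @ Vt
            # Keep observed entries, refresh only the missing ones.
            Z_new = np.where(missing, low_rank, X)
            change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
            Z = Z_new
            if change < tol:
                break
    return Z

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def mae(a, b):
    return float(np.mean(np.abs(a - b)))

# Synthetic stand-in for an expression matrix: rank-3, 60 "genes" x 40 "samples".
rng = np.random.default_rng(0)
truth = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 40))

# Hide 30% of the entries completely at random.
mask = rng.random(truth.shape) < 0.3
observed = truth.copy()
observed[mask] = np.nan

# Baseline: replace each missing entry with its column mean.
col_means = np.nanmean(observed, axis=0)
mean_filled = np.where(mask, col_means, observed)

# Soft impute along a decreasing regularization path (values are assumptions).
soft_filled = soft_impute(observed, lambdas=np.geomspace(20.0, 0.5, 8))

# Errors are computed only on the entries that were hidden.
for name, filled in [("mean", mean_filled), ("soft impute", soft_filled)]:
    print(name, "RMSE:", rmse(filled[mask], truth[mask]),
          "MAE:", mae(filled[mask], truth[mask]))
```

Because the errors are measured only on the artificially hidden entries, the same masking scheme can be repeated at each missing-value rate to compare methods. Results on real expression data will depend on preprocessing and on how the regularization strength is tuned.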
