Main Article Content

Abstract

Loan processing is an important aspect of the financial industry, where the right decisions must be made to determine loan approval or rejection. However, default by loan applicants has become a significant concern for financial institutions. This research therefore applied ensemble learning with the Random Forest and Extreme Gradient Boosting (XGBoost) algorithms, handling the imbalanced data with the Synthetic Minority Over-sampling Technique (SMOTE). The aim was to improve accuracy and precision in credit risk assessment and reduce human workload. Both algorithms were trained on a dataset of 4,296 observations with 13 variables relevant to loan approval decisions. The research process comprised data exploration, data preprocessing, data splitting, model training, model evaluation using accuracy, sensitivity, specificity, and F1-score, model selection with 10-fold cross-validation, and identification of important variables. The results showed that XGBoost with imbalanced-data handling achieved the highest accuracy of 98.52%, with a good balance between sensitivity of 98.83%, specificity of 98.01%, and F1-score of 98.81%. The most important variables in determining loan approval were credit score, loan term, loan amount, and annual income.
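The workflow above combines SMOTE-style oversampling with confusion-matrix metrics (accuracy, sensitivity, specificity, F1-score). The following is a minimal stdlib-only sketch of both ideas, not the imbalanced-learn/XGBoost implementation the paper actually used; the function names and the toy data are illustrative assumptions.

```python
import random

def smote_like(minority, n_new, k=2, rng=None):
    """Sketch of SMOTE: create n_new synthetic minority samples by
    interpolating between a random sample and one of its k nearest
    neighbours (Euclidean distance)."""
    rng = rng or random.Random(42)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic

def metrics(tp, tn, fp, fn):
    """Evaluation metrics from a binary confusion matrix, as used
    in the paper's model evaluation step."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall on the positive (approved) class
    specificity = tn / (tn + fp)   # recall on the negative (rejected) class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, f1
```

In practice the oversampling would be applied only to the training folds of the 10-fold cross-validation, so that synthetic samples never leak into the evaluation data.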

Keywords

classification; ensemble learning; loan; random forest; XGBoost

Article Details

How to Cite
Anadra, R., Sadik, K., Soleh, A. M., & Astari, R. A. (2024). Loan Approval Classification Using Ensemble Learning on Imbalanced Data. Enthusiastic : International Journal of Applied Statistics and Data Science, 4(2), 85–95. https://doi.org/10.20885/enthusiastic.vol4.iss2.art1

References

  1. M.C. Aniceto, F. Barboza, and H. Kimura, “Machine Learning Predictivity Applied to Consumer Creditworthiness,” Future Business Journal, Vol. 6, No. 1, pp. 1–14, 2020, doi: 10.1186/s43093-020-00041-w.
  2. A.A. Ibrahim, R.L. Ridwan, M.M. Muhammed, R.O. Abdulaziz, and G.A. Saheed, “Comparison of the CatBoost Classifier with other Machine Learning Methods,” International Journal of Advanced Computer Science and Applications, Vol. 11, No. 11, pp. 738–748, 2020, doi: 10.14569/IJACSA.2020.0111190.
  3. K. Gupta, B. Chakrabarti, A.A. Ansari, S.S. Rautaray, and M. Pandey, “Loanification - Loan Approval Classification using Machine Learning Algorithms,” in Proceedings of the International Conference on Innovative Computing & Communication (ICICC) 2021, 2021, pp. 1–4, doi: 10.2139/ssrn.3833303.
  4. M. Hanafy and R. Ming, “Machine Learning Approaches for Auto Insurance Big Data,” Risks, Vol. 9, No. 2, pp. 1–23, 2021, doi: 10.3390/risks9020042.
  5. K.A. Nguyen, W. Chen, and B. Lin, “Comparison of Ensemble Machine Learning Methods for Soil Erosion Pin Measurements,” ISPRS International Journal of Geo-Information, Vol. 10, No. 1, pp. 1–17, 2021, doi: 10.3390/ijgi10010042.
  6. W. Zhang, C. Wu, H. Zhong, Y. Li, and L. Wang, “Prediction of Undrained Shear Strength Using Extreme Gradient Boosting And Random Forest Based On Bayesian Optimization,” Geoscience Frontiers, Vol. 12, No. 1, pp. 469–477, 2021, doi: 10.1016/j.gsf.2020.03.007.
  7. T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” in KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794, doi: 10.1145/2939672.2939785.
  8. J.M.A.S. Dachi and P. Sitompul, “Comparative Analysis of XGBoost Algorithm and Random Forest Ensemble Learning Algorithm on Credit Decision Classification,” Jurnal Riset Rumpun Matematika dan Ilmu Pengetahuan Alam, Vol. 2, No. 2, pp. 87–103, 2023, doi: 10.55606/jurrimipa.v2i2.1470.
  9. N.V. Chawla, “Data Mining for Imbalanced Datasets: An Overview,” in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds. Boston, MA, USA: Springer, 2009, pp. 875–886.
  10. A. Dinh, S. Miertschin, A. Young, and S.D. Mohanty, “A Data-Driven Approach to Predicting Diabetes and Cardiovascular Disease With Machine Learning,” BMC Medical Informatics and Decision Making, Vol. 19, No. 1, pp. 1–15, 2019, doi: 10.1186/s12911-019-0918-5.
  11. J. Tanha, Y. Abdi, N. Samadi, N. Razzaghi, and M. Asadpour, “Boosting Methods for Multi-Class Imbalanced Data Classification: An Experimental Review,” Journal of Big Data, Vol. 7, No. 1, 2020, Art. No. 70, doi: 10.1186/s40537-020-00349-y.
  12. M. Kayri, I. Kayri, and M.T. Gencoglu, “The Performance Comparison of Multiple Linear Regression, Random Forest and Artificial Neural Network by Using Photovoltaic and Atmospheric Data,” in 14th International Conference on Engineering of Modern Electric Systems (EMES), 2017, pp. 1–4, doi: 10.1109/EMES.2017.7980368.
  13. M. Nabipour, P. Nayyeri, H. Jabani, S. Shahab, and A. Mosavi, “Predicting Stock Market Trends Using Machine Learning and Deep Learning Algorithms Via Continuous and Binary Data; A Comparative Analysis,” IEEE Access, Vol. 8, pp. 150199–150212, 2020, doi: 10.1109/ACCESS.2020.3015966.
  14. A. Fernández, S. García, F. Herrera, and N.V. Chawla, “SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary,” Journal of Artificial Intelligence Research, Vol. 61, pp. 863–905, 2018, doi: 10.1613/jair.1.11192.
  15. Ž. Đ. Vujović, “Classification Model Evaluation Metrics,” International Journal of Advanced Computer Science and Applications, Vol. 12, No. 6, pp. 599–606, 2021, doi: 10.14569/IJACSA.2021.0120670.
  16. L. Sullivan and W.W. LaMorte, “InterQuartile Range (IQR).” Accessed: 5 September 2024. [Online]. Available: https://sphweb.bumc.bu.edu/otlt/mphmodules/bs/bs704_summarizingdata/bs704_summarizingdata7.html
  17. C.S.K. Dash, A.K. Behera, S. Dehuri, and A. Ghosh, “An Outliers Detection and Elimination Framework in Classification Task of Data Mining,” Decision Analytics Journal, Vol. 6, 2023, Art. No. 100164, doi: 10.1016/j.dajour.2023.100164.
  18. M. Kuhn and K. Johnson, Applied Predictive Modeling. New York, NY, USA: Springer, 2013.