Main Article Content

Abstract

Loan processing is an important aspect of the financial industry, where the right decisions must be made to determine loan approval or rejection. However, default by loan applicants has become a significant concern for financial institutions. This research therefore applied ensemble learning with the Random Forest and Extreme Gradient Boosting (XGBoost) algorithms, handling the imbalanced data with the Synthetic Minority Over-sampling Technique (SMOTE). The aim was to improve accuracy and precision in credit risk assessment and reduce human workload. Both algorithms were trained on a dataset of 4,296 observations with 13 variables relevant to loan approval decisions. The research process comprised data exploration, data preprocessing, data splitting, model training, model evaluation using accuracy, sensitivity, specificity, and F1-score, model selection with 10-fold cross-validation, and identification of important variables. The results showed that XGBoost with imbalanced-data handling achieved the highest accuracy of 98.52%, with a good balance between sensitivity of 98.83%, specificity of 98.01%, and F1-score of 98.81%. The most important variables in determining loan approval were credit score, loan term, loan amount, and annual income.
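The workflow above combines SMOTE-style oversampling with confusion-matrix metrics (accuracy, sensitivity, specificity, F1-score). The following is a minimal stdlib-only sketch of both ideas, not the imbalanced-learn/XGBoost implementation the paper actually used; the function names and the toy data are illustrative assumptions.

```python
import random

def smote_like(minority, n_new, k=2, rng=None):
    """Sketch of SMOTE: create n_new synthetic minority samples by
    interpolating between a random sample and one of its k nearest
    neighbours (Euclidean distance)."""
    rng = rng or random.Random(42)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic

def metrics(tp, tn, fp, fn):
    """Evaluation metrics from a binary confusion matrix, as used
    in the paper's model evaluation step."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall on the positive (approved) class
    specificity = tn / (tn + fp)   # recall on the negative (rejected) class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, f1
```

In practice the oversampling would be applied only to the training folds of the 10-fold cross-validation, so that synthetic samples never leak into the evaluation data.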

Keywords

classification; ensemble learning; loan; random forest; XGBoost

Article Details

How to Cite
Anadra, R., Sadik, K., Soleh, A. M., & Astari, R. A. (2024). Loan Approval Classification Using Ensemble Learning on Imbalanced Data. Enthusiastic : International Journal of Applied Statistics and Data Science, 4(2), 85–95. https://doi.org/10.20885/enthusiastic.vol4.iss2.art1

References

  1. M.C. Aniceto, F. Barboza, and H. Kimura, “Machine Learning Predictivity Applied to Consumer Creditworthiness,” Future Business Journal, Vol. 6, No. 1, pp. 1–14, 2020, doi: 10.1186/s43093-020-00041-w.
  2. A.A. Ibrahim, R.L. Ridwan, M.M. Muhammed, R.O. Abdulaziz, and G.A. Saheed, “Comparison of the CatBoost Classifier with other Machine Learning Methods,” International Journal of Advanced Computer Science and Applications, Vol. 11, No. 11, pp. 738–748, 2020, doi: 10.14569/IJACSA.2020.0111190.
  3. K. Gupta, B. Chakrabarti, A.A. Ansari, S.S. Rautaray, and M. Pandey, “Loanification - Loan Approval Classification using Machine Learning Algorithms,” in Proceedings of the International Conference on Innovative Computing & Communication (ICICC) 2021, 2021, pp. 1–4, doi: 10.2139/ssrn.3833303.
  4. M. Hanafy and R. Ming, “Machine Learning Approaches for Auto Insurance Big Data,” Risks, Vol. 9, No. 2, pp. 1–23, 2021, doi: 10.3390/risks9020042.
  5. K.A. Nguyen, W. Chen, and B. Lin, “Comparison of Ensemble Machine Learning Methods for Soil Erosion Pin Measurements,” ISPRS International Journal of Geo-Information, Vol. 10, No. 1, pp. 1–17, 2021, doi: 10.3390/ijgi10010042.
  6. W. Zhang, C. Wu, H. Zhong, Y. Li, and L. Wang, “Prediction of Undrained Shear Strength Using Extreme Gradient Boosting And Random Forest Based On Bayesian Optimization,” Geoscience Frontiers, Vol. 12, No. 1, pp. 469–477, 2021, doi: 10.1016/j.gsf.2020.03.007.
  7. T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” in KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794, doi: 10.1145/2939672.2939785.
  8. J.M.A.S. Dachi and P. Sitompul, “Comparative Analysis of XGBoost Algorithm and Random Forest Ensemble Learning Algorithm on Credit Decision Classification,” Jurnal Riset Rumpun Matematika dan Ilmu Pengetahuan Alam, Vol. 2, No. 2, pp. 87–103, 2023, doi: 10.55606/jurrimipa.v2i2.1470.
  9. N.V. Chawla, “Data Mining for Imbalanced Datasets: An Overview,” in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds. Boston, MA, USA: Springer, 2009, pp. 875–886.
  10. A. Dinh, S. Miertschin, A. Young, and S.D. Mohanty, “A Data-Driven Approach to Predicting Diabetes and Cardiovascular Disease With Machine Learning,” BMC Medical Informatics and Decision Making, Vol. 19, No. 1, pp. 1–15, 2019, doi: 10.1186/s12911-019-0918-5.
  11. J. Tanha, Y. Abdi, N. Samadi, N. Razzaghi, and M. Asadpour, “Boosting Methods for Multi-Class Imbalanced Data Classification: An Experimental Review,” Journal of Big Data, Vol. 7, No. 1, 2020, Art. No. 70, doi: 10.1186/s40537-020-00349-y.
  12. M. Kayri, I. Kayri, and M.T. Gencoglu, “The Performance Comparison of Multiple Linear Regression, Random Forest and Artificial Neural Network by Using Photovoltaic and Atmospheric Data,” in 14th International Conference on Engineering of Modern Electric Systems (EMES), 2017, pp. 1–4, doi: 10.1109/EMES.2017.7980368.
  13. M. Nabipour, P. Nayyeri, H. Jabani, S. Shahab, and A. Mosavi, “Predicting Stock Market Trends Using Machine Learning and Deep Learning Algorithms Via Continuous and Binary Data; A Comparative Analysis,” IEEE Access, Vol. 8, pp. 150199–150212, 2020, doi: 10.1109/ACCESS.2020.3015966.
  14. A. Fernández, S. García, F. Herrera, and N.V. Chawla, “SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary,” Journal of Artificial Intelligence Research, Vol. 61, pp. 863–905, 2018, doi: 10.1613/jair.1.11192.
  15. Ž. Đ. Vujović, “Classification Model Evaluation Metrics,” International Journal of Advanced Computer Science and Applications, Vol. 12, No. 6, pp. 599–606, 2021, doi: 10.14569/IJACSA.2021.0120670.
  16. L. Sullivan and W.W. LaMorte, “InterQuartile Range (IQR).” Accessed: 5 September 2024. [Online]. Available: https://sphweb.bumc.bu.edu/otlt/mphmodules/bs/bs704_summarizingdata/bs704_summarizingdata7.html
  17. C.S.K. Dash, A.K. Behera, S. Dehuri, and A. Ghosh, “An Outliers Detection and Elimination Framework in Classification Task of Data Mining,” Decision Analytics Journal, Vol. 6, 2023, Art. No. 100164, doi: 10.1016/j.dajour.2023.100164.
  18. M. Kuhn and K. Johnson, Applied Predictive Modeling. New York, NY, USA: Springer, 2013.