Mechanism by Integrating Multi-Source Heterogeneous Data and Ensemble Learning
DOI: https://doi.org/10.62381/ACS.SSFS2025.07
Author(s)
Pingxiang Shi1,2,*
Affiliation(s)
1Department of Intelligent Technology and Services, Cityu University, Macau, China
2Faculty of Data Science, Cityu University, Macau, China
*Corresponding author.
Abstract
With the rapid development of fintech, traditional credit risk assessment models face significant limitations in handling multi-source heterogeneous data (e.g., social networks, transaction behaviors, unstructured text). This paper systematically reviews advances in multi-source data fusion, ensemble learning algorithms, and credit risk early-warning mechanisms, and proposes a novel "data-model-decision" trinity framework. The review finds that heterogeneous data fusion broadens risk identification coverage (e.g., social data improves prediction accuracy for small businesses by 12%), while ensemble learning reduces overfitting through model diversity (AUC gains of 8%-15%). Future research must address data privacy, model interpretability, and dynamic adaptability to move credit risk management toward intelligent, real-time operation.
Keywords
Multi-Source Heterogeneous Data; Ensemble Learning; Credit Default Risk; Early Warning Mechanism; Data Fusion
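The abstract's claim that ensemble learning curbs overfitting through model diversity can be illustrated with a minimal stacking sketch. The dataset, features, and model choices below are illustrative stand-ins (synthetic data in place of fused multi-source credit features), not the paper's actual experimental setup; the AUC comparison simply shows how such a gain would be measured.

```python
# Sketch: combine diverse base learners via stacking and compare test AUC
# against a single baseline model on an imbalanced binary "default" task.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for fused multi-source credit features;
# class weights mimic the rarity of default events.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Baseline: a single linear scorecard-style model.
single = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Stacked ensemble: two diverse tree-based learners, blended by a meta-learner.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
).fit(X_tr, y_tr)

auc_single = roc_auc_score(y_te, single.predict_proba(X_te)[:, 1])
auc_stack = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print(f"single-model AUC: {auc_single:.3f}, stacked AUC: {auc_stack:.3f}")
```

On real credit data, the reported 8%-15% AUC gain would additionally depend on how well the base learners' errors decorrelate across the heterogeneous data sources.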