Mechanism by Integrating Multi-Source Heterogeneous Data and Ensemble Learning
DOI: https://doi.org/10.62381/ACS.SSFS2025.07
Author(s)
Pingxiang Shi1,2,*
Affiliation(s)
1Department of Intelligent Technology and Services, Cityu University, Macau, China
2Faculty of Data Science, Cityu University, Macau, China
*Corresponding author.
Abstract
With the rapid development of fintech, traditional credit risk assessment models face significant limitations in handling multi-source heterogeneous data (e.g., social networks, transaction behaviors, unstructured text). This paper systematically reviews advances in multi-source data fusion, ensemble learning algorithms, and credit risk early-warning mechanisms, and proposes a novel "data-model-decision" trinity framework. The review finds that heterogeneous data fusion broadens risk identification coverage (e.g., social data improves prediction accuracy for small businesses by 12%), while ensemble learning reduces overfitting through model diversity (AUC gains of 8%-15%). Future research must address data privacy, model interpretability, and dynamic adaptability to move credit risk management toward intelligent, real-time operation.
Keywords
Multi-Source Heterogeneous Data; Ensemble Learning; Credit Default Risk; Early Warning Mechanism; Data Fusion
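The abstract's claim that ensemble learning curbs overfitting through model diversity can be illustrated with a minimal stacking sketch. The dataset, features, and model choices below are illustrative stand-ins (synthetic data in place of fused multi-source credit features), not the paper's actual experimental setup; the AUC comparison simply shows how such a gain would be measured.

```python
# Sketch: combine diverse base learners via stacking and compare test AUC
# against a single baseline model on an imbalanced binary "default" task.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for fused multi-source credit features;
# class weights mimic the rarity of default events.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Baseline: a single linear scorecard-style model.
single = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Stacked ensemble: two diverse tree-based learners, blended by a meta-learner.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
).fit(X_tr, y_tr)

auc_single = roc_auc_score(y_te, single.predict_proba(X_te)[:, 1])
auc_stack = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print(f"single-model AUC: {auc_single:.3f}, stacked AUC: {auc_stack:.3f}")
```

On real credit data, the reported 8%-15% AUC gain would additionally depend on how well the base learners' errors decorrelate across the heterogeneous data sources.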