Integrating Text and Network Analysis to Forecast Default Risk in e-Commerce

B.D. Bernhardt¹, C. Marciano¹ and M.R. Guarracino¹

¹ Department of Economics and Law, University of Cassino and Southern Lazio, Cassino, Italy [briandaniel.bernhardt@unicas.it] [chiara.marciaono@unicas.it] [mario.guarracino@unicas.it]

Keywords: Natural Language Processing – Statistical Network Analysis – Supervised Learning

1 Abstract

The Italian e-commerce sector, valued at €80.6 billion according to Statista [2024], is a major driver of economic growth. However, Mangiaracina et al. [2009] show that this expansion is accompanied by heightened default risk, particularly for Small and Medium Enterprises (SMEs), owing to structural challenges such as poor integration of logistics, technology, and data management. The European Banking Authority (EBA [2024]) reports that fraudulent transactions, which account for approximately 3% of e-commerce activity, intensify these risks. This study investigates how alternative data, including contextual variables and text-driven data enrichment, can improve the prediction of default probability for companies in this sector, a topic that has not been explored in depth in the literature. In particular, Natural Language Processing (NLP) is used to analyze the stated corporate objectives of companies: the FLAN-T5 transformer model (Chung et al. [2024]) extracts descriptions of goods and services, and SentenceTransformer embeddings (Reimers and Gurevych [2019]) are used to compute semantic similarities. These embeddings form the basis of a weighted similarity network, on which node metrics such as degree centrality, closeness centrality, and clustering coefficient are calculated. This network analysis enriches the dataset with structural information about inter-company relationships.
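The step from embeddings to node metrics can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the toy vectors stand in for sentence embeddings, the edge threshold of 0.5 and all function names are assumptions made for illustration, and the three metrics are computed in their unweighted form, since the abstract does not state whether weighted variants were used.

```python
import math
from collections import deque
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_similarity_graph(embeddings, threshold):
    """Adjacency dict {node: {neighbour: weight}}; an edge is kept when the
    cosine similarity reaches the (assumed) threshold."""
    graph = {name: {} for name in embeddings}
    for a, b in combinations(embeddings, 2):
        weight = cosine(embeddings[a], embeddings[b])
        if weight >= threshold:
            graph[a][b] = weight
            graph[b][a] = weight
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes that each node is linked to."""
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}


def closeness_centrality(graph):
    """Hop-based closeness, scaled by the share of nodes that are reachable."""
    n = len(graph)
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        reachable = len(dist) - 1
        total = sum(dist.values())
        result[source] = (reachable / total) * (reachable / (n - 1)) if total else 0.0
    return result


def clustering_coefficient(graph):
    """Share of a node's neighbour pairs that are themselves linked."""
    result = {}
    for v, nbrs in graph.items():
        k = len(nbrs)
        if k < 2:
            result[v] = 0.0
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        result[v] = links / (k * (k - 1) / 2)
    return result


# Toy vectors standing in for sentence embeddings (illustrative values only).
embeddings = {
    "A": [1.0, 0.0, 0.0],
    "B": [0.9, 0.1, 0.0],
    "C": [0.8, 0.2, 0.1],
    "D": [0.0, 0.0, 1.0],
}
graph = build_similarity_graph(embeddings, threshold=0.5)
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
clustering = clustering_coefficient(graph)
```

In this toy network, companies A, B and C describe similar goods and form a fully connected triangle, while D remains isolated. The resulting per-node values are the kind of structural features that could be appended to each company's record.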

The study constructs two datasets: a benchmark dataset containing the financial indicators of the SMEs, and a second dataset that also includes the alternative variables and node metrics. The classification models applied are XGBoost, Gradient Boosting, Random Forest, MARS (Multivariate Adaptive Regression Splines), FDA (Flexible Discriminant Analysis), Decision Tree, Logistic Regression, Elastic Net, and LDA (Linear Discriminant Analysis). The models are evaluated using 10-fold cross-validation with metrics such as AUROC, balanced accuracy, sensitivity, specificity, F1 score, and Matthews Correlation Coefficient (MCC).
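For readers less familiar with these metrics, the sketch below shows how they can be computed on one held-out fold, using only the standard library. The labels, scores, the 0.5 decision threshold, the round-robin fold assignment and all function names are illustrative assumptions, not the study's data or code; label 1 is taken to mean "default".

```python
import math


def kfold_indices(n_samples, k):
    """Assign sample indices to k folds in round-robin order (no shuffling)."""
    return [list(range(fold, n_samples, k)) for fold in range(k)]


def classification_metrics(y_true, y_pred):
    """Threshold-based metrics derived from the binary confusion matrix."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": f1,
        "mcc": mcc,
    }


def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative
    (ties count one half); assumes both classes are present."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical held-out fold: true labels and predicted default probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = auroc(y_true, scores)
folds = kfold_indices(len(y_true), k=4)
```

In a full 10-fold run, each fold would serve once as the held-out set and the per-fold metrics would be averaged. With imbalanced default data, a stratified split that preserves the class ratio in every fold is usually preferable to the plain round-robin assignment used here.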

As Table 1 shows, integrating alternative data and network metrics significantly improves model performance, particularly for non-linear models such as XGBoost, MARS, and FDA. With the inclusion of contextual data, XGBoost achieved a notable increase in AUROC (from 0.906 to 0.930) and in MCC (from 0.613 to 0.660). MARS and FDA also showed statistically significant improvements across multiple performance metrics. In contrast, linear models (e.g., Logistic Regression and Elastic Net) showed limited gains, suggesting that they struggle to capture the non-linear interactions between variables that the enriched data introduce.

Table 1: Model Performance Comparison across Different Datasets.

This study provides a novel contribution to the analysis of default risk in the Italian e-commerce sector by integrating NLP-driven text enrichment and network analysis with traditional financial indicators. The findings provide policymakers and supply chain stakeholders with valuable tools to forecast and reduce financial risks, while also helping them identify the key factors that may contribute to the occurrence of default. The methodology presented is scalable and adaptable to other industries, providing a framework for future research on the role of alternative data in predictive modeling.

References

H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Yun, T. Vu, Y. Chen, E. Adams, et al. (2024) Scaling instruction-finetuned language models. Journal of Machine Learning Research 25, pp. 1–53.
EBA (2024) 2024 report on payment fraud. Technical report, EBA and ECB.
R. Mangiaracina, G. Brugnoli, and A. Perego (2009) The ecommerce customer journey: a model to assess and compare the user experience of the ecommerce websites. Journal of Internet Banking and Commerce 14.
N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3980–3990.
Statista (2024) E-commerce worldwide - outlook. Statista.