Enhancing bankruptcy prediction efficiency using synthetic data

Elizaveta V. Lashkevich

doi:10.17323/2587-814X.2025.3.22.47

Elizaveta V. Lashkevich, Graduate School of Business, National Research University Higher School of Economics, Moscow, Russia https://orcid.org/0000-0002-3241-2291

Keywords: synthetic data, insolvency prediction, class imbalance

Abstract

Predicting corporate insolvency is critical for investors, creditors, and regulators. However, access to high-quality, balanced data for model training is often limited by confidentiality concerns, scarce information, or the specifics of financial reporting. This paper examines whether synthetic data generation methods can augment minority-class instances in imbalanced datasets and thereby improve insolvency prediction models. It compares the performance of several imbalance-reduction methods, including classical ones such as the Synthetic Minority Over-sampling Technique (SMOTE), with newer approaches to synthetic data generation based on Bayesian networks, marginal distributions, random forests, and generative adversarial networks. The methods are evaluated by their ability to improve classification metrics such as the Gini coefficient, the geometric mean, and the false positive and false negative rates. The experimental sample consists of real financial indicators of Finnish small and medium-sized industrial enterprises for 2021. The results contribute to the growing body of knowledge on synthetic data generation and its use for addressing imbalanced datasets and improving predictive modeling in finance, and they offer insight into the effectiveness of different synthetic data generation methods for resampling imbalanced datasets and increasing the accuracy and reliability of corporate insolvency prediction models.
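The evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes binary labels where 1 marks an insolvent firm (the minority class) and takes the Gini coefficient to be 2·AUC − 1, a common convention in credit scoring; the paper's own definitions may differ. The function names and toy data are hypothetical and do not come from the paper.

```python
from math import sqrt


def imbalance_metrics(y_true, y_pred):
    """False positive rate, false negative rate and geometric mean.

    Assumes labels are 0/1 and that both classes occur in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fpr = fp / (fp + tn)  # share of solvent firms flagged as insolvent
    fnr = fn / (fn + tp)  # share of insolvent firms that were missed
    # Geometric mean of sensitivity (1 - FNR) and specificity (1 - FPR).
    g_mean = sqrt((1 - fnr) * (1 - fpr))
    return {"fpr": fpr, "fnr": fnr, "g_mean": g_mean}


def gini(y_true, scores):
    """Gini coefficient as 2 * AUC - 1 (assumed definition).

    AUC is computed by pairwise comparison: the fraction of
    (positive, negative) pairs in which the positive example gets
    the higher score, with ties counted as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return 2 * auc - 1


# Toy data: four solvent firms (0) and two insolvent firms (1).
labels = [0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 1, 1, 0]
scores = [0.1, 0.2, 0.3, 0.6, 0.8, 0.4]

print(imbalance_metrics(labels, predicted))
print(gini(labels, scores))
```

The pairwise AUC is quadratic in the sample size, which is fine for illustration; a rank-based computation would be preferable on a full dataset.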
