Algorithm for Assessing the Quality of Medical Synthetic Data Based on the Multi-Criteria and PAC-Bayesian Model
Published 2026-03-31
Keywords
- Synthetic data
- medical artificial intelligence
- multi-criteria assessment
Copyright (c) 2026 Juraev Gulomjon Primovich, Jovlieva Dilnoz Mustofa kizi

This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
The effectiveness of medical artificial intelligence systems depends directly on the quality and representativeness of the training sample. Real clinical data, however, suffer from confidentiality restrictions, class imbalance, and heterogeneity. Synthetic data are a promising way to address these problems, but comprehensive assessment of their quality remains an open issue.
This work proposes an algorithm for assessing the quality of synthetic data based on a multi-criteria approach and PAC-Bayesian theory. The proposed model combines statistical proximity (KL divergence), inter-feature relationships (mutual information), predictive performance, and domain-specific constraints in a single integrated functional.
Experimental results showed that the synthetic data satisfactorily reflect the main features of the real distribution (KL = 0.6035) and that inter-feature relationships are preserved with high accuracy (MI = 0.0072). A model trained on the synthetic data achieved an accuracy of 80.4%, which supports its practical applicability. The domain matching criterion reached its maximum value (1.0).
The results show that the proposed approach balances the quality of synthetic data against model reliability.
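The four criteria named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the histogram estimators, the exp(-x) mapping to [0, 1], the equal weights, and all function names (`kl_divergence`, `mutual_information`, `domain_validity`, `quality_score`) are assumptions made for illustration only. The PAC-Bayesian term is omitted because the abstract does not specify its form.

```python
import numpy as np

def kl_divergence(real, synth, bins=10, eps=1e-9):
    """Histogram estimate of KL(real || synth) for one numeric feature."""
    lo = min(real.min(), synth.min())
    hi = max(real.max(), synth.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi))
    q, _ = np.histogram(synth, bins=bins, range=(lo, hi))
    # Smooth empty bins so the log ratio stays finite (assumption).
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def mutual_information(x, y, bins=10):
    """Histogram (plug-in) estimate of I(X; Y) in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of X, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, bins)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

def domain_validity(data, bounds):
    """Fraction of rows whose every column lies within its (low, high) bound."""
    low = np.array([b[0] for b in bounds])
    high = np.array([b[1] for b in bounds])
    ok = np.all((data >= low) & (data <= high), axis=1)
    return float(ok.mean())

def quality_score(kl, mi_gap, accuracy, validity,
                  weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of four criteria, each mapped to [0, 1] with 1 = best.

    The exp(-x) mapping and equal weights are illustrative assumptions.
    """
    terms = np.array([np.exp(-kl), np.exp(-mi_gap), accuracy, validity])
    return float(np.dot(weights, terms))

if __name__ == "__main__":
    # Toy data only; these are not the paper's data or results.
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 2))
    synth = rng.normal(size=(500, 2))
    kl = kl_divergence(real[:, 0], synth[:, 0])
    mi_gap = abs(mutual_information(real[:, 0], real[:, 1])
                 - mutual_information(synth[:, 0], synth[:, 1]))
    validity = domain_validity(synth, [(-10.0, 10.0), (-10.0, 10.0)])
    # accuracy would come from a model trained on synthetic and tested on
    # real data; 0.8 here is a placeholder value.
    print(quality_score(kl, mi_gap, accuracy=0.8, validity=validity))
```

In this sketch the mutual information criterion is treated as the absolute gap between the real and synthetic estimates, which is one possible reading of the reported MI value rather than a confirmed definition.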