Article | April 2025

While interest in synthetic data for medical AI keeps growing among academic researchers and medical AI innovators, one question keeps industry leaders hesitant: How does the FDA react to the use of synthetic data in medical device submissions?

Over the last few weeks, we did a deep dive into the FDA's current sentiment regarding the technology. This article provides an overview of statements and materials published by the agency to build a clearer picture of that sentiment, as well as of the steps in data generation and, even more importantly, data evaluation needed to meet the FDA's demands.

The FDA believes in the potential of synthetic data but raises concrete concerns

Various statements by the agency indicate a strong push to tackle the remaining shortcomings of AI solutions through measures in model training and evaluation. In doing so, it explicitly acknowledges challenges around data sourcing, curation, and annotation.

The Artificial Intelligence (AI) Program in the FDA's Center for Devices and Radiological Health (CDRH) includes regulatory science research whose goal is to study the possibilities and limitations of supplementing medical patient datasets with synthetic data. In a research program on “Addressing the Limitations of Medical Data in AI”, the administration states the following:

“Synthetic (also known as in silico) data may allow for obtaining labeled examples more safely and effectively as opposed to collecting real patient data.” - fda.gov

While many question whether the FDA will ever perceive synthetic data as being as good as real data, this statement confirms our belief that in some cases, synthetic data can even be the better alternative to real-world data.

In a Grand Rounds workshop in November 2024, FDA researcher Elena Sitzikova elaborated on the potential of synthetic data and the FDA's remaining concerns. Before diving into the concerns, it is necessary to outline the FDA's understanding of synthetic data. The regulatory body defines two categories.

The first category comprises knowledge-based methods, which generate synthetic medical data deterministically from predefined rules, domain expertise, and mathematical models. The second comprises generative methods based on machine learning (ML) models that learn patterns from real-world data and mimic their characteristics. A toy sketch of the knowledge-based approach follows below.
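To make the distinction concrete, here is a minimal sketch of a knowledge-based generator: a “lesion” with a rule-defined location, size, and intensity is rendered into a background volume. All shapes, values, and names here are illustrative assumptions, not an actual simulation pipeline.

```python
import numpy as np

def knowledge_based_lesion(shape=(64, 64, 64), center=(32, 32, 32),
                           radius=6.0, intensity=0.8):
    """Toy knowledge-based generator: a spherical 'lesion' is placed into a
    noisy background volume according to fixed, human-defined rules."""
    rng = np.random.default_rng(0)
    volume = rng.normal(0.2, 0.05, size=shape)   # placeholder background tissue
    zz, yy, xx = np.indices(shape)
    dist = np.sqrt((zz - center[0]) ** 2 + (yy - center[1]) ** 2 + (xx - center[2]) ** 2)
    volume[dist <= radius] = intensity           # deterministic, rule-based insertion
    return volume

vol = knowledge_based_lesion()
print(vol.shape, round(float(vol.max()), 2))
```

Everything about such a generator is explainable by construction, which is exactly the trade-off the FDA points to next.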

While knowledge-based simulations are highly explainable, the FDA states that, particularly in imaging, they often produce unrealistic outputs and are difficult to scale. ML-based generative models, on the other hand, can produce highly realistic outputs at scale but are in return difficult to control. Particularly for the latter, the FDA raises a number of concerns.

To mitigate these risks, a sound design of the generative mechanism and a comprehensive evaluation of the resulting synthetic data are essential. In the following, we focus in particular on the latter.

Research publications from the agency provide guidance on data evaluation

So far, the FDA has not published any official guidance or good-practice documents for the generation, evaluation, or use of synthetic data in the context of developing AI-based medical devices. However, publications from an FDA research group provide a good overview of the agency's perspective on evaluation.

In particular, a paper published in June 2024 outlines a “Scorecard for Synthetic Medical Data Evaluation and Reporting”. Consisting of qualitative and quantitative measures, the scorecard aims for a holistic evaluation that considers medically relevant criteria and constraints.

The core of this framework is the 7Cs of synthetic medical data evaluation: seven quantifiable measures to assess quality, utility, and safety.

Figure 1: 7Cs from the FDA Scorecard for Synthetic Medical Data Evaluation

Unfortunately, no publication gives deeper insight into preferred metrics for the seven criteria. At RYVER.AI, we have long focused on developing a comprehensive evaluation framework for the synthetic data we generate, built on a combination of three pillars:

Figure 2: Synthetic data evaluation methods

Expert Assessments

We run two types of expert-based assessments. The first are Turing tests, which check for basic anatomical correctness and major hallucinations: radiologists are shown a mixed sample of synthetic and real images and are asked to describe what they see and to identify the AI-generated cases.
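As a minimal illustration of how such a test can be scored, the sketch below checks whether readers identify synthetic cases better than chance. The reader responses are hypothetical, and the 50% chance level assumes a balanced real/synthetic sample.

```python
# Sketch: scoring a visual Turing test (hypothetical data).
from scipy.stats import binomtest

# 1 = reader called the case "AI-generated", 0 = "real"
reader_calls = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
# 1 = case actually synthetic, 0 = real
ground_truth = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

n_correct = sum(c == t for c, t in zip(reader_calls, ground_truth))
accuracy = n_correct / len(ground_truth)

# If readers cannot beat chance, the synthetic images are effectively
# indistinguishable from real ones for this task.
result = binomtest(n_correct, n=len(ground_truth), p=0.5, alternative="greater")
print(f"identification accuracy: {accuracy:.2f}, p-value vs. chance: {result.pvalue:.3f}")
```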

The second assessment evaluates annotation accuracy. Radiologists are presented with synthetic cases and asked to annotate a specific disease. Their annotations are then compared to the annotations generated alongside the synthetic images. To define success, we typically set a benchmark derived from SOTA research datasets.
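For segmentation-style labels, one common way to quantify this agreement is a Dice overlap between the expert mask and the generated mask. The numpy sketch below assumes binary masks and is not tied to any specific benchmark.

```python
import numpy as np

def dice_score(expert_mask: np.ndarray, generated_mask: np.ndarray) -> float:
    """Dice overlap between a radiologist's annotation and the mask
    produced alongside the synthetic image (binary masks assumed)."""
    expert = expert_mask.astype(bool)
    generated = generated_mask.astype(bool)
    intersection = np.logical_and(expert, generated).sum()
    denom = expert.sum() + generated.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

# Hypothetical example with two small masks
a = np.zeros((4, 4)); a[1:3, 1:3] = 1
b = np.zeros((4, 4)); b[1:3, 1:4] = 1
print(f"Dice: {dice_score(a, b):.3f}")  # compared against the SOTA-derived benchmark
```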

While such expert assessments give a good indication of general anatomical and clinical correctness, as well as the absence of major hallucinations, they cannot prove data utility for downstream training of AI models. Moreover, for a practical implementation it only makes sense to run expert-based evaluations on a subsample of the generated data.

Embeddings & Features

In a next step, computational evaluation of the synthetic data provides a deeper understanding of image fidelity, data diversity, and the representation of relevant characteristics.

Overall Image Fidelity - The standard metric in research to measure image fidelity is the Fréchet Inception Distance (FID). It embeds real and synthetic images with a pretrained network and measures the distance between the two embedding distributions.
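For reference, FID fits a Gaussian to each set of embeddings and computes the Fréchet distance between them. A minimal numpy/scipy sketch, assuming the Inception embeddings have already been extracted (the random arrays below are placeholders):

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(real_emb: np.ndarray, synth_emb: np.ndarray) -> float:
    """FID between two sets of (n_samples, n_features) embeddings."""
    mu_r, mu_s = real_emb.mean(axis=0), synth_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_s = np.cov(synth_emb, rowvar=False)
    # Matrix square root of the covariance product; tiny imaginary parts
    # from numerical error are discarded.
    covmean = linalg.sqrtm(cov_r @ cov_s)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(cov_r + cov_s - 2.0 * covmean))

# Placeholder embeddings; in practice these come from an InceptionV3 backbone
rng = np.random.default_rng(0)
real, synth = rng.normal(size=(256, 64)), rng.normal(0.1, 1.0, size=(256, 64))
print(f"FID: {frechet_inception_distance(real, synth):.2f}")
```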

Visual Diversity - Applying statistical tools such as the Structural Similarity Index (SSIM), we assess the visual diversity among synthetic images. This ensures that the generated data does not simply all look the same.
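One simple diversity check is the average pairwise SSIM across a sample of synthetic images: values close to 1 flag near-duplicates. A sketch using scikit-image, with random images standing in for generated scans:

```python
import itertools
import numpy as np
from skimage.metrics import structural_similarity as ssim

def mean_pairwise_ssim(images: list) -> float:
    """Average SSIM over all image pairs; high values indicate low diversity."""
    scores = [
        ssim(a, b, data_range=a.max() - a.min())
        for a, b in itertools.combinations(images, 2)
    ]
    return float(np.mean(scores))

rng = np.random.default_rng(0)
synthetic_sample = [rng.random((64, 64)) for _ in range(5)]  # placeholder images
print(f"mean pairwise SSIM: {mean_pairwise_ssim(synthetic_sample):.3f}")
```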

These two approaches give a first indication that the data generation produces realistic and visually diverse images. However, research shows they often fail to predict actual data utility. Nor do they provide a deeper understanding of whether the synthetic data accurately represents the characteristics of specific pathologies and subgroups. For this purpose, we implement a third measure.

Radiomic Features - Academic research has shown that radiomic features can act as predictors of specific pathologies and abnormalities. For example, different tumour subtypes are characterised by specific shapes and intensity distributions. By extracting relevant radiomic features from real and synthetic data, we can analyse how well the synthetic data represents real-world characteristics for specific subgroups. This can also include analysing how well the data generation reproduces the discriminative features between specific subgroups.
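As a sketch of this idea, one can extract a single feature per case (for example with the open-source pyradiomics package) and compare the real and synthetic distributions with a two-sample test. The feature values and the choice of tumour sphericity below are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical per-case values of one radiomic feature (e.g. tumour
# sphericity), extracted beforehand, for example with pyradiomics.
rng = np.random.default_rng(0)
real_feature = rng.normal(loc=0.72, scale=0.05, size=200)
synth_feature = rng.normal(loc=0.70, scale=0.06, size=200)

# A two-sample Kolmogorov-Smirnov test checks whether the synthetic data
# reproduces the real-world distribution of this feature.
stat, p_value = ks_2samp(real_feature, synth_feature)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.3f}")
```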

Downstream Tasks

The evaluation of data utility in downstream tasks is the litmus test for using synthetic data in AI training. For a strong downstream assessment, we need to model the target use case of the synthetic data as accurately as possible.

Typically, synthetic data is added to the training data of a machine learning model for a task like detection, classification, or segmentation. At RYVER.AI, we typically establish three different performance levels that need to be compared.

To set a sound baseline and a strong benchmark, we typically reproduce SOTA models from academic research. Model performance is assessed in line with good machine learning practice on hold-out validation and test datasets. In the case of synthetic data evaluation, it is essential to ensure that none of this hold-out data has leaked into the training data of the generative model.
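A minimal sketch of such a downstream comparison, with a scikit-learn classifier standing in for the actual detection, classification, or segmentation model; all feature arrays and labels are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Placeholder features/labels; in practice these are imaging inputs and findings.
X_real, y_real = rng.normal(size=(300, 16)), rng.integers(0, 2, 300)
X_synth, y_synth = rng.normal(size=(300, 16)), rng.integers(0, 2, 300)
# Hold-out test set: must not overlap with the generative model's training data.
X_test, y_test = rng.normal(size=(100, 16)), rng.integers(0, 2, 100)

def fit_and_score(X, y):
    model = LogisticRegression(max_iter=1000).fit(X, y)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

auc_real = fit_and_score(X_real, y_real)                     # baseline: real data only
auc_aug = fit_and_score(np.vstack([X_real, X_synth]),
                        np.concatenate([y_real, y_synth]))   # real + synthetic
print(f"AUC real-only: {auc_real:.3f}, real+synthetic: {auc_aug:.3f}")
```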

At RYVER.AI, we typically conduct a combination of different downstream tests that fall into three main buckets:

Figure 3: Categories of downstream tests

Depending on the use case of the synthetic data, we can modify these assessments. For each test, it is essential to analyse performance comprehensively by looking at various metrics and different subgroups. For classification tasks, for example, this means assessing the impact on sensitivity and specificity rather than only checking accuracy.
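For instance, the sketch below derives sensitivity and specificity from a confusion matrix and repeats the computation per subgroup; the labels, predictions, and subgroup variable are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def sens_spec(y_true, y_pred):
    """Sensitivity and specificity from a binary confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)
y_pred = np.where(rng.random(200) < 0.85, y_true, 1 - y_true)  # placeholder predictions
subgroup = rng.choice(["male", "female"], 200)                 # e.g. patient sex

for group in np.unique(subgroup):
    idx = subgroup == group
    sens, spec = sens_spec(y_true[idx], y_pred[idx])
    print(f"{group}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```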


Meeting FDA demands for Synthetic Medical Data Evaluation

Taking the FDA's scorecard for synthetic medical data evaluation as the primary guidance on the agency's requirements, a combination of the outlined measures can cover all of the required analyses.

Defining a comprehensive yet practical setup has to happen on a case-by-case basis: different pathological patterns and downstream tasks require a different focus in the evaluation. The following provides an overview of how the concrete measures address the 7C framework:

Figure 4: Addressing the 7Cs with concrete data evaluation methods

Two of the 7Cs, namely Compliance and Comprehension, do not directly concern the evaluation of data quality and utility. Hence, the outlined evaluation metrics are not well suited to address these requirements. However, there are a number of measures that can be implemented.

Compliance is mainly concerned with using data in line with patient privacy protection. This can be ensured with various measures that assess the risk of reconstructing private patient information.
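One widely used check of this kind, sketched below, flags synthetic images whose embeddings lie suspiciously close to a real training case, which would hint at memorisation. The embeddings are placeholders, and the distance threshold is an assumption that has to be calibrated per dataset.

```python
import numpy as np

def flag_memorised(synth_emb: np.ndarray, train_emb: np.ndarray, threshold: float):
    """Flag synthetic samples whose nearest real training embedding is
    closer than `threshold` (possible memorisation / privacy risk)."""
    # Pairwise Euclidean distances between synthetic and training embeddings
    dists = np.linalg.norm(synth_emb[:, None, :] - train_emb[None, :, :], axis=-1)
    nearest = dists.min(axis=1)
    return np.where(nearest < threshold)[0], nearest

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 32))   # embeddings of real training images
synth = rng.normal(size=(100, 32))
synth[0] = train[42] + 1e-3          # planted near-copy for illustration
flagged, nearest = flag_memorised(synth, train, threshold=0.5)
print(f"flagged samples: {flagged}")  # should contain index 0
```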

With Comprehension, the FDA means an understanding of the generative mechanism itself. For deep learning models such as the diffusion models used at RYVER.AI, explainability of model results is an ongoing research topic. To reduce the uncertainty of these models, we implement rule-based model guidance using anatomical segmentation masks.
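The specifics of that guidance are beyond the scope of this article. Purely as a conceptual sketch, the snippet below shows one common way a segmentation mask can constrain diffusion sampling: inpainting-style anchoring, where the known anatomy is re-imposed (at the current noise level) outside the region being synthesised at every denoising step. The dummy denoiser and all values are placeholders, not our production model.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x, t):
    """Placeholder for one reverse diffusion step of a trained model."""
    return x * 0.98  # dummy update; a real model predicts and removes noise

anatomy = rng.random((64, 64))   # reference image carrying known anatomy
roi = np.zeros((64, 64), bool)
roi[20:40, 20:40] = True         # region the model is free to synthesise

x = rng.normal(size=(64, 64))    # start from pure noise
n_steps = 50
for t in reversed(range(n_steps)):
    x = denoise_step(x, t)
    noise_level = t / n_steps
    # Rule-based guidance: outside the ROI, re-impose the known anatomy
    # (noised to the current step) so generation stays anatomically plausible.
    anchored = anatomy + noise_level * rng.normal(size=anatomy.shape)
    x = np.where(roi, x, anchored)
```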

A deep dive on those two Cs will be covered in an upcoming article. Subscribe here so you don't miss it!

Summary

In summary, the FDA acknowledges the growing role of synthetic data in medical AI and takes a cautiously supportive stance. While synthetic data holds promise for expanding datasets and reducing biases, the agency emphasises that its use must be carefully justified and rigorously validated. The FDA encourages developers to demonstrate that synthetic data faithfully represents real-world variability and maintains clinical relevance. Their guidance highlights the need for transparent documentation, clear labelling, and thoughtful integration with real data.

Evaluation should be context-dependent, tailored to the intended use, and ideally benchmarked against real-world performance. The proposed framework builds on two key pillars, comparability and complementarity: ensuring synthetic data behaves like real data when it matters, and adds value where real data falls short. Overall, the FDA's perspective promotes innovation, but not at the expense of safety, performance, or trust.