Perspective

The Urgency of Standards for Synthetic Data in the Era of Agentic AI

Marcelle Momha / Apr 15, 2026

The views expressed are solely those of the author and do not reflect the positions of their employers.

Lone Thomasky / Bits&Bäume / CC-BY 4.0

For artificial intelligence, data has always been the oxygen fueling the fire: our books, our articles, our photographs, and our digital footprints. But that supply is becoming constrained. As large language models (LLMs) approach the limit of available human data on the internet, developers are turning to a new fuel: synthetic data, artificially generated information designed to mimic or supplement real-world data.

Gartner predicts that by 2030, synthetic data will surpass real-world data in AI training. Another study predicts that between 2026 and 2032, LLMs will exhaust the available stock of human-generated text on the internet. The synthetic data generation (SDG) market, estimated at $288.5 million in 2022, is now valued at roughly $710 million and is projected to reach $2.3 billion by 2030.

Synthetic data is often sold as a solution when real-world data is scarce, sensitive, or restricted. Indeed, it can bring considerable benefits. However, it also creates what I describe as a “synthetic mirror,” a manufactured reflection of reality that can be brightened, distorted, or conveniently edited. Without standards and targeted regulations governing how synthetic data is generated, documented, deployed, and evaluated, it can become a risk multiplier, particularly as agentic AI systems spread.

Current data and AI frameworks are largely designed for static models or human-in-the-loop systems. Autonomous AI agents consume data from diverse sources, infer context, call external tools, and create derivative artifacts in real time. Erroneous synthetic data can corrupt an agent's entire planning and reasoning chain, producing hallucinations that manifest as harmful actions rather than merely incorrect text. Systems trained on fabricated representations of reality will make opaque decisions, weaken accountability, and amplify biases.

Promises and pitfalls of synthetic data

The synthetic data market remains fragmented, with many smaller firms recently acquired by larger ones. Companies like K2view, MOSTLY AI, Tonic.ai and Hazy (now Data Maker following acquisition by SAS) lead in production-scale generation. Big tech players like NVIDIA (which acquired Gretel.ai in 2025 for about $320 million), Amazon, Google, Microsoft, and Meta generate synthetic data at massive scale (trillions of tokens and images for their own models) and offer related services and APIs. In 2024, under the Biden Administration, the US Department of Homeland Security (DHS) awarded contracts to Betterdata, DataCebo, MOSTLY AI, and Rockfish Data to “develop synthetic data capabilities that model and replicate the shape and patterns of real data, while safeguarding privacy and mitigating security harms.”

Generative and agentic AI systems not only consume data, but also generate the synthetic data used to train future models. This creates a feedback loop. Errors in a synthetic dataset do not just inform one decision, but can propagate across an entire system, at scale, without human intervention, thus compounding errors recursively and invisibly. These vulnerabilities are uniquely amplified in agentic systems precisely because no human-in-the-loop intervenes to detect the drift or degradation.
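The recursive degradation described above can be illustrated with a toy simulation. The sketch below is purely illustrative and does not model any vendor's pipeline: it assumes a flawed generator that reproduces a fitted distribution but never emits rare events beyond two standard deviations, a stand-in for generators that under-sample the tails. When each "generation" of the model is fit only on the previous generation's synthetic output, the measured spread of the data collapses, with no single step looking obviously wrong.

```python
import random
import statistics

def fit(samples):
    # Estimate the mean and standard deviation of the data.
    return statistics.fmean(samples), statistics.stdev(samples)

def generate(mean, std, n, rng):
    # A flawed generator (illustrative assumption): it mimics the fitted
    # distribution but never produces rare tail events beyond two
    # standard deviations from the mean.
    out = []
    while len(out) < n:
        x = rng.gauss(mean, std)
        if abs(x - mean) <= 2 * std:
            out.append(x)
    return out

rng = random.Random(42)
real = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # stands in for real-world data

mean, std = fit(real)
history = [std]
for _ in range(10):  # each generation trains only on the previous one's output
    mean, std = fit(generate(mean, std, 2000, rng))
    history.append(std)

print(f"spread of the real data:   {history[0]:.2f}")
print(f"spread after 10 rounds:    {history[-1]:.2f}")
```

After ten rounds the estimated spread has shrunk to a fraction of the original, even though every individual generation step looked like a faithful copy of the last. This is the kind of drift that, in an agentic pipeline, no human is positioned to notice.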

The most cited reasons for the use of synthetic data include the growing demand for scalable, unbiased datasets and regulatory pressure for privacy-respecting AI models and data sharing. The European Union (EU) AI Act requires organizations to explore synthetic alternatives before processing personal data. However, the Act neither anticipates the implications of using synthetic data at scale nor contains provisions on autonomous systems making decisions, without human intervention, on the basis of artificial data.

Theoretically, synthetic data can protect privacy, augment scarce datasets, and even reduce bias by artificially increasing representation of underrepresented groups. In fields where collecting real data is dangerous, expensive, or impossible, synthetic alternatives seem invaluable. However, the reliability of systems or AI agents built on synthetic data depends on how accurately that data reflects real-world conditions. Without agreed-upon standards to ensure this outcome is met, these datasets add a layer of uncertainty to already obscured, algorithmic black boxes.

The danger lies in opacity and the lack of traceability. When an AI agent makes a consequential decision, such as denying a loan, flagging a medical condition, or influencing a legal outcome, tracing that decision through layers of synthetic data to determine whether it reflects a legitimate pattern or fabricated artifacts is currently impossible.

Synthetic data is also touted as a way to address bias by increasing representation of underrepresented groups. Yet a generative model is not inherently neutral: design choices, training inputs, and optimization processes can encode faulty assumptions and societal biases. Oversimplified representations, excessive data manipulation, and poor modeling can introduce new biases or mask existing ones. Downstream analyses risk ignoring real patterns of discrimination, providing a false sense of fairness without addressing underlying systemic disparities and structural issues.

Without standards for quality, reliability, and fairness, it is difficult to evaluate whether synthetic data is appropriate for specific uses. It also complicates accountability. Traditional liability frameworks assume clear causal chains and identifiable responsible parties, conditions that do not align with multi-stage synthetic data pipelines. Documentation tends to be inconsistent and secondary, especially in rapid and agile development environments. The regulatory vacuum reduces incentives to adopt rigorous, safe, and ethical practices, further eroding user trust.

Policy landscape for synthetic data

The governance frameworks and standards for synthetic data in the training phase of LLMs remain underdeveloped. The central question is not whether synthetic data is useful. It is whether we can trust the mirror. As the era of Artificial General Intelligence (AGI) approaches, most data protection laws worldwide do not address synthetic data, while some do so indirectly through concepts such as anonymization, pseudonymization, or privacy-enhancing technologies (PETs). Only a few frameworks are sketching early guidance.

The EU AI Act is among the few frameworks that explicitly reference “synthetic” data. Article 10, on data and data governance for high-risk AI systems, establishes quality requirements for training, validation, and test data. Paragraph 5, however, encourages the use of synthetic or anonymized data in high-risk contexts except where bias cannot be effectively detected and corrected by processing other data, which suggests that legislators and policymakers did not fully consider the downsides of synthetic or artificially generated data. The General Data Protection Regulation (GDPR) likewise does not appear to have accounted for synthetic data. Article 4(5) defines “pseudonymisation” as the processing of personal data such that it can no longer be attributed to an identified or identifiable natural person without additional information, a definition that could, in some cases, apply to synthetic data, but the regulation goes no further.

Synthetic data is not inherently GDPR compliant: under Recital 26 of the GDPR, it can still be considered personal data if there is a reasonable risk of re-identification through patterns, inferences, or links with other datasets. It then becomes pseudonymous rather than anonymous, which requires a legal basis and safeguards. Furthermore, when artificial approximations fail to reflect the real-world variability and composition of the underlying data, the data's integrity is compromised.

In the United States, few state or federal laws, executive orders, or policy frameworks address synthetic data. One exception is California’s AB 2013, the Generative AI Training Data Transparency Act, which took effect on January 1, 2026. Section 3111 requires developers of generative AI systems and services to publicly disclose the use of synthetic data.

In the United Kingdom, the Office for National Statistics (ONS) has outlined broad considerations and requirements for the production and use of synthetic data for statistical research purposes and defined the roles and responsibilities of enforcement agencies. In Singapore, the Personal Data Protection Commission (PDPC) has proposed a guide on the generation of synthetic data, which it describes as a PET.

Standards, adaptive policies and legal frameworks

Standards development can play a central role in guiding innovation and technological advancement by fostering trust. There is an urgent need for updated policy tools and legal framework adaptations. The most practical approach to overseeing synthetic mirrors involves targeted amendments to existing AI and data protection frameworks: synthetic data should be recognized as a distinct regulatory category with unique characteristics, while regulators leverage established principles and enforcement mechanisms.

Clear benchmarks are needed to assess the accuracy and utility of synthetic data, as well as to standardize privacy protection metrics. The standards should require documentation of how synthetic datasets are generated, used in training, and deployed, including:

  • Identification of the generating entity to the end user, as well as statistical or generative models used
  • Known limitations, biases, or privacy considerations of the source data, with preprocessing steps applied
  • An explicit statement of assumptions built into the generator or simulation environment
  • Intended use cases for the synthetic dataset
  • Known limitations or areas where the data are not suitable
  • Results of quality, fidelity, utility, and bias assessments performed
  • A description of privacy-preserving techniques applied, such as differential privacy or others, and associated safeguards and limitations
  • A clear version control for datasets, as they are potentially updated or regenerated.

Comparable to a nutritional label for AI datasets, this type of documentation would improve transparency. Beyond simple quantifiable metrics, comprehensive evaluation frameworks are also needed to evaluate synthetic data across multiple dimensions, including ethical and societal impacts.
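To make the checklist concrete, such a “nutritional label” could be captured as a small machine-readable record alongside each dataset release. The sketch below is one possible shape under the assumptions of this article; all field names and example values are illustrative and not drawn from any existing standard or real product.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class SyntheticDataCard:
    """Illustrative 'nutritional label' for a synthetic dataset.

    Field names are hypothetical; they mirror the documentation
    items listed above, not any published schema.
    """
    generating_entity: str                 # who produced the dataset
    generator_model: str                   # statistical or generative model used
    source_data_notes: str                 # known limits/biases of the source data
    preprocessing_steps: list[str]         # steps applied before generation
    assumptions: list[str]                 # assumptions built into the generator
    intended_uses: list[str]               # what the dataset is meant for
    unsuitable_uses: list[str]             # where the data are not suitable
    evaluation_results: dict[str, float]   # quality/fidelity/utility/bias scores
    privacy_techniques: list[str]          # e.g., differential privacy, with caveats
    version: str                           # version control across regenerations

    def to_json(self) -> str:
        # Serialize the card so it can ship next to the dataset itself.
        return json.dumps(asdict(self), indent=2)

# Hypothetical example values for a fictional release.
card = SyntheticDataCard(
    generating_entity="Example Labs (hypothetical)",
    generator_model="tabular GAN, v3",
    source_data_notes="source data under-represents rural applicants",
    preprocessing_steps=["deduplication", "outlier removal"],
    assumptions=["income and age treated as independent"],
    intended_uses=["software testing"],
    unsuitable_uses=["credit-scoring model training"],
    evaluation_results={"fidelity": 0.91, "bias_audit": 0.74},
    privacy_techniques=["differential privacy (epsilon=3.0)"],
    version="2026-04-01.1",
)
print(card.to_json())
```

A downstream team, or an autonomous agent, could then refuse a dataset whose card lists its intended task under `unsuitable_uses`, turning the label from documentation into an enforceable check.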

As agentic AI systems become more widely used, the real barrier to the adoption of synthetic technologies is not regulation or oversight, but public distrust. Giving businesses and consumers the ability to understand the trade-offs between utility and realism is critical to preparing our societies for AGI.

Authors

Marcelle Momha
Marcelle Momha is an AI consultant at the World Bank and an AI research associate with the Data-Smart City Solutions/Bloomberg Center for Cities at Harvard University. A computer scientist and AI governance specialist, she evaluates AI systems and designs AI adoption and diffusion strategies for gov...
