How Big AI Developers are Skirting a Mandate for Training Data Transparency
Dick Blankvoort, Harshvardhan Pandit, Maximilian Gahntz / Mar 4, 2026
Primitive Accumulation by Daniela Zampieri / Better Images of AI / CC BY 4.0
There is a battle raging over AI training data. It is taking place in courts, in standardization bodies like the Internet Engineering Task Force (IETF), and in legislatures. Copyright holders are fighting for better compensation and better ways to express preferences on who may do what with their data; AI developers are fighting for more freedom to source and use data to continue scaling and gain a competitive edge. Meanwhile, others are fighting the erosion of the open web and digital commons, from projects like Creative Commons to Wikimedia.
However, too little attention is paid in this context to transparency around training data: what data do AI developers use, how is it used, and from where is it sourced? In fact, there used to be more transparency in this respect — until the “AI race” led developers to become more secretive in order to gain a competitive advantage and skirt liability.
A neglected provision in the European Union’s AI Act may prove to be the biggest break in securing more transparency from AI developers to date. It mandates developers of “general-purpose AI” (or foundation) models to publish a summary of the data they use to train their models in line with a template provided by the European Commission.
As some of us have argued before, this matters not only for publishers and other copyright holders, but also for privacy watchdogs and researchers. Without visibility into what data AI developers are using to train their models, rightsholders can’t verify whether AI companies respect their preferences; privacy watchdogs can’t assess whether AI might be trained on sensitive data; and researchers continue to struggle to gain a better understanding of the data fed to large language models.
Despite vigorous opposition to the new disclosure mandate and significant shortcomings in the template, it is, for now, our best chance to get a glimpse at what is fed to leading AI models like Google’s Gemini, Anthropic’s Claude, or OpenAI’s GPT models.
Open source developers are leading in transparency
In new peer-reviewed research supported by Mozilla and accepted for presentation at the 9th ACM Conference on Fairness, Accountability, and Transparency, we have developed a framework to assess AI companies’ training data summaries, building on a long line of research, standards, and best practices for quality management in software development. The framework can support both developers in compiling the summaries and the European Commission in assessing whether developers are doing so in good faith and with sufficient detail.
We also assessed the few public summaries that have already been published by developers, particularly from the open source AI community. The good news: of the five summaries we reviewed, only one (the summary for Microsoft’s Phi model) fails to secure a passing grade; all others, such as Hugging Face’s (for the SmolLM model) and Swiss AI’s (for Apertus), clearly meet the AI Act’s bar. The Apertus summary even scored straight A’s and provides a good model for other developers who want to fulfil their legal obligations under the AI Act and meet a high standard of transparency in general. Our findings also show that publishing such a summary is by no means an impossible task: if small teams of researchers and open source developers can clear this hurdle without being overburdened, well-funded AI labs should certainly be able to do so as well.
Top of the leaderboards, bottom in transparency
Perhaps most notably, and certainly most worryingly, we found that there still aren’t any published summaries to assess from leading AI developers (as corroborated by recent reporting). Despite a legal obligation to publish a summary in line with the template provided by the European Commission’s AI Office, the likes of OpenAI, Google, and xAI have failed to do so. As Zuzanna Warso and Paul Keller from Open Future have noted, some companies have published a paragraph or two about training data along with other model documentation. At best, this is in line with the decreasing level of transparency around training data from leading AI developers that we have gotten used to. It is not even close to what they are legally mandated to publish.
Presumably, they are exploiting a legal gray area here: while the obligation to publish the summary went into effect last August, the European Commission lacks the power to enforce it until later this year. Companies may thus be walking a fine line between complying in good faith and giving up as little valuable information to the public and to competitors as possible. This also calls into question the willingness of some large developers to meet the commitments they made when signing onto the AI Act’s Code of Practice for general-purpose AI developers, which likewise underscores the importance of the public summary of training data.
Drawing on research such as ours, and on expertise from the broader community of researchers and watchdogs, will help the AI Office ensure rigorous, objective, and fair enforcement of these rules. Against the backdrop of our findings, it becomes all the more important that the EU AI Office prepares to act against blatant non-compliance, particularly by the AI industry's behemoths. Ensuring compliance and transparency should not come at the expense of smaller developers acting in good faith. And neither should following the law be viewed as optional.
The research findings referenced above can be accessed here.