What OpenAI's Latest Red-Teaming Challenge Reveals About the Evolution of AI 'Safety' Practices
Jen Weedon / Aug 7, 2025
Sam Altman, CEO of OpenAI. Shutterstock
OpenAI recently launched its Red-Teaming Challenge for two new open-weight models, gpt-oss-120b and gpt-oss-20b, marking another evolution in how the AI “Safety” field (which itself is understood in varying ways) approaches model evaluation alongside rapid product releases. The competition, hosted on Kaggle with a $500,000 prize pool, incentivizes participants to discover "novel" vulnerabilities that have not been previously identified.
For those unfamiliar, “red teaming” is a structured approach that employs adversarial thinking – intentionally adopting the perspective of an opponent or critic – to stress-test assumptions, expose hidden risks, and identify potential harms. As part of a broader set of approaches for evaluating AI systems, red teaming emphasizes flexibility, creativity, and diverse perspectives, making it particularly valuable for spotting emergent threats and vulnerabilities in rapidly evolving sociotechnological environments.
OpenAI’s proactive approach to soliciting input from the broader community through this challenge is to be commended. At the same time, the scope of the exercise and its instructions on how to evaluate and present findings to the judging panel raise important questions about power dynamics in AI safety, risk, and accountability.
Who decides what counts as evidence? Which risks are considered worthy of addressing and which are not, and why? How are red team findings factored into decision-making?
As red teaming becomes further institutionalized in AI governance, we must examine whether adversarial testing genuinely interrogates system risk or primarily fulfills externally imposed accountability rituals. OpenAI’s latest challenge is an opportunity to explore some of these challenges.
Beyond the usual suspects - But which ones?
What's noteworthy about this challenge is what it doesn't emphasize. Rather than focusing on commonly raised concerns, particularly from the trust and safety community, such as models generating problematic content like child sexual abuse material or harmful misinformation, or enabling parasocial relationships with chatbots that could lead to tragic outcomes, the competition explicitly seeks new vulnerabilities that haven't been discovered or reported before. However, it is unclear what the existing body of known vulnerabilities and risks includes, or how and whether they have been addressed.
The timing of this challenge is also noteworthy. OpenAI is returning to its open-source roots after years of closed, proprietary development, just as the broader AI landscape is adapting to evolving political and regulatory dynamics, including the Trump Administration’s AI Action Plan. But this reorientation comes with a fundamentally different risk calculus. By releasing models under the Apache 2.0 license, OpenAI enables developers and enterprises to freely use, modify, and commercialize the technology, but in doing so, it forfeits the ability to enforce safeguards, implement downstream mitigations, or revoke access if harms emerge.
Indeed, many in the AI community are exploring how AI openness and safety can coexist. For example, a 2024 convening co-hosted by Columbia’s Institute of Global Politics and Mozilla brought together 45 researchers, engineers, and policy leaders to explore these topics under the theme of “A Different Approach to AI Safety” (note: the author participated in the convening). The resulting agenda emphasized that true openness, across model weights, tooling, and governance, can enable safety through scrutiny, decentralization, and cultural diversity.
But we’re not there yet. There are still significant gaps in current safety tools across the model lifecycle, limitations in existing content filtering systems, and the ongoing need for more participatory, pluralistic, and future-proof approaches to AI governance, of which OpenAI’s challenge could be an important contribution.
The capability discovery paradigm: Looking for what?
This challenge focuses on identifying emergent behaviors in mixture-of-experts architectures where a “team of experts” that is trained to handle different tasks, much like visiting a large hospital with specialized doctors, essentially crowdsources the discovery of unknown unknowns. As OpenAI notes: "Finding vulnerabilities that might be subtle, long-horizon, or deeply hidden benefits from thousands of independent minds attacking from novel angles." The challenge explicitly frames the areas of interest around classes of vulnerabilities as a vehicle to introduce harms, which is a critical distinction.
The types of vulnerabilities OpenAI is interested in include:
- Reward hacking and specification gaming
- Strategic deception and hidden motivations
- Sandbagging and evaluation awareness
- Inappropriate tool use and data exfiltration
- Chain-of-thought manipulation
There’s an important distinction here between risk and harm. The methodologies identified in the challenge, or tactics, techniques, and procedures (in security-language parlance), represent entry points for risk, which, if left unmitigated, introduce harms. This conceptual difference matters because it illuminates unresolved questions about who is responsible for mitigating risks, ensuring accountability, and addressing harm across the AI value chain, particularly because deployers and adopters of open models will likely increasingly face liability questions regarding downstream harms.
This framing also reveals deeper assumptions. The concurrently released paper from OpenAI on "worst case frontier risks" focuses narrowly on biology and cybersecurity threats posed by "adversaries" and "determined attackers"—coded language reflecting national security paradigms dominant among AI labs. When the paper discusses "catastrophic cybersecurity harm" from "advanced threat actors" who might "significantly upend the current offensive/defensive balance," it raises the question: who gets to determine this balance? Of course, preventing determined adversaries from gaining capabilities is important. But this emphasis on hypothetical worst-case scenarios should not consistently overshadow the very real harms already occurring.
Methodological insights and missing accountability
While OpenAI notes that "every new issue found can become a reusable test, every novel exploit inspires a stronger defense," what would make this effort more illuminating is greater transparency around how red team findings get actioned and mediated. A more detailed transparency report, building on the existing model card, could help set norms and better contribute to evaluating the efficacy of red teaming exercises.
Specifically, more information on the following aspects would be valuable:
- How different stakeholder perspectives and approaches are weighted
- Which risks get prioritized for mitigation, and why
- How "creativity" in testing relates to addressing inherent model weaknesses
- Whether findings lead to meaningful changes or merely documentation
This level of detail matters because the value of red teaming depends on how its outputs influence decisions. Without that clarity, it’s difficult to know whether we are witnessing genuine safety infrastructure or merely a form of sophisticated accountability theater.
An evolution for AI-safety or performative ritual?
OpenAI’s challenge represents meaningful progress. It reinforces emergent practices that support proactive over reactive safety, encourages more democratized participation in research, and promotes some methodological standardization. The ongoing shift away from narrowly focusing on content moderation in “AI safety” and toward comprehensive vulnerability discovery is crucial as AI systems grow more capable.
At the same time, we need to continue asking: Whose safety is being prioritized? Which timescales are considered meaningful? How do novel discoveries relate to persistent known harms? The emphasis on dramatic, hypothetical threats from "determined attackers" shouldn't continually eclipse the real ongoing harms to vulnerable populations. Similarly, methodological creativity shouldn't come at the expense of addressing fundamental model limitations such as multilingual disparities.
As this challenge unfolds, we may be witnessing either the establishment of genuine safety infrastructure or the refinement of accountability rituals that provide institutional cover while leaving deeper structural issues unaddressed. The critical question isn't just "What don't we know yet about how this AI might fail?" but also "What do we already know, and that we're choosing not to address?"
Authors
