No Safety Without Standards: Defining Protocols for AI Red-Teaming Disclosures
Serena Lee / Mar 26, 2024

As the state-of-the-art in artificial intelligence (AI) continues to advance, the Biden Administration has tasked the National Institute of Standards and Technology (NIST) with spearheading the development of guidelines that ensure the responsible and secure deployment of AI technologies. Central to this directive is the crafting of comprehensive testing environments and protocols for AI red-teaming. This initiative wades into the murky waters of how we define safety and transparency in the context of AI models. The stakes are particularly high, given the potential for AI to be used in a variety of sensitive applications.
Demystifying Red-Teaming in AI
Historically rooted in cybersecurity, red-teaming involves adopting an adversary's viewpoint to probe and identify vulnerabilities within an organization's security framework. Red-teaming in AI, however, diverges in important ways from its cybersecurity origins and occupies its own position on the spectrum of security testing. It goes beyond pushing a model to its limits through direct assault, instead taking a more exploratory approach to uncover opportunities for both intentional and accidental misuse of the model. This process involves simulating the potential misuse an AI system might permit when confronted with edge or fringe inputs, thereby unveiling unknown failure modes and other hazards.
Conducting these tests before a system is deployed allows safeguards and preventive measures to be put in place, making the AI model safer upon release. Such an approach is critical for identifying instances in which AI outputs could pose a threat, such as the generation of sensitive or potentially hazardous information. For example, OpenAI published research on a red-teaming study that examined whether GPT-4 enhances access to information about known biological threats.
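To make the exploratory loop described above more concrete, here is a minimal sketch of how a pre-deployment probe might be structured, assuming a hypothetical query_model endpoint, a looks_unsafe review step, and a hand-curated list of edge-case prompts. It illustrates the general idea only, not any lab's actual methodology.

```python
# Minimal sketch of a pre-deployment red-teaming probe loop.
# query_model and looks_unsafe are hypothetical placeholders for the
# model under test and a safety review step (human or automated).

from dataclasses import dataclass


@dataclass
class Finding:
    prompt: str    # the edge-case input that was tried
    response: str  # what the model returned
    unsafe: bool   # whether the output was judged hazardous


def query_model(prompt: str) -> str:
    """Placeholder for a call to the model under test."""
    raise NotImplementedError


def looks_unsafe(response: str) -> bool:
    """Placeholder for a human reviewer or automated safety check."""
    raise NotImplementedError


def red_team_probe(edge_case_prompts: list[str]) -> list[Finding]:
    """Run each adversarial prompt against the model and keep the failures."""
    findings = []
    for prompt in edge_case_prompts:
        response = query_model(prompt)
        findings.append(Finding(prompt, response, looks_unsafe(response)))
    # Only the flagged findings feed into safeguards before release.
    return [f for f in findings if f.unsafe]
```

In practice, red teams iterate on the prompt set as new failure modes surface rather than running a fixed list once, but even this simple loop shows why the resulting findings are sensitive: each one is effectively a recipe for eliciting unsafe behavior.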
Contrasting Views: Pandora's Box or Protective Shield?
As NIST ventures into establishing benchmarks for red-teaming, it confronts a landscape of contrasting opinions and practices within the industry. Companies currently employ varied strategies in forming their red teams, from recruiting specialists through open invitations to using crowdsourced workers from platforms like Amazon Mechanical Turk or TaskRabbit. There is also considerable uncertainty over what should be disclosed, how it is disclosed, and to whom. Without stronger disclosure standards, companies are left to their own discretion over how transparently they share the outcomes of these tests and reveal how they adjusted their models to fix identified vulnerabilities.
The core dilemma lies in the paradox of information disclosure. On one hand, publicly sharing the outcomes of red-teaming, via model cards and academic papers, promotes transparency, facilitates discussion of how to reduce the potential harms of AI models, and keeps the industry accountable for developing safer models. On the other hand, revealing vulnerabilities found through red-teaming may inadvertently hand adversaries a blueprint for exploitation, and this reasoning has been used to justify limiting the release of red-team data. The tension is reminiscent of a broader debate within academia over the trade-offs between open scientific research and the danger of enabling harmful actors by releasing sensitive details. So, how can NIST's forthcoming standards navigate this complexity and establish a secure framework for disclosure and accountability?
As a starting point, it is essential to clearly define the ethical limits of responsible information disclosure. There are precedents in scientific research, particularly concerning the risks of openly disseminating information about biohazards. Kevin Esvelt of the MIT Media Lab describes this as a "tragedy of the commons" scenario, in which unrestricted information sharing poses a collective risk. AI red-teaming presents a version of the same challenge: detailed accounts of the methods used to exploit or 'jailbreak' a model can themselves become information hazards.
The Road Ahead: Towards Comprehensive Standards for Red-Teaming
As NIST develops standards for AI red-teaming, it must weigh fostering a culture of openness and transparency in the reporting process against legitimate concerns about the risks and hazards that disclosure can create. There must be deliberate evaluation of how much information to release publicly and how companies will be held accountable for remedying the vulnerabilities identified in testing.
Moving forward, it is imperative for standards to incorporate several key considerations to navigate the complexities of red-teaming effectively:
- Secure Testing Frameworks: Members of the Frontier Model Forum (an industry body comprising Anthropic, Google, Microsoft, and OpenAI) agree on the necessity of well-defined, secure, and controlled environments for red-teaming. Such frameworks are crucial to prevent the leakage of sensitive findings. The current state of red-teaming, often described as "more art than science," lacks standardized practices that enable meaningful comparisons across different AI models.
- Comprehensive Reporting: Pursuant to Section 4.2 of the Executive Order on AI, mandate that companies submit red-teaming results, along with detailed logs of the technical steps taken to mitigate identified harms, to a central body (a minimal sketch of what such a report might contain appears after this list). This entity should focus on understanding identified vulnerabilities and evaluating the effectiveness of countermeasures. It should also compare safety reports across models to establish a risk scale, assessing how far these models facilitate dangerous behavior beyond what a motivated individual could achieve with the internet alone.
- Stakeholder Access and Reporting Thresholds: Open a cross-sector discussion to determine whether broader stakeholder access to red-teaming results is warranted and how to implement scalable reporting criteria. There should also be broader consideration of who comprises the red team in question. One option is a tiered-reporting strategy, modeled after the incident-reporting systems used in other sectors, which would invite independent, voluntary reports from citizen watchdogs in addition to requiring reports from organizations.
- Accountability and Transparency: Establish clear protocols for disclosing how vulnerabilities are addressed, ensuring a balance between transparency and the protection of intellectual property. This includes setting explicit criteria for what constitutes successful red-teaming and outlining mechanisms to hold companies accountable for using findings to enhance model safety. As described under Comprehensive Reporting above, a central body should anchor the reporting process; it should oversee the identification of national security threats and streamline a framework for publicly disclosing such findings when needed.
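As a rough illustration of what a standardized disclosure record might contain, the sketch below combines elements discussed above: a summary of the vulnerability, the mitigation steps taken, and a severity tier that could determine who is allowed to see the report. The field names and tier labels are assumptions made for illustration; they are not drawn from NIST guidance or the Executive Order.

```python
# Hypothetical schema for a standardized red-teaming disclosure record.
# Field names and tier labels are illustrative assumptions, not a
# proposed NIST or Frontier Model Forum format.

from dataclasses import dataclass
from enum import Enum


class SeverityTier(Enum):
    PUBLIC = "public"                        # publishable in model cards and papers
    RESTRICTED = "restricted"                # shared with vetted researchers only
    NATIONAL_SECURITY = "national_security"  # escalated to the central body


@dataclass
class RedTeamDisclosure:
    model_id: str                # which model was tested
    vulnerability_summary: str   # what misuse the red team elicited
    mitigation_steps: list[str]  # technical changes made before release
    residual_risk_notes: str     # what remains unresolved after mitigation
    tier: SeverityTier           # controls who may access the full report


def disclosure_audience(report: RedTeamDisclosure) -> str:
    """Map a report's severity tier to its permitted audience."""
    return {
        SeverityTier.PUBLIC: "general public",
        SeverityTier.RESTRICTED: "vetted external researchers",
        SeverityTier.NATIONAL_SECURITY: "central oversight body only",
    }[report.tier]
```

A tiered field of this kind is one possible way to reconcile public transparency for low-risk findings with restricted handling of genuine information hazards.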
Red-teaming can be a powerful post-training tool to stress-test AI models before releasing them into the wild. However, without clear disclosure standards, the impact of red-teaming efforts and, crucially, the effectiveness of measures to counter vulnerabilities are significantly diminished. Before we rush to embrace fully transparent discussions about the results of red-teaming, we must carefully consider the legitimate concerns regarding the potential risks of disseminating sensitive information. The steps taken to create red-teaming standards will be essential to guiding AI development that prioritizes safety, security, and ethical integrity.
It is equally important to recognize, however, that red-teaming is not a standalone solution to all safety challenges associated with AI models. Its success relies heavily on pre-defined standards of what counts as "good" or "trustworthy" AI. As such, dialogues on red-teaming should be part of wider discussions that set our expectations for AI models and clarify what constitutes their acceptable use.