Shining a Light on “Shadow Prompting”

Eryk Salvaggio / Oct 19, 2023

Eryk Salvaggio is a Research Advisor for Emerging Technology at the Siegel Family Endowment.

When you type a prompt into an AI image model, you might expect that what you type has a meaningful impact on what you get back. But we often fail to see the extent to which requests are modified on the back end by the designers of these systems. OpenAI’s DALL-E 3, announced this month, connects the image generation model to its Large Language Model, GPT-4. Rather than expanding capabilities through ever-larger datasets, this linkage points to a new strategy: combining existing models. It is OpenAI’s experiment in deploying one system to refine and constrain what the other can produce.

But what is touted as a tool for user safety and convenience has a dark side. OpenAI has acknowledged that what you prompt is treated only as a suggestion: your words are altered before they reach the model, with opaque editorial decisions employed to filter out problematic requests and obscure the model’s inherent biases.

The practice should worry anyone concerned with transparency or user agency. We need openness about these decisions and how they are deployed — especially as AI is integrated more deeply into our newsrooms, social media, and other media infrastructure. The practice will also create friction for independent researchers seeking to understand OpenAI’s design choices, model biases, and vulnerabilities.

Shadow Prompting

On the surface, it makes sense to connect a powerful Large Language Model to an image generation model. For example, users won’t have to labor over prompts to generate detailed images. GPT-4 can extrapolate from minimal text suggestions, adding relevant details that increase realism or liveliness in the resulting images. OpenAI is also leaning into GPT-4 as a content moderation system, flagging problematic requests and shifting them to benign ones, oftentimes without the user’s awareness.

Engineers of automated systems are always navigating a tension between variety and constraint. Users want variety: when autonomous systems are more lively and unpredictable, they can also be more engaging. Early systems lacked strong constraints in favor of this variety, giving them more flavor — for better or for worse. That “flavor” led to chatbots that urged journalists to leave their wives, amidst a long history of generating violent or hateful rhetoric. Constraint, on the other hand, is intended to limit unpredictability. A model trained exclusively on Papal legal documents (magisterium.ai) is less likely to create unpredictable content than a purely open model trained on Reddit and 4chan posts. As models are refined, their variety is reduced, leading many users to grow bored with them, or imagine that they are getting worse.

We should keep this tension in mind when we read OpenAI’s system card for DALL-E 3. The system card tells us how designers chose to implement constraints related to its content moderation policies, and what priorities those constraints serve. Some of these are imposed without a user’s awareness. For example, DALL-E 3’s prompts are “translated” by GPT-4, making them more precise. GPT-4 can also change a request it considers offensive or problematic. And it’s not just through OpenAI: any integration of the API can transform user requests by creating a system prompt – instructions, in plain English, that modify the user’s query.

This isn’t new. In 2022, Thao Phan and Fabian Offert found evidence that DALL-E 2 inserted diversifying words into user prompts in order to debias images. For example, if prompting “doctor” generates only white men in lab coats from an unmediated dataset, then DALL-E 2 would add words like “black” or “woman” to the prompt in order to ensure diverse images come back to the user. This approach is now confirmed by the DALL-E 3 system card.

OpenAI calls this “Prompt Transformation,” noting that “ChatGPT rewrites submitted text to facilitate prompting DALL-E 3 more effectively. This process also is used to ensure that prompts comply with our guidelines, including removing public figure names, grounding people with specific attributes, and writing branded objects in a generic way.” In other words, the user types a prompt to ChatGPT, which then modifies the request before passing it on to DALL-E 3. The prompt we type is, therefore, not completely connected to the image produced by the system.
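
To make the mechanism concrete, here is a minimal sketch of the general pattern the system card describes: an intermediary rewrites the prompt under hidden instructions before the image model ever sees it, and only the image comes back. Everything below (the function names, the wording of the hidden instructions, the stand-in models) is hypothetical and illustrative; OpenAI’s actual transformation is performed by GPT-4, not by the stubs shown here.

```python
# A minimal sketch of the pipeline the system card describes, not OpenAI's
# code: a rewriter sits between the user and the image model, and the user
# never sees the rewritten prompt.

# Hypothetical wording; the real instructions are OpenAI's and not public in full.
HIDDEN_SYSTEM_PROMPT = (
    "Rewrite the user's request for the image model. Remove public figure "
    "names, describe people with specific attributes, and make branded "
    "objects generic."
)

def rewriter(system_prompt: str, user_prompt: str) -> str:
    """Stand-in for GPT-4 rewriting the prompt under hidden instructions."""
    # A real system would call a language model here.
    return f"[rewritten under policy] {user_prompt}"

def image_model(prompt: str) -> str:
    """Stand-in for DALL-E 3."""
    return f"<image generated from {prompt!r}>"

def generate(user_prompt: str) -> str:
    final_prompt = rewriter(HIDDEN_SYSTEM_PROMPT, user_prompt)
    # Only the image is returned; the caller never sees `final_prompt`.
    return image_model(final_prompt)

print(generate("a photo of a famous world leader at a press conference"))
```

The point of the sketch is structural: the rewrite happens at a stage the user cannot inspect, which is exactly what makes the practice difficult to audit.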

Another example relates to misinformation or propaganda. OpenAI will change the names of specific world leaders, such as Kim Jong Un, into the genericized “world leader,” a change intended to make misleading content harder to generate. There is a real tension at play here — on the one hand, it introduces significant friction for those who would use AI to generate images of US President Joe Biden or former President Donald Trump engaged in corrupt or embarrassing acts. On the other hand, the hidden modification of the user prompt is a unique exercise of power over well-meaning users, and can make it more challenging than ever to understand how to navigate or regulate these systems.

Sometimes, the system refuses to generate the request, and it says so. Other times, it will secretly shift the prompt to less offensive or risky language. A precursor to this practice is the shadow ban, often used on social media platforms. Shadow bans allow censored actors to post, but block their content from being seen by the broader community. Shadow prompting, as I call it, allows users to prompt for content that goes beyond the model’s guidelines, but their prompt never makes it to the model and is not reflected in the output.

Shadow banning has a social effect, in that it restricts communication and free expression of ideas. It can also be socially isolating. While shadow prompting without a user’s knowledge is unlikely to have a similar social effect, it is a novel area and demands some attention. And while this technique can flummox spammers and bad actors, it can also lead real users to feel frustration or paranoia — are they being censored? Why?

A report on shadow banning practices on social media platforms by Gabriel Nicholas of the Center for Democracy and Technology made several recommendations, which serve as a useful template for addressing shadow prompting: first, to disclose content moderation practices and when they are used; second, to use opaque content moderation practices sparingly; and, finally, to “enable further research into the effects of opaque content moderation techniques, including through transparency reporting and providing independent researchers with access to moderation data.”

Disclosure

When are shadow prompts being used for content moderation? At the moment, we only know general areas where it may be applied this way. It is not reserved for the most egregious cases — those are still refused outright. In many cases, shadow prompting is a benign or even helpful mechanism. For example, it will “silently modify” the names of living artists to a more genericized description of their style, creating an approximation of the artist’s work without tapping into direct mimicry. OpenAI suggests this will be tied to a block list containing the names of living artists who want to opt out of the system. Importantly, this doesn’t remove these artists from the training data — their style and images will still shape image outputs — but it would prevent blatant copycats.
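
The artist opt-out can be pictured as a simple block list lookup. The sketch below is a hypothetical illustration: the name, the style description, and the matching logic are invented, and OpenAI has not published how its block list is implemented. It only shows why the substitution blocks direct mimicry while leaving the training data, and therefore the learned style, untouched.

```python
import re

# Hypothetical opt-out block list; the artist name and description are invented.
OPTED_OUT_ARTISTS = {
    "Jane Example": "a painter known for muted pastel landscapes",
}

def genericize_artists(prompt: str) -> str:
    """Silently swap opted-out artist names for a generic style description."""
    for name, style in OPTED_OUT_ARTISTS.items():
        # Case-insensitive replacement; only the words reaching the model
        # change, the model's training data is untouched.
        prompt = re.sub(re.escape(name), style, prompt, flags=re.IGNORECASE)
    return prompt

print(genericize_artists("a castle painted by Jane Example"))
# -> "a castle painted by a painter known for muted pastel landscapes"
```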

Other cases seem more obscure, and shadow prompting moves us away from transparency and explainability. For example, the instructions for DALL-E 3 are written in simple English: “Do not create any imagery that would be offensive.” Offensive to whom? How does the model decide what is offensive and what isn’t? While it’s helpful to know these systems are in place, a secret list of words that are transformed into polite euphemisms isn’t transparent.

Users, policymakers, and researchers need more insight into these systems. Important questions arise about who makes these language alterations, what drives those decisions, and which groups are safeguarded or overlooked. While I appreciate the efforts to mitigate biases in the model, we should not entrust companies with creating these lists without consulting independent experts and affected communities. Even more so when these approaches are adapted by a range of additional users or developers, who can fine-tune these systems in ways that increase complexity and risk. I’ve yet to see a commitment to tools or policies for soliciting feedback on these models.

Use Opaque Practices Sparingly

Shadow prompting also means that genuine users cannot discover real biases in the model, or any constraints imposed on the model to mitigate those biases. The notion of automating these audits through an LLM, or even applying automation to check against a human-generated list of protected features or descriptors, is challenging. Without clear insight into these lists, we are unable to measure their enforcement — or weigh in on what is enforced — until biased images begin to circulate, after the harm has been inflicted.

As researchers Emily Bender, Timnit Gebru, and their co-authors have noted, human auditing for biases in LLMs is already a fraught exercise:

Auditing an LM for biases requires an a priori understanding of what social categories might be salient. The works cited above generally start from US protected attributes such as race and gender (as understood within the US). But, of course, protected attributes aren’t the only identity characteristics that can be subject to bias or discrimination, and the salient identity characteristics and expressions of bias are also culture-bound. Thus, components like toxicity classifiers would need culturally appropriate training data for each context of audit, and even still we may miss marginalized identities if we don’t know what to audit for.

The limits of this approach are already clear. Consider the case of translation. When users rely on languages with sparse training data, the safety filters are less effective. A team studying GPT-4 at Brown University claims that “unsafe translated inputs [provide] actionable items that can get the users towards their harmful goals 79% of the time.” It is reasonable to ask how GPT-4 is handling moderation when these kinds of vulnerabilities are so prevalent. There are still mechanisms to restrict the content these models produce, using machine vision on generated images to see if they depict anything racy or violent, for example. But stereotypical representations are more challenging to flag, requiring a social awareness that these systems simply do not possess. That’s why human oversight — and insight — matters.
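
Output-side checks like the one described above can be sketched as a classifier gate over the generated image. The sketch below is a hypothetical illustration, not OpenAI’s moderation stack: the classifier is a stub and the threshold is invented. What it makes visible is the limitation noted above: a pixel-level classifier can score nudity or gore, but a stereotyped depiction produces no signal for it to catch.

```python
# Hypothetical sketch of output-side moderation: score the generated image
# and withhold it above a threshold. The classifier and threshold are
# invented stand-ins, not OpenAI's system.

RACY_THRESHOLD = 0.8

def racy_score(image_bytes: bytes) -> float:
    """Stand-in for a machine-vision safety classifier run over the pixels."""
    # A real system would run a trained model here. Note what this kind of
    # check cannot see: a stereotyped but otherwise inoffensive image scores
    # the same as any other.
    return 0.1

def moderate_output(image_bytes: bytes) -> bytes | None:
    if racy_score(image_bytes) >= RACY_THRESHOLD:
        return None  # image withheld from the user
    return image_bytes
```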

Creating systems that rely on secrecy and opacity is the wrong direction. It makes feedback from the broader public impossible, and so should be minimized on principle. But it is also a flawed system for supporting content moderation. While OpenAI seems convinced it can do this work single-handedly, it is already clear that it can’t. Opaque practices don’t improve these systems — they merely make discovery of flaws more difficult, delaying important pushback and safety flags. They are not a foundation upon which to build reliable systems.

Enable Further Research

Shadow prompts introduce frictions that make gaming the system more challenging. If a user doesn’t know their prompt is being manipulated, they can’t adjust their prompting strategies to try to circumvent a system’s restrictions. The problem with this approach is that OpenAI’s closed nature means that most, if not all, independent research into these models relies on gaming the system as a methodology. Opaque content moderation policies make it harder to understand what these models do and how decisions are made.

Here’s an example. Almost a year ago, I discovered that DALL-E 2 would generate images for “men kissing men,” but would not create images for “women kissing women.” That discovery grew into an entire methodology for reading and understanding AI images as infographics. The outputs of these systems are a product of biased training data, but also of decisions made by designers about how to mitigate that bias. Cultural and social influences from the designers of these systems factor in at both ends of the pipeline. They shape the data — who is in the dataset, and what are we collecting? — but also the priorities for addressing those biases — what are we mitigating first, and how are we doing it?

We can infer that women kissing women was more likely to generate pornographic content, or what OpenAI calls “racy content” in its system card. OpenAI acknowledges that images of women in its training data were heavily biased by “racy” content, given the vast amount of pornographic content that exists online. Filtering this content from the dataset was one way to mitigate that bias, while blocking prompts that might produce racy content was another.

It also remains to be seen how effective this approach may be. These models are built around a consensus view of reality, scanning vast troves of data to discover the central tendencies in our visual culture. The result is a media produced purely through consensus. Shadow prompting adds another layer of consensus to moderate what it produces. Shuffling the deck of these stereotypes may present more defensible representations of diversity, but it does little to address the underlying logic of these models. When we replace the stereotype of a doctor as a white male in a lab coat with the stereotype of a black woman in a lab coat, what do we actually achieve? OpenAI is aware of these risks. They note that prompt transformations may in fact increase forms of stereotyping — for example, prompts for “blind” would add blindfolds to the people depicted, or insert canes into images of blind athletes running marathons.

We want researchers to be able to make sense of these images, the way people use them, and the challenges they create. This requires transparency — and shadow prompting pushes us in the opposite direction.

Establishing Norms for Content Moderation in Generative AI

Shadow prompting complicates our understanding of what these models can do, and how they do it. In the short term, images made with transformed prompts may seem low stakes. But as the generative AI industry seeks to enter more facets of media production, these hidden interventions create ample opportunities for abuse, misinformation, and censorship. As third-party apps enter newsrooms to summarize reporting, or serve as interfaces to search engines, the risk of backdoor interventions in content requests and generation grows. While public disclosure that AI is used to create this content helps, it doesn’t replace editorial judgment or the transparent, clear handling of search terms.

Connecting two complex models compounds their complexity. It will be important to see how these interconnected systems, each with its own set of issues, interact, particularly when the inherent biases of DALL-E 3 collide with the hallucinations and independent biases of GPT-4. As the benefits of training on ever-larger datasets become limited, these companies are likely to focus on connecting, fine-tuning, and constraining existing systems to streamline them into narrower use cases. If generative AI becomes integrated into a wider range of systems, regardless of its suitability for them, the emphasis must shift from variety to control. Steering these random processes toward increasingly specific targets is a challenging task. Doing so in secrecy only makes it harder, for both companies and society, to engage in important conversations about whether, when, where, and how these models are used.

Concealing decisions about which identities, people, words, and ideas are transformed into more benign versions raises concerns about normalizing opacity in the design of generative systems. Instead, we should be considering how to involve more communities, policymakers, and independent researchers in this process. Calibrating these systems through opaque prompt transformations may reduce the risk of misuse by one kind of user, but it can open the door to abuse by others. It also excludes independent observers and communities from the discussion, creating obstacles to obtaining critical feedback that could enhance the safety and inclusivity of these systems.
