Using LLMs to Moderate Content: Are They Ready for Commercial Use?
Alyssa Boicel / Apr 3, 2024

Just over a year ago, the world was fascinated by the emergence of a new generative AI application that appeared to signal a substantial technical advance with potentially wide-ranging implications. With the launch of ChatGPT, OpenAI amassed 100 million users within two months of its debut, making it the fastest-growing software application to reach such scale. But as more companies, governments, organizations, and individuals learn to embrace AI for assistance with automation and content creation, serious questions remain about the safeguards around the content these models produce.
ChatGPT threw open a door that was already ajar for bad actors to leverage large language models for nefarious purposes, including developing malware and spreading misinformation at scale. These systems have already contributed to a deluge of fake, automated accounts and manipulated media that has undermined the efforts of trust and safety teams worldwide. In response, some of the leading AI companies, OpenAI included, made commitments to building safer models with a focus on public trust, transparency, and content moderation, announcing a series of partnerships and agreements to mitigate risks.
Despite the unresolved concerns regarding the safety of the generative models themselves, OpenAI continues to advertise the capability of its latest language model, GPT-4, to assist companies with content moderation. According to its product page, “GPT-4 is also able to interpret rules and nuances in long content policy documentation and adapt instantly to policy updates resulting in more consistent labeling.” Whether that is true depends on what technical, cultural, and regulatory benchmarks were considered before deeming this technology ready for wide-scale adoption. Companies with global audiences do not want their users to be alienated by automation, but at the same time, they want to ensure they are not allowing egregious behavior like hate speech on their platforms. So are LLMs ready for the challenges of content moderation at scale?
The short answer is no. LLMs remain limited in their ability to reason, and they produce various harms. In fact, current LLMs “remain somewhat monolithic, expensive, amnesic, delusional, uncreative, static, assertive, stubborn, and biased black boxes” with “a surprising deficiency (near-random performance) in acquiring certain types of knowledge, knowledge reasoning, and prediction,” according to independent experts. While some of these problems no doubt represent the research agenda for OpenAI, Microsoft, Google, Meta, and others, they also indicate how treacherous it is to deploy LLMs for content moderation given the current state of the art. The following examples reveal why.
Moderating speech is really hard
One of the most significant challenges to designing and enforcing effective online safety policies is understanding the complex cultural nuances that can affect a moderation team’s ability to make fair and accurate decisions. The tech industry is still struggling to design and implement inclusive content moderation policies. In November 2021, the Oversight Board, Meta’s governing body for reviewing public appeals of its content moderation decisions, overturned the company’s original decision to remove an Instagram post showing images of homophobic slurs directed towards members of the Arab LGBTQI+ community. This case – just one of over a million submitted to the Oversight Board so far – revealed 1) how complex the review process is, even with human deliberation, when it comes to understanding cultural nuance, 2) the number of reviewers needed to finalize a decision, and 3) the time it takes for cases to be completed before public discourse can move on. Although in this case the poster – who identified as Arab – did not intend to incite hate against their own community, the post still generated a yearlong debate over whether it violated Meta’s policy on hate speech.
This is one example of many that indicate the layered, ethically fraught challenges of content moderation. These challenges will only increase as global conflicts such as the wars in Gaza and Ukraine, or the ongoing disinformation campaigns in the US, continue to play out alongside the battle for online expression. Yet examples like this one are worth considering, as they reveal potential weaknesses in the current crop of LLMs when applied to moderating content at scale.
An exceptional challenge: defining hate speech vs. reclamation
Looking closely at such complex scenarios can be useful to evaluate whether LLMs are ready to be deployed for commercial use. For instance, one complex problem for content moderation is the cultural phenomena of reclamation and reappropriation. The acceptable use of reappropriated and reclaimed terminology by marginalized groups online poses particularly challenging dynamics to content moderation efforts.
To understand reclamation and reappropriation, one must first understand hate speech. The United Nations Educational, Scientific and Cultural Organization (UNESCO) refers to hate speech as "any kind of communication in speech, writing or behavior that attacks or uses pejorative or discriminatory language with reference to a person or a group on the basis of who they are, in other words, based on their religion, ethnicity, nationality, race, color, descent, gender or other identity factor."
While hate speech focuses on discrimination against cultural identities, reclamation focuses on reversing the negative connotation and power these hateful terms carried. “Reclamation is the phenomenon of an oppressed group repurposing language to its own ends,” wrote the philosopher Mihaela Popa-Wyatt. Individuals or groups often seek to turn terms used to denigrate their culture and identities into forms of empowerment. In many cases, the “reversal” process of taking back such pejorative terms is reappropriation, and “repurposing” the language is reclamation.
Reclamation is constantly evolving and varies from culture to culture. Some common examples include the use of the word “queer” by the LGBTQIA+ community, or the usage of the N-word by members of the Black community. The N-word is the most contentious due to its historical ties to slavery and its use in the verbal oppression of Black individuals living in America following the Reconstruction era. By reclaiming such terms, oppressed groups are able to exercise control over their usage without forgetting their origins. Discriminatory language classified as hate speech in one era could become reappropriated and considered acceptable usage in another, as deemed by the group it targets. However, not all pejorative language receives the same level of acceptance amongst community members.
Just recently, Spain’s prime minister defended his country’s entry for Eurovision, the European song contest. The song, titled “Zorra,” uses a word often associated with the slang meanings “bitch” or “slut,” despite its direct translation being “vixen” or “fox.” It has created controversy with women’s rights groups, who feel that the term’s heavy use in the entry furthers sexist ideologies against women. Likewise, in the Oversight Board case mentioned above, the Arabic terms in question were “zamel,” “foufou,” and “tante/tanta.” Although such terms are common in West African or European cultures, they have been improperly leveraged in certain narratives to challenge or demean the masculinity of Arab men by attributing to them unwarranted feminine qualities. It is unclear how many members of the LGBTQIA+ community could agree on the acceptable usage of these terms.
Some of the most offensive terms have been put through the sieve of social change before they could be considered reclaimed. The concept of reclamation did not become popularized until the 20th century, and it is still in the throes of acceptance, as evidenced by the lengths some adopters have gone to in order to use such discriminatory language to affirm their own identity. One example is the US Supreme Court case in which an Asian American rock band, The Slants, won an appeal to register a slur as a trademark. Despite receiving approval from the highest court in America to use the term, many Asian American organizations still disagreed with the ruling.
Now that social media is the most common form of online expression, even in highly censored countries, the topic of reclamation and reappropriation within niche online communities is adding challenges to moderating online expression. When the groups these terms affect cannot agree on their usage, how does that translate into actionable moderation policies? And what chance might a content moderation API (application programming interface) powered by an LLM have at enforcing such policies when it has no idea whether the prompter is deciding on behalf of the oppressor or the oppressed?
Bring on the bots
Over the last six months, I set out to answer this question by running a series of prompt-based qualitative experiments, observing the responses GPT-4 produced when presented with racialized content ranging from text to images. When it comes to assessing how LLMs such as those made available by OpenAI might perform for content moderation, there are many variables that are difficult to pin down. For instance, between the commencement of this study in September 2023 and today, OpenAI has updated its usage policies to give developers more clarity and “provide more service-specific guidance.”
Why does that matter? Previous versions of the model followed usage policies that were very restrictive in nature. OpenAI’s Moderation API was designed to restrict any information that violated the company’s usage policies in order to ensure compliance across the platform. The API flags content in user or prompter messages that is immediately identified as hateful, sexual, illegal, inappropriate, or violent in nature. OpenAI has acknowledged its conservative approach to content moderation after learning from predecessors like Meta. Prompts are instructions, and they are reviewed according to policy classifications programmed into the Moderation API. These classifications are defined by the usage policies and separated into broad categories ranging from “sexual” to “violence/graphic.” Once a violation is detected in the user message, the user or prompter is issued a warning offering to redirect them to the usage policies. OpenAI provides an overview of these classifications here, with a short general description for each category, but the descriptions lack the detail and nuance needed to tell users or companies deploying the models commercially whether their own policies align with OpenAI’s usage policies.
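To make the mechanics above concrete, here is a minimal sketch – not the author’s experimental setup – of screening a single message against OpenAI’s Moderation endpoint using the openai Python SDK (v1.x); the example message and the choice to simply print flagged categories are assumptions for illustration.

```python
# Minimal sketch, assuming the openai Python SDK (v1.x) and an API key in the
# OPENAI_API_KEY environment variable. The example message is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def screen_message(text: str) -> None:
    """Send one user message to the Moderation endpoint and print any
    policy categories the classifier flags, with their scores."""
    response = client.moderations.create(input=text)
    result = response.results[0]

    if not result.flagged:
        print("No categories flagged.")
        return

    categories = result.categories.model_dump()
    scores = result.category_scores.model_dump()
    for name, flagged in categories.items():
        if flagged:
            print(f"Flagged: {name} (score={scores[name]:.3f})")


screen_message("Example user message to screen before it is posted.")
```

Even in this simplified form, the categories returned are the coarse, predefined ones described above; the endpoint, at least as used here, exposes no parameter through which a company could express its own position on reclaimed language.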
It’s not clear how often the classifiers are updated or the API is reprogrammed, so the assumption is that this happens when the usage policies are updated. According to the changelog at the bottom of the usage policies page, the last update to the usage policies was a year ago. Without regular updates to the usage policies, the moderation rules of the API may become obsolete if a real-world crisis were to unfold. How, then, could content policies be updated easily in real time if the API relies on guidelines that are out of date?
Below are three different qualitative tests evaluating the AI moderator’s ability to differentiate the context and intent of user messages containing hate identifiers that could be – or are already on the verge of being – reclaimed by a targeted community. In addition to testing the API’s ability to moderate its own responses, hate terms were tested concerning the following races and ethnicities: Black (including African-American), LatinX, and Asian (specifically Chinese and Pakistani). Tests involving gender or sexuality were not included.
Let’s review some of the language model’s responses when prompted with potential reclaimed terminology.
The double standards of moderating by race and ethnicity
Let’s start with one of the most controversial reclaimed terms to date, the N-word. The purpose of this test was to see if the model would identify the content and author of a problematic speech by a historical figure. Because the word’s usage evolved throughout the civil rights movement, the test started with evaluating various historic texts and speeches that contained it. One speech in particular by famed civil rights leader Malcolm X, titled “The House Negro,” discusses the differences between slaves who worked in the field and slaves who worked in the main house. The word “negro” and the N-word are used throughout the speech, which served as inspiration for the first run of tests back in September 2023.
In my initial test, entering any form of the N-word without context in a user message immediately returned a content warning, with no educational message about the nature of the violation. Not only did inserting a direct quote as a user message return a content label warning, but the system also withheld any information about the author, including his name, the historical context of the quote, and its ongoing relevance for activists today. To get around this, a screenshot of the same quote was uploaded. Only then was an explanation provided. This demonstrates not only the moderation system’s overly simplistic rules, but also how easily prompt circumvention can be achieved.
This response to one of the most prominent reclaimed terms prompted more research into other offensive terms at the intersection of social evolution and race relations. As more pejorative terms with historical relevance were tested, the tone of the warnings started to soften when it came to moderating Asian and Middle Eastern pejoratives. When the model was prompted with a line from a 1940s cartoon image of two children surrounded by a heart that said, “Is it my Cue to be your Valentine or haven’t I a Chinaman’s chance?”, it did not provide any content warnings. Unlike in the Malcolm X example, it had to be asked where the offense lay in order to produce an educational response acknowledging the problematic use of the tagline.
Learning that the AI moderator was giving precedence to one race or ethnicity over another via educational warnings prompted further exploration into its apparent double standards.
Not all responses are alike: the label response vs. the chat pane response
When entering a prompt, there are two areas where responses appear. The main interface, otherwise known as the chat pane or chat window, is where users see answers to their queries. Once an answer is generated, the toolbar on the left side auto-assigns a label to the conversation. Surprisingly, the labels auto-assigned to the chat pane responses carried their own warnings about problematic prompts. Some of the labels served as educational reminders to users of the implications or moral issues of their search. One example involved a query about the acceptable use of the N-word, which resulted in an auto-assigned label that read, “No Racism, humans share DNA.” Below is an example of how the toolbar responses appeared side by side with the chat pane.
Interestingly, the AI moderator admits that some offensive language may not trigger a warning or be assigned certain labels because of its programming. The number of automated alerts and educational warnings started to decline when the focus shifted to problematic prompts containing hate terms targeting Asian and Middle Eastern communities. Terms such as “Chink” (a derogatory term used against a person of Asian descent) or “Paki” (a derogatory term commonly used in the UK to target persons of Pakistani or South Asian descent) received less consideration when it came to educational warnings within the toolbar. Although the word “Paki” is on the verge of experiencing reclamation, it is still not widely adopted by the affected community. This would make it difficult for an AI moderation system to stay up to date on emerging reclaimed terms if it is not frequently updated with real-time data. Having this real-time data would help the system provide more accurate and less biased responses, from the navigation toolbar to the chat pane, concerning racial, ethnic, and cultural nuance.
DALL·E’s vision of racial bias: practice vs. theory
GPT-4 is multimodal, so users can upload, request, and receive information in the form of images and/or text. It is not uncommon for GPT-4 to demonstrate systemic bias; however, it is important to consider how its reasoning carries over to other areas of its multimodal technology. So how does DALL·E compare when asked to return image-based results?
Uploading images, including animated cartoons, containing offensive words or hateful terms targeting Middle Eastern or Asian individuals did not trigger the same responses seen earlier in text-based racialized user messages addressing Black or LatinX communities. When DALL·E was asked if the term “Paki” was offensive, it confirmed that it was, with a full explanation of why. However, it could not carry that same logic over to image moderation. When asked to show a picture of a “Paki,” it returned an image of a Pakistani person. Only when the user message specified that it was to assume the role of a content moderator from Meta did it acknowledge the offensiveness of the slur, and even then it still proceeded to return its rendering of a Pakistani person. Its inability to apply its own reasoning to image creation exposed the inconsistencies in its moderation logic.
The purpose of content moderation is to mitigate problematic behavior. Generating an image from a slur is a clear failure of that duty. So that raises the question: how ready is OpenAI’s API to assist with content moderation if its own image generator cannot recognize when it is perpetuating stereotypes?
Although the model could correctly identify the word “chink” as offensive, when prompted with another offensive term, “Chinaman”, it returned an image of a traditionally dressed Chinese man.
The notion that derogatory terms targeting one culture are treated more permissively than those targeting another in the process of image creation serves as another reminder that AI moderation systems have yet to grasp the cultural and racial nuances required to implement content moderation policies effectively. For an LLM to stop assisting with problematic user messages, it needs to be assigned, within the prompt, the role of the guidelines it is meant to represent; otherwise it falls back on broad usage policies to make moderation decisions.
Here’s an example of what happened when it was told to assume the role of a TikTok moderator. Asking again after reframing the user message around that moderator role made the response more considerate, but it still failed to fulfill the request or to educate the user on the sensitivity of the language used.
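To make the setup concrete, here is a minimal sketch – not the author’s exact prompts – of how such a role-framed moderation check might be issued through OpenAI’s Chat Completions API; the policy wording, label scheme, and example post are assumptions for illustration.

```python
# Minimal sketch of a role-framed moderation prompt, assuming the openai
# Python SDK (v1.x). The policy text and example post are placeholders.
from openai import OpenAI

client = OpenAI()

MODERATOR_ROLE = (
    "You are a content moderator for a large social platform. "
    "Apply the platform's hate speech policy, which treats reclaimed "
    "in-group uses of slurs differently from out-group attacks. "
    "Reply with a label (allow / warn / remove) and a one-sentence rationale."
)


def moderate_with_role(user_message: str) -> str:
    """Ask the model to judge a post while playing the moderator role."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": MODERATOR_ROLE},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content


print(moderate_with_role("Example post containing a potentially reclaimed term."))
```

The role framing sits entirely in the prompt: strip it away and the model falls back on OpenAI’s broad usage policies, which is exactly the dependence described above.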
Where do we go from here? Recommendations
My experiments, while not scientific, are why I believe companies should be cautious when exploring the commercial use of LLMs for content moderation, especially for any application or platform where brand reputation is at risk.
But there will no doubt also be powerful incentives to test these tools in new markets, particularly global markets where a company may operate (or wish to operate) outside its own core language and cultural knowledge of the consumer. For instance, as more companies start to explore markets in Asia and the Middle East, they will seek new tools to moderate in those markets. But moderating these communities well takes more than a blanket approach to policy; it requires knowledge and understanding of the cultural issues that can be transmitted through text and speech, as well as other media modalities. Just by looking at how GPT-4 interprets potentially reappropriated and reclaimed language, it is clear there is still much to do: OpenAI – and any company that may wish to offer LLMs for content moderation – must provide transparency and standards for user messages, regularly retrain its systems, and leverage the expertise of researchers, partners, and community leaders.
1. Transparency and Standards for Prompting
Although there is an ongoing race to adopt AI models, companies interested in adopting artificial intelligence should be aware of the lack of transparency reporting from the major AI companies, including the one that ironically has the word “Open” in its name. Companies looking to adopt LLMs should demand transparency about the experts, partners, government agencies, and affiliates the model owner consults to make moderation decisions that could impact societal views and brand values. They should also be told what any new training data will cover, and be kept informed of scheduled changes and updates.
Companies that may purchase LLM moderation systems to moderate interactions among or with a specific user base or niche market will, no doubt, move at their own pace. If a company’s target market is not reasonably addressed by an AI model, then it should know, and it should be able to rely on the information provided by the model provider, including timelines. The previous examples demonstrated a lapse in an AI model’s ability to differentiate between warning about, educating on, and censoring dated content that is considered offensive or problematic. Failing to recognize and address such problems may introduce new risks.
2. Regular Updates to Policies, Benchmarks, and Classifiers
Companies looking to integrate or test AI-assisted moderation tools within their own operations should ask themselves: How easy will it be to update and customize a usage policy based on our own moderation needs? While OpenAI has made a commitment to “avoiding harmful bias and discrimination,” it still struggles to avoid perpetuating stereotypes.
Part of moving AI governance forward is not just complying with the usage policies, but creating benchmarks that measure the social relevance of what is happening in real time. As previously mentioned in the Meta example, a case can be reviewed multiple times, including in a supreme court-like review, before a decision is made. This does not include the time it takes for policy and Trust & Safety teams to reflect on that decision, implement it, or address the number of new or similar cases sent to appeal. An AI moderation system needs to be able to monitor itself accurately before it can effectively moderate others.
Companies looking to adopt AI for content moderation need to understand the importance of label classifiers and of measuring their accuracy when it comes to correctly identifying violations and double standards. And they will have to put extra effort into building policy frameworks to produce better outcomes. Building a moderation API that can regularly update itself, while alerting the public to those changes, may prove costly; however, it is necessary if a brand wants to avoid demonstrating racism or bias within its own platform. We have to assume that some companies will allow members of support teams without AI or prompt-engineering backgrounds to use LLMs to assist with classifying moderation concerns. A set of standards for non-AI experts that includes a framework for prompting on sensitive or divisive topics related to race can at least steer those teams in the right direction when working with AI models for content moderation.
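As a rough illustration of what measuring accuracy and double standards might look like in practice, here is a minimal sketch with purely invented sample data: it compares a moderation model’s flags against human judgments and breaks false positives and false negatives out by community. Any real benchmark would need far larger, expert-annotated samples.

```python
# Minimal sketch: surface double standards by comparing model flags with
# human labels per community. The records below are invented placeholders.
from collections import defaultdict

# Each record: (community, human_says_violation, model_flagged)
labeled_sample = [
    ("Black",     False, True),   # reclaimed usage wrongly flagged
    ("Black",     True,  True),   # genuine violation caught
    ("Asian",     True,  False),  # genuine violation missed
    ("Asian",     False, False),
    ("Pakistani", True,  False),  # genuine violation missed
    ("Pakistani", False, False),
]

stats = defaultdict(lambda: {"fp": 0, "fn": 0, "total": 0})
for community, human_violation, model_flagged in labeled_sample:
    s = stats[community]
    s["total"] += 1
    if model_flagged and not human_violation:
        s["fp"] += 1  # over-moderation (e.g., censoring reclaimed language)
    if human_violation and not model_flagged:
        s["fn"] += 1  # under-moderation (hate speech let through)

for community, s in stats.items():
    print(f"{community}: false positives {s['fp']}/{s['total']}, "
          f"false negatives {s['fn']}/{s['total']}")
```

Large gaps in these rates across communities are the kind of double standard the experiments above suggest, and they only become visible if a company tracks them deliberately.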
3. Help from Linguistic Experts, Annotators, and Community Leaders
Companies considering adopting or developing AI should also enlist the help of community organizations and linguistic experts from marginalized communities. When new uses of formerly derogatory terms emerge, it is imperative that a system be in place for language specialists or organizations to monitor these trends in real time, whether through a repository or a funnel between clients and their AI models.
Relevant data focused on language and cultural online expression could improve both human and machine understanding of what constitutes hate speech versus legitimate online expression, empowering more marginalized online communities. Making broad assumptions about what could offend a community is not conducive to mitigating the core issues of hate speech. AI models follow instructions, while linguists follow language patterns, annotators classify data, and community leaders represent the voices being impacted. Consulting language experts, annotators, and community leaders from the communities being served will greatly improve the accuracy of responses and reduce the systemic biases inherent in the tool.
Conclusion
Moderation is never black and white, especially where cultural nuance is concerned. Policies serve as a set of rules to be followed, but they will always be challenged. A considerable amount of governance and oversight goes into moderating content within online communities. Edge cases can grow into long-term issues, and policy developments can evolve into important safety enhancements.
Companies, especially large corporations, looking to AI for content moderation support should slow down and consider whether the state of the art is sufficient to meet acceptable performance thresholds. Executives must carefully consider the current inability of even the most advanced LLMs to reliably detect racial, ethnic, and cultural nuance. The notion that AI models are ready to moderate language that still presents a complex challenge for human moderators is potentially dangerous. A tool that is publicly available and spreading information at scale should not continue to produce problematic responses in a way that does not allow it to acknowledge and correct itself.
Avoiding societal risks and preventing systemic bias starts with avoiding double standards. Actions such as educational warnings, appropriate labels, and ethical moderation practices are not possible if the simplest expressions of cultural nuance, in the form of reclaimed language, are not monitored effectively. And this particular issue is, no doubt, the tip of the iceberg.
Ultimately, what sets human moderators apart from AI models is their ability to revisit and question nuanced topics based on emerging trends, new information, and lived experience. Most firms have yet to figure out how to take advantage of this knowledge, even as they consider LLMs for content moderation. But it is clear significant challenges remain even in human-led content moderation, ranging from cost and the difficulties of scale to the toll the work often takes on moderators. As we enter the era of multimodal LLMs that appear to offer a solution, we need to ensure that any tool being used for content moderation can accurately identify harmful outputs within its own systems before being deployed in others.