An Advocate’s Guide to Automated Content Moderation

Dia Kayyali / Feb 12, 2025

Dia Kayyali is a fellow at Tech Policy Press.

“We used to have filters that scanned for any policy violation. Now, we're going to focus those filters on tackling illegal and high-severity violations, and for lower-severity violations, we're going to rely on someone reporting an issue before we take action… We're also going to tune our content filters to require much higher confidence before taking down content.” -Meta founder and CEO Mark Zuckerberg, January 7, 2025

When Meta announced changes to its platform on January 7, press and civil society reactions focused heavily on Meta’s decision to end its fact-checking program in the United States, followed by its relaxation of content moderation policies on issues such as hate speech and misinformation. However, the announcement also made it clear that Meta will ramp up its use of generative AI while reducing other types of automated content moderation, reverting to relying mainly on the anachronistic practice of user reporting.

For years, Meta has eagerly deployed automated and AI-enabled content moderation systems and invested substantial resources in building them, so this is not a small shift. In fact, Meta’s change in approach may represent a change in industry norms as the political and regulatory debate over the moderation of online speech shifts. If a new normal is set to emerge, what role will technologies such as AI and machine learning play?

Automated content moderation is not new. Over the years, automated moderation has been variously criticized, lauded, and described in sometimes technically inaccurate terms. Platforms have frequently refused to provide details or made misleading statements about it, making it harder for policymakers, civil society, and the general public to understand how it impacts society. In fact, anyone who has engaged with staff at social media platforms in the last few years has almost certainly heard them say that civil society “just doesn’t understand” how platforms operate. Civil society and lawmakers have sometimes pushed for more automated moderation–or called for less of it–without understanding precisely what they are demanding.

Even as the technical and policy dimensions of content moderation shift, it is important to ground the debate and for all participants in it to have enough technical fluency to engage in meaningful discussions. Here, I attempt to provide a guide to terms and issues at the intersection of automation, AI, machine learning, and content moderation.

Content moderation at scale

Content moderation encompasses the policies that govern content and the means used to enforce those policies, including remedies ranging from content labeling and removal to account suspension and limits on distribution. Content moderation at scale refers to complex content moderation systems designed to enable platforms to deal with very high volumes of content. These systems can be designed to work with automation and AI, refer content to human moderators, and give them tools to evaluate it and apply remedies.

This article does not address the many concerns related to contract human moderators, such as outsourcing to low-paid and poorly treated contractors, trauma, insufficient instructions, political bias, and more; nor does it dive into shadowbanning or how platforms rank content on feeds. However, policy professionals need to be acquainted with these issues, as well.

Automation, artificial intelligence, and machine learning

Automation refers to fully or partially removing human intervention from the content moderation process. AI is only one of many technologies employed for this purpose. Rule-based prioritization of user complaints, pattern matching on content, and even the interfaces and tools provided to community managers and contract human moderators can all be used to sift through content at scale.

For example, platforms use a technique known as “hashing,” a form of “fingerprinting,” to identify audiovisual media that is very similar to content that has already been identified. Images and videos are algorithmically converted into hashes–alphanumeric strings of data that can be matched against other instances of that content without storing the original. In 2007, YouTube announced the release of Content ID, a content-matching system for potentially copyright-infringing material. In 2009, Microsoft and computer scientist Hany Farid developed PhotoDNA, a sophisticated form of hashing used to detect child sexual abuse material. Later, in 2016, Microsoft, YouTube, Twitter, and Facebook announced the creation of a shared hash database for “terrorist content,” an effort that became the Global Internet Forum to Counter Terrorism.
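
Hashing is simple enough to illustrate in a few lines of code. The sketch below assumes a hypothetical list of previously identified hashes and only catches exact byte-for-byte duplicates using a standard cryptographic hash; production systems such as PhotoDNA and Content ID use perceptual hashing and fingerprinting, which also match slightly altered copies.

```python
# Minimal sketch of hash matching against a database of known violating content.
# The hash value below is an example; real systems use perceptual hashes that
# tolerate small edits, whereas SHA-256 only matches exact duplicates.
import hashlib

KNOWN_VIOLATING_HASHES = {
    "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",  # example
}

def sha256_of_file(path: str) -> str:
    """Return the SHA-256 hex digest of a file without storing the file itself."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_known_content(path: str) -> bool:
    """True if an uploaded file is an exact duplicate of known violating content."""
    return sha256_of_file(path) in KNOWN_VIOLATING_HASHES
```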

Artificial intelligence is a disputed and broad term. It generally refers to computer systems that can simulate human intelligence, but this can mean many things, so it is important to be specific. This article is concerned mainly with machine learning and generative AI.

Generative AI systems, such as ChatGPT, are trained to produce new material of the kind they were trained on, most commonly text or images. ChatGPT is an AI chatbot application built on a large language model. Large language models (LLMs) are a subset of generative AI trained on vast amounts of data to understand and generate human language. Applications of generative AI to content moderation are beginning to emerge. Meta’s January 7 Newsroom post on changes to its policies noted that it has “started using AI large language models (LLMs) to provide a second opinion on some content before we take enforcement actions.” Some researchers argue that, on a technical level, because these models are trained on such vast amounts of data, they could apply many content policies at once more accurately. Unfortunately, many of the underlying issues outlined in this article also apply to generative AI, including that these models can reflect and even amplify underlying social biases.
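
Meta has not published the details of how this “second opinion” step works, but the general pattern is easy to sketch. In the hypothetical example below, a post that an initial system has flagged is only removed if a language model, given the policy text and the post, also answers that it violates the policy. The policy text, prompt, `ask_llm` parameter, and stand-in `fake_llm` function are all invented for illustration and would be replaced by a real model call.

```python
# Hypothetical sketch of an LLM "second opinion" before enforcement. Not Meta's
# actual system: the policy text, prompt format, and model call are invented.
from typing import Callable

POLICY = "Remove direct threats of violence against a person or group."

def second_opinion_removal(post: str, first_pass_flagged: bool,
                           ask_llm: Callable[[str], str]) -> bool:
    """Remove only if both the first-pass system and the LLM agree."""
    if not first_pass_flagged:
        return False
    prompt = (
        f"Policy: {POLICY}\n"
        f"Post: {post}\n"
        "Answer YES if the post violates the policy, otherwise NO."
    )
    return ask_llm(prompt).strip().upper().startswith("YES")

# Stand-in "model" for demonstration only; a real deployment would call an LLM.
def fake_llm(prompt: str) -> str:
    return "YES" if "kill" in prompt.lower() else "NO"

print(second_opinion_removal("I will kill you", True, fake_llm))      # True
print(second_opinion_removal("I disagree with you", True, fake_llm))  # False
```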

Despite the hype around generative AI, the most important term for understanding automated content moderation is still machine learning. Machine learning “involves using statistical learning and optimization methods that let computers analyze datasets and identify patterns.” Machine learning can be “supervised,” where it is based on labeled data sets. “For example,” says one explainer from MIT, “an algorithm would be trained with pictures of dogs and other things, all labeled by humans, and the machine would learn ways to identify pictures of dogs on its own.” It can also be “unsupervised,” where it “looks for patterns in unlabeled data.” For example, a platform could use unsupervised machine learning to proactively identify patterns in political posts after an election or to look for new forms of fraud. It’s important to note that even unsupervised machine learning is still trained by humans.
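
The unsupervised case described above can be illustrated in a few lines of scikit-learn. The posts below are invented; a platform would run something like this over large amounts of real, unlabeled data to surface recurring patterns, such as clusters of near-identical spam.

```python
# A minimal unsupervised example: group unlabeled posts into clusters without
# any human labels. The example posts are invented. Requires scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

posts = [
    "claim your free prize now, click this link",
    "congratulations, you won a prize, click here",
    "the election results were announced today",
    "polling stations closed an hour ago",
]

# Convert posts to numerical features, then group them with no labels at all.
features = TfidfVectorizer().fit_transform(posts)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(clusters)  # e.g. [1 1 0 0]: the spam-like posts land in their own cluster
```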

Perhaps the most well-understood application of machine learning for content moderation is the use of classifiers. These models sort data such as social media posts “into different categories or classes, based on a given set of features or attributes.” Platforms can use datasets to train classifiers to recognize different types of content that violate the law or violate the platforms’ policies.
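
A toy version of such a classifier, trained on a handful of human-labeled examples, might look like the sketch below. The posts, labels, and model choice are invented for illustration; real platform classifiers are trained on far larger datasets with far more sophisticated models.

```python
# A toy supervised text classifier: human-labeled examples in, a model that can
# score new posts out. Requires scikit-learn. All data here is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_posts = [
    "buy followers cheap, limited offer",
    "win money fast, click the link",
    "great match last night, what a goal",
    "the weather is lovely today",
]
train_labels = ["spam", "spam", "ok", "ok"]  # labels applied by humans

# Text features plus a linear classifier in a single pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_posts, train_labels)

print(model.predict(["click here to win money"]))        # likely ['spam']
print(model.predict_proba(["click here to win money"]))  # per-class confidence
```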

As noted above, platforms publicly acknowledged using such technology at a large scale on CSAM starting in 2009, and as early as 2007 used some level of automation to proactively detect potential copyright infringement. These forms of automation were just looking for duplicates and near-identical copies of previously identified content. However, in June 2017, machine learning took a significant step forward with the publication of a landmark research paper by Google scientists that made machine learning systems easier to train and potentially more effective. Days later, Google announced it would “apply our most advanced machine learning research to train new ‘content classifiers’ to detect and remove ‘terrorist and violent extremist content.’” Machine learning very rapidly became a core feature of content moderation, expanding to more and more types of content over the next few years.

These systems grew increasingly complicated, deploying multiple layers of automation and machine learning that can take action in various ways. For example, where platforms use AI to detect child sexual abuse material, they are likely to take that content down immediately and then share the hash with the National Center for Missing and Exploited Children (NCMEC). The hash can then be used by other platforms, which may incorporate it into their own automated moderation systems. Platforms are also likely to suspend a user’s account, or at least freeze features such as posting or commenting, when taking down CSAM.
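
The chain of remedies described above can be sketched schematically. Everything in the example below is illustrative: the function names are invented, and actual reporting to NCMEC and cross-platform hash sharing run through dedicated systems that are not public in this form.

```python
# Schematic sketch of the enforcement chain for confirmed CSAM: take the content
# down, share its hash, and restrict the account. All names are illustrative.
from dataclasses import dataclass

@dataclass
class Post:
    post_id: str
    author_id: str
    content_hash: str

shared_hash_database: set[str] = set()  # hashes shared with other platforms
suspended_accounts: set[str] = set()

def remove_content(post_id: str) -> None:
    print(f"removed {post_id}")

def enforce_csam(post: Post) -> None:
    """Apply the typical chain of remedies once CSAM has been confirmed."""
    remove_content(post.post_id)                 # 1. immediate takedown
    shared_hash_database.add(post.content_hash)  # 2. share the hash (e.g. with NCMEC)
    suspended_accounts.add(post.author_id)       # 3. suspend or restrict the account
```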

On the other hand, platforms have also used hate speech classifiers that could send content to a human reviewer at a lower threshold of confidence (see definition below) and automatically remove content only at a very high level of confidence. These systems can also take other actions. For example, classifiers can be focused on determining whether a piece of content is graphic and, if it is, can automatically add a graphic content warning screen that users have to click through to view the material.
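
This tiered logic is straightforward to express in code. In the sketch below, the thresholds and category names are invented; real systems tune these values per policy area and per market.

```python
# A hedged sketch of confidence-threshold routing: the same classifier score can
# trigger automatic removal, human review, or no action. Thresholds are invented.
REMOVE_THRESHOLD = 0.95   # very high confidence: remove automatically
REVIEW_THRESHOLD = 0.60   # lower confidence: route to a human reviewer

def route_hate_speech(score: float) -> str:
    """Map a hate speech classifier's confidence score to an enforcement action."""
    if score >= REMOVE_THRESHOLD:
        return "remove automatically"
    if score >= REVIEW_THRESHOLD:
        return "send to human reviewer"
    return "no action"

def needs_graphic_warning(graphic_score: float, threshold: float = 0.80) -> bool:
    """A separate classifier can add a click-through warning instead of removing."""
    return graphic_score >= threshold

print(route_hate_speech(0.97))       # remove automatically
print(route_hate_speech(0.70))       # send to human reviewer
print(needs_graphic_warning(0.85))   # True: show the warning screen
```

Raising these thresholds, as Meta’s January 7 announcement describes, means fewer automatic removals but also fewer violations caught proactively.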

How can the use of automation for content moderation be assessed?

The term accuracy has a specific meaning in machine learning, but it is often used loosely to refer to any of several metrics used to assess automated systems in content moderation, including accuracy, precision, and recall, each defined below (a short worked example follows the list).

  • Accuracy measures how well a model does overall: the proportion of all classifications, positive or negative, that were correct.
  • Recall, also referred to as the “true positive rate,” is “the proportion of all actual positives that were classified correctly as positives.” Recall measures whether a model catches every actual instance, but it does not account for false positives.
  • Precision refers to how often the model’s positive predictions are correct; it is calculated from both true and false positives. Another way to think of it is that recall is focused on ensuring every positive instance is caught, while precision is focused on ensuring that what the model flags really is a violation.
  • Threshold of confidence refers to how confident a classifier has to be in its prediction before an action is taken.
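
To make the differences between these metrics concrete, here is a small worked example with invented numbers for a classifier that reviewed 1,000 posts, 100 of which actually violate policy.

```python
# Worked example with invented numbers: 1,000 posts, 100 true violations. The
# classifier catches 90 of them but also wrongly flags 30 benign posts.
true_positives = 90
false_positives = 30
false_negatives = 10    # violating posts the model missed
true_negatives = 870    # benign posts correctly left alone

accuracy = (true_positives + true_negatives) / 1000               # 0.96
precision = true_positives / (true_positives + false_positives)   # 0.75
recall = true_positives / (true_positives + false_negatives)      # 0.90

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

A platform could truthfully describe this classifier as “96% accurate” even though a quarter of everything it removes (1 - precision) is removed in error.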

All of these terms are important because a platform could easily claim, to a lawmaker or civil society representative, that a classifier is “99% accurate” and be referring to any of these measurements. But as some academic and civil society experts on content moderation have pointed out, “For transparency to be meaningful, it has to be targeted—not just increasing information, but communicating in a way that can be used to help hold decision makers to account.” Understanding what each of these measurements means and pushing platforms to explain themselves in technically precise terms is necessary for meaningful insight into what factors are being balanced when platforms deploy automation–and what communities will bear the brunt of any mistakes.

This is important because, as Meta founder and CEO Mark Zuckerberg noted in his January 7 announcement, the European Union has pushed platforms to deploy more and more forms of automation through legislation like the Terrorist Content regulation and some sections of the Digital Services Act (DSA). However, the EU also expects platforms to protect fundamental rights while doing so. But how can these measures be assessed against rights if they are described in misleading terms or in a completely opaque manner? This is part of the reason why the DSA, the UK’s Online Safety Act, and other Internet regulations also have transparency provisions that attempt to help regulators better understand how platforms are using automation, though regulators must go beyond transparency and push platforms to appropriately balance mistakes in content removal against the harms that removal is meant to prevent.

What problems does automated moderation present?

Long before Meta’s January 7 announcement, many large platforms lauded their own efforts to automate content moderation in announcements and to lawmakers. They often did this while admitting that these technologies make mistakes; one number repeated many times, most recently in the January 7 Newsroom post, is a 10-20% error rate. But it is highly unlikely that even a hypothetical 10% error rate impacts all users evenly. Instead, these tools have been developed and deployed in perhaps the worst way possible: to fit “average” users in the “Global North.” They have been designed largely in a US-centric context, in an opaque manner, and often with little regard for human rights, including corporate human rights obligations.

Reams have been written about how biases are baked into AI and about how AI cannot handle diverse languages and cultural contexts. AI underperforms with many languages that are not written in Latin characters or so-called “low resource” languages. AI also appears to have higher error rates in detecting incitement to violence, hate speech, and reclaimed slang in contexts outside the United States and Europe.

AI can move beyond its initial dataset, but whatever it is initially trained on will shape how it works. For example, the algorithms used to detect terrorist and violent extremist content on major social media platforms are explicitly trained to remove content related to organizations designated as terrorist by the United States, a list that largely does not include far-right groups. The inclusion of far-right groups has increased over the years, but these algorithms were likely trained on a disproportionate amount of content in Arabic, with a focus on the Arabic-speaking and Muslim world. This is perhaps why, when Google started using machine learning to detect and remove terrorist content in 2017, civil society organizations noticed and started documenting vast troves of human rights documentation disappearing from YouTube and other platforms.

Platforms have misled lawmakers, civil society advocates, and the general public about the accuracy, efficacy, and configuration of these tools, often relying on omissions, or on audiences’ limited technical understanding, to share misleading statistics. For example, when Google claimed in August 2017 that “[o]ver 75 percent of the videos we've removed for violent extremism over the past month were taken down before receiving a single human flag,” it did not explain how it was defining violent extremism, nor did it share any information about the number of false positives caught up in that wide net. It was left to civil society to answer that question as best it could, and the answer was “a lot.” For example, Syrian Archive’s Lost and Found project helped to restore over 650,000 pieces of human rights documentation from various platforms through direct advocacy with platforms, and it has documented the removal of videos it has archived from YouTube; an April 2019 report found that 13% of the Archive’s videos were no longer publicly available.

Some of this duplicity has been publicly exposed. A trove of documents leaked from Facebook in 2020 included sobering statistics about the platform’s moderation in Arabic, including the detail that “algorithms to detect terrorist content incorrectly deleted non-violent Arabic content 77 percent of the time.” Another document, related specifically to Afghanistan, showed that many hate speech categories were not appropriately translated into local languages and that only 0.2% of this content was taken down by automation.

It is worth noting that while Meta has faced an enormous number of leaks, it has made some efforts to explain how it uses automation in its Transparency Center. The Oversight Board has also revealed a lot of information through its cases, in particular its deep-dive “policy advisory opinions.” In one of those, Meta admitted that the most moderated word on its platforms was “shaheed,” a term that has multiple meanings in Arabic. This demonstrated one way Meta may mislead civil society: for several years, civil society advocates from Arabic-speaking countries had expressed explicit concern about Meta’s over-moderation of the word, and Meta gaslit them about the problem despite how obvious these improper content removals were.

What next?

When it comes to content moderation, intellectually honest people have often disagreed about what content should stay up and what should come down, as well as whether content moderation at scale can really be done in a way that respects human rights or even satisfies most people. Some advocates have long claimed that trying to improve big platforms like Facebook is a waste of time. I used to believe that big social media platforms could be improved even as we invested in alternative ways to provide global town squares.

Now, I’m no longer sure. Frankly, in recent months, major platforms have taken an overtly fascist turn, and they are not worth anyone’s effort. Their leaders have misled civil society and the public countless times and have often reneged on even the small improvements they’ve deigned to make.

But in a world of soundbites, responding accurately and thoroughly to Meta’s recent announcements has been hard. The fact is that automated content moderation is not simply going to go away because President Donald Trump and his supporters want it to. Platforms have invested enormous amounts of time and money into developing what are now massive, sometimes very poorly coordinated systems. Doing away with automation entirely on centralized, massive platforms is, in fact, a dangerous experiment that could have unpredictable impacts on society, including potential offline violence or other harms. However, it is disingenuous to talk about that without acknowledging that platforms have been irresponsible, reactive, and secretive in their deployment of automation.

Instead, now is the time to imagine something better–new social media spaces designed by and for the people who use them and based on consent rather than surveillance capitalism. The solution may be decentralized social media platforms. On these platforms, users’ data can live on different servers connected by an open, shared protocol that allows them to communicate. Of course, if decentralized platforms do not have meaningful moderation, they are just as likely as a centralized platform to be full of incitement to violence, disinformation, and worse.

Content moderation in new social media spaces doesn’t have to exclude AI completely. Like every current application of AI, AI-enabled content moderation is not yet sophisticated enough to shed the biases of its underlying models, but it could be designed and deployed transparently and with human rights in mind. It could be created with the involvement of users at every step of the way, from writing policies to defining and creating data sets to configuring automated systems. In fact, the Design from the Margins methodology, which is centered on community-based research and ongoing relationships, provides a way to “center the most impacted and decentered users, from ideation to production.”

If it is carefully designed and user-centric, perhaps automated content moderation will one day be the force for good that so many have claimed it to be.

If you’re the type of person who wants to read even more, here are some additional resources:

The author wishes to thank Professor Nicolas Suzor at the Queensland University of Technology Digital Media Research Center for reviewing this piece.

