LAION-5B, Stable Diffusion 1.5, and the Original Sin of Generative AI
Eryk Salvaggio / Jan 2, 2024
In The Ones Who Walk Away From Omelas, the fiction writer Ursula K. Le Guin describes a fantastic city wherein technological advancement has ensured a life of abundance for all who live there. Hidden beneath the city, where nobody needs to confront her or acknowledge her existence, is a human child living in pain and filth, a cruel necessity of Omelas’ strange infrastructure.
In the past, Omelas served as a warning about technology. Today, it has become an apt description for generative AI systems. Stanford Internet Observatory’s David Thiel — building on crucial prior work by researchers including Dr. Abeba Birhane — recently confirmed that more than 1,000 URLs containing verified Child Sexual Abuse Material (CSAM) are buried within LAION-5B, the training dataset for Stable Diffusion 1.5, an AI image tool that transformed photography and illustration in 2023. Stable Diffusion is an open source model, and it is a foundational component of thousands of the image-generating tools found across apps and websites.
Datasets are the building blocks of every AI-generated image and text. Diffusion models break the images in these datasets down into noise, learning how each image “diffuses”; from that information, the models can reassemble them. The models then abstract those formulas into categories using related captions, and that learned memory is applied to random noise so that the output does not duplicate the actual content of the training data, though such duplication sometimes happens. An AI-generated image of a child is assembled from thousands of abstractions of genuine photographs of children. In the case of Stable Diffusion and Midjourney, these images come from the LAION-5B dataset, a collection of captions and links to 2.3 billion images. If there are hundreds of images of a single child in that archive of URLs, that child could influence the outcomes of these models.
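For readers who want a concrete sense of what “learning how images diffuse” means, the sketch below shows the core training step of a denoising diffusion model in Python. It is a toy illustration under stated assumptions, not Stable Diffusion’s actual pipeline: random tensors stand in for scraped training images, and a two-layer network stands in for the U-Net, latent autoencoder, and text conditioning a real system uses. The objective, though, is the one described above: add noise to a training image and learn to predict that noise, so the model can later “reassemble” images from pure noise.

```python
# Minimal sketch of the training objective used by denoising diffusion models.
# Illustrative only: real systems condition the denoiser on the timestep and a
# text caption, and Stable Diffusion works in a compressed latent space.
import torch
import torch.nn as nn

T = 1000                                    # number of noising steps
betas = torch.linspace(1e-4, 0.02, T)       # noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

# Placeholder denoiser: predicts the noise that was added to an image.
denoiser = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 3, 3, padding=1))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

def training_step(x0: torch.Tensor) -> float:
    """One step: corrupt training images with noise, learn to predict that noise."""
    t = torch.randint(0, T, (x0.shape[0],))            # random diffusion step per image
    eps = torch.randn_like(x0)                         # the noise to be recovered
    a = alpha_bars[t].view(-1, 1, 1, 1)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * eps          # the "diffused" image
    loss = nn.functional.mse_loss(denoiser(xt), eps)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Stand-in for a batch of images scraped into a dataset like LAION.
fake_batch = torch.rand(8, 3, 32, 32)
print(training_step(fake_batch))
```

Everything such a model can later generate is distilled from the statistics of whatever batches of training images pass through a loop like this one, which is why the contents of the dataset matter so much.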
The presence of child pornography in this training data is obviously disturbing. An additional point of serious concern is the likelihood that images of children who experienced traumatic abuse are influencing the appearance of children in the resulting model’s synthetic images, even when those generated images are not remotely sexual.
The presence of this material in AI training data points to ongoing negligence in the AI data pipeline. This crisis is partly the result of who policymakers talk with and allow to define AI: too often, it is industry experts who have a vested interest in deflecting attention from the role of training data and the facts of what lies within it. As with Omelas, we each face a decision about what to do now that we know these facts.
LAION-5B as Infrastructure
LAION’s data is gathered from the Web without supervision: there is no “human in the loop.” Some companies rely on underpaid labor to “clean” this dataset for use in image generation. Previous reporting has highlighted that these workers are frequently exposed to traumatic content, including images of violence and sexual abuse. This has been known for years. In 2022, the National Center for Missing and Exploited Children identified more than 32 million images of CSAM online. The Stanford report notes that LAION’s dataset was collected from the web without any consultation with child safety experts, and was never checked against known lists of abusive content. Instead, LAION was filtered using CLIP, an automated content detection system whose designers, as Dr. Birhane points out, warned when they released it that it was not fit for such filtering purposes.
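To make concrete why CLIP-based filtering is a weak safeguard, here is a rough sketch of what such a filter can look like in practice. The model checkpoint, label text, and threshold below are assumptions made for illustration, not LAION’s actual filter. The point is that a filter like this scores each image against free-text prompts and makes a statistical guess, whereas child-safety vetting compares image hashes against lists of known, verified abusive material maintained by organizations such as NCMEC.

```python
# Rough sketch of CLIP-style content filtering (illustrative only; not LAION's
# actual pipeline). The labels and threshold here are invented for the example.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

LABELS = ["a safe, unobjectionable photo", "explicit or unsafe content"]

def clip_filter(image: Image.Image, threshold: float = 0.5) -> bool:
    """Keep an image only if CLIP assigns more probability to the 'safe' label.

    This is a statistical guess about semantic similarity, not a check against
    known, verified abuse material (which is done by hash matching against
    curated lists, e.g. those maintained by NCMEC).
    """
    inputs = processor(text=LABELS, images=image, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
    return probs[0, 0].item() >= threshold

# Hypothetical usage on a single downloaded image:
# keep = clip_filter(Image.open("example.jpg"))
```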
In my own analysis of LAION’s content — prior to the dataset’s removal — I was troubled by its inclusion of images of historical atrocities, which are abstracted into unrelated categories. Nazi soldiers are in the training data for “hero,” for example. I refer to these assemblages as “trauma collages,” noting that a single generated image could incorporate patterns learned from images of the Nazi Wehrmacht on vacation, portraits of people killed in the Holocaust, and prisoners tortured at Abu Ghraib, alongside images of scenes from the Archie Comics reboot “Riverdale” and pop culture iconography.
We have little understanding of how these images trickle into the “beautiful” illustrations these models produce, but there seems to be a failure of cultural reckoning with the fact that these are rotten ingredients.
The knowledge that workers were exposed to traumatic content has, to date, failed to mobilize the industry (or policymakers) to action — to grapple with the kinds of data being collected, and the method of collecting it. The warnings of independent researchers such as Dr. Birhane, who documented the misogynistic and racist content in LAION, failed to spur action. The concerns of artists over copyrighted material held in LAION-5B have yielded a similarly timid response from lawmakers. Had policymakers and journalists taken the concerns of artists and independent researchers seriously, the presence of even more deeply unsettling material would not have come as a surprise.
The news media is also to blame. The way we have framed artificial intelligence since the generative AI boom has been deeply flawed. Rather than understanding AI as an automated form of data analytics stripped of human supervision, we have seen countless reports on these systems’ capacities and outcomes. Pivoting our understanding of data collection and algorithms to the frame of “Generative AI” has unnecessarily severed this technology from a decade or more of scholarship into algorithmic systems and Big Data.
This pivot has created a harmful frame shift as policymakers scramble to understand this supposedly “unprecedented” technology. The reason for this error is clear: it has direct benefits to industry leaders. This year, Sam Altman, the CEO of OpenAI, was referenced twice as often as the 42 women in Time magazine’s “Top 100 list of AI influencers” combined. That list includes Dr. Birhane, whose crucial research work exploring LAION-5B has received comparatively little media and policy attention. Meanwhile, the majority of those invited to Senate Majority Leader Chuck Schumer’s (D-NY) AI “Insight Forums” represented industry, including figures such as Altman and Elon Musk.
Industry experts certainly have knowledge to offer. But they also have an interest in steering conversations away from data rights and transparency. The venture investment firm a16z recently announced that “Imposing the cost of actual or potential copyright liability on the creators of AI models will either kill or significantly hamper their development.” In other words: data isn’t worthless, but they want us to treat it that way.
Yet, artists’ calls for control over the use of their work in these datasets have largely been ignored. The resistance to opening up training data to scrutiny is hard to isolate from the presence of CSAM within it. In the two weeks since the Stanford report was published, a number of websites that had offered exploratory versions of LAION for artists and independent researchers have taken those tools down.
This makes sense: nobody wants tools that enable child abuse or provide access to these images. But it is a deep irony that the very tools that made it possible for researchers to examine and identify the training data are now offline. That means it is now effectively impossible for artists and copyright holders to see whether their work is being used to train these systems, or for researchers to understand what harmful materials are contained within them. (Another example: a report showed the dataset contained photographs of children alongside easily identifiable location data.)
In the race to sweep up as much data as possible, companies have operated in an environment that benefits from obfuscation. Last year was marked by illusions and delusions of artificial general intelligence, the promise of a sophistication that emerges from some abstract concept of “intelligence” in a dense network of on/off signals we call neural nets. There is a lack of seriousness in these conversations, a failure to connect the dots between these systems and their sources. That lack of seriousness is encouraged by the heads of the companies developing these technologies, who directly benefit from confusion about (and even fear of) what these systems are and how they work.
With industry’s goals at the center of policy framing, it’s no wonder that so much media attention has been paid to long-term theoretical risks and techno-solutionist “superalignment.” This comes at the expense of a deep focus on the real-world training data and processes that shape immediate and direct harms, such as child abuse, racist surveillance and crime “prediction,” and the capture of personal data without consent.
How Should We Frame AI?
What might greater scrutiny over datasets look like? Thiel’s team at Stanford recommends against training models on images of children altogether — especially general purpose models that blend multiple categories of images. This is both a data rights issue and a child safety issue. Addressed as a data rights issue, children’s likenesses ought to be protected from data scraping, as there is no way to anticipate future uses of their image. As a child safety issue, the possibility of reproducing a real child’s face in an AI-generated image carries real risks, especially as we see a boom in VC-backed deepfake pornography mills.
It is not enough to simply trust companies which train on these datasets to regulate themselves. It was not Stability AI, OpenAI or Midjourney that reported these findings, but independent researchers. Without searchable, open models, we might never have known. Furthermore, it is far preferable for independent researchers to be able to audit training sets than for companies to withdraw from responsible accounting by withholding access to their models.
Yet, there is a contradiction at the heart of this proposal. Open datasets such as LAION-5B are studied by researchers precisely because they are used to train AI models. If datasets are open, many fear, then all kinds of variations can be built, including models specifically designed for deepfakes, harassment, or child abuse.
The tragically overlooked 2021 paper from Dr. Birhane and her coauthors assessed this issue: “Large-scale AI models can be viewed, in the simplest case, as compressed representations of the large-scale datasets they are trained on. Under this light, it is important to ask what should be compressed within the weights of a neural network and by proxy, what is in a training dataset. Often, large neural networks trained on large datasets amortize the computational cost of development via mass deployment to millions (or even billions) of users around the world. Given the wide-scale and pervasive use of such models, it is even more important to question what information is being compressed within them and disseminated to their users.”
The paper poses the challenge to policymakers: Should images of trauma, circulated online in content deemed illegal, be permitted in research or commercial products? If we all agree they should not, then why do we allow vast copies of the internet to be incorporated into AI systems without intervention or oversight? Where should accountability be placed?
The AI Foundation Model Transparency Act, proposed by Reps. Anna Eshoo (D-CA) and Don Beyer (D-VA) just a day or so after the release of the Stanford report, seems like the beginning of a decent compromise. The bill would direct the “Federal Trade Commission to establish standards for making publicly available information about the training data and algorithms used in artificial intelligence foundation models, and for other purposes,” and would require the FTC to establish mechanisms for data transparency and reporting. This would not only give consumers and users of generative AI insight into the content of training data, but would confront generative AI companies with the demand that they understand their own training data. While this bill is focused on copyright management, it is heartening to see legal and policy precedents that place accountability where it belongs.
Accountability is not as challenging as AI companies would like us to believe. Flying a commercial airliner full of untested experimental fuel is negligence. Rules asking airlines to tell us what’s in the fuel tank do not hamper innovation. Deploying models in the public sphere without oversight is negligence, too. Artificial intelligence systems may be black boxes, but the human decisions that go into building and deploying them are crystal clear. Deploying and automating an unaccountable machine is a management and design decision. These managers and engineers should be held accountable for the consequences of building and deploying systems they can’t control.
Likewise, perhaps it is time to abandon the idea that data is nothing but ephemeral debris. Data is firmly at the heart of today’s AI, and the industry would like consumers and policymakers to ignore the thorny questions that surround it. Venture capital and big tech firms benefit when the rest of us undervalue our data. But our data, collectively, is immensely valuable. It holds value under the usual rubric of economics, but also in our social spheres. Data is the mark of our lives lived online. It can be evidence of creative expression, or trauma. If we have any hope of building ethical AI systems, we must think carefully about the ways we curate and harness these datasets. Responsible AI demands more than the vast extraction of our information. It calls for thoughtful approaches and decision-making about the archives that shape these systems’ outcomes. It demands that we ask who this data serves and who it harms.
That will require much greater engagement with interdisciplinary experts — including communities that grapple with the consequences of automated data analysis. An industry that prides itself on creative innovation should be able to contend with restrictions on toxic, illegal and violating content. It should aim to build datasets that center consent, respect, and even joy. But without accountability and engagement beyond the tech world, we will never be able to see AI through any lens other than the one industry prefers.
I would never conflate the burden these systems place on copyright holders with the trauma of abused children, and each issue related to data should be handled according to the particular response it demands. But in so many cases, the media and policy community has neglected broader engagement in its scrutiny of the data pipeline. This distorts the conceptual frameworks we use to understand and regulate AI. Artificial Intelligence systems start with data, and policy should too.
Data is a vital piece of our digital infrastructure. Like all infrastructure, it is deeply entangled with our social worlds. Too often, our technological infrastructure is accumulated, rather than designed. But it is worth making time for care and thoughtful dependencies in our digital lives. Otherwise, we risk building a future in which the pain of others is embedded through neglect. We risk building AI as Omelas was built.