Who Authors the Internet? Analyzing Gender Diversity in ChatGPT-3 Training Material
Jessica Kuntz, Elise Silva / Oct 9, 2023
Jessica Kuntz is the Policy Director for the University of Pittsburgh Institute for Cyber Law, Policy, and Security. Dr. Elise Silva is a Postdoctoral Fellow at the University of Pittsburgh, where she studies information ecosystems at Pitt Cyber’s Disinformation Lab.
"Thanks to Barbie,” the narrator cheerfully proclaims, “all problems of feminism have been solved!” Except, as her audience well knows (and Barbie realizes in due time) – they’re not. The summer blockbuster struck a collective nerve because, for all the social progress, the biases, inequities, and indignities of being a woman in the United States remain frustratingly entrenched.
From AI avatar apps that churn out sexualized or nude pictures of female users to text-to-image models that adopt a male default when instructed to produce images of engineers and scientists, AI has a funny way of revealing the social biases we thought we’d fixed. Women currently in the workforce were raised under the banner that we could be whatever we wanted to be. But the subtly pernicious outputs of the AI models poised to impact hiring decisions, medical research, and cultural narratives illuminate the durability of gender stereotypes and glass ceilings.
It was with this in mind that we wanted to dig into the training data of large language models (LLMs), with particular attention to the authorship of those texts. What percentage of the training materials, we wondered, are authored by women?
Sociolinguists have documented distinct gender-based differences in writing and speech patterns: women use more pronouns in what researchers describe as an involved style of writing, in contrast to men’s more informational style, characterized by more specifiers and common nouns. Unsurprisingly, gender also impacts the lens through which each of us experiences the world and what aspects we choose to document. If, as we suspected, authorship of LLM training data slants male, is it any surprise that AI models remain biased against female job applicants and reproduce gendered concepts?
We estimated that just over a quarter – 26.5% – of ChatGPT-3 training data was authored by women. In our paper, we worked from OpenAI’s disclosure that the model was trained on filtered data from Common Crawl (an open-source resource that markets itself as “a copy of the web”), scrapings of undisclosed e-books and other books, Wikipedia, and upvoted Reddit links. We’ll be the first to acknowledge the limitations of our methodology, operating with very limited corporate disclosure. We found ourselves having to make repeated assumptions about the representativeness of small snapshots of training data and vaguely educated guesses about the true contents of the training corpora – encapsulating the challenge of conducting research inside the LLM black box.
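For readers curious about the mechanics, the estimate boils down to a corpus-weighted average: approximate the women-authored fraction of each training source, then weight each fraction by that source’s share of the overall training mix. The sketch below illustrates only that arithmetic; the source weights and authorship fractions shown are hypothetical placeholders, not the figures from our paper or from OpenAI’s disclosures.

```python
# Minimal sketch of a corpus-weighted estimate of women-authored content.
# All numbers below are illustrative placeholders (assumptions), not the
# actual values reported in our analysis or disclosed by OpenAI.

# Assumed share each source contributes to the training mix.
source_weights = {
    "common_crawl": 0.60,
    "reddit_links": 0.22,
    "books": 0.15,
    "wikipedia": 0.03,
}

# Assumed fraction of documents in each source authored by women,
# e.g. inferred from a small hand-labeled or name-matched sample.
female_authorship = {
    "common_crawl": 0.25,
    "reddit_links": 0.20,
    "books": 0.35,
    "wikipedia": 0.15,
}

def weighted_female_share(weights: dict, shares: dict) -> float:
    """Weight each source's female-authorship rate by its share of the corpus."""
    total = sum(weights.values())
    return sum(weights[s] * shares[s] for s in weights) / total

if __name__ == "__main__":
    estimate = weighted_female_share(source_weights, female_authorship)
    print(f"Estimated women-authored share of training data: {estimate:.1%}")
```

The hard part, of course, is not the weighted average but filling in those dictionaries: the per-source shares and authorship rates must themselves be reconstructed from partial disclosures and small samples, which is where most of the uncertainty in our estimate lives.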
Acknowledging the limitations imposed by the lack of industry transparency, our findings indicate a significant underrepresentation of female voices and perspectives in AI training data. This is the latest manifestation of a long history of minimizing women in data collection and analysis – often unintentional, but harmful nevertheless. The longstanding gender data gap is both cause and consequence of a social “male unless proven otherwise” default; one that is now becoming entrenched and obscured within AI models.
Developers will likely respond that they are aware of model biases and are working to correct them. Recognition of the problem is a welcome step, but these corrections have thus far occurred at the fine-tuning level, only after journalists or researchers expose the biased outputs. This approach is emblematic of the industry assumption that “toxicity and bias contained in the pre-training data can be sufficiently contained via fine-tuning, turning LLMs from unsupervised monsters into helpful assistants.” By failing to address the source of much of the bias, these Band-Aid solutions allow bias to persist in ways yet undiagnosed.
LLMs currently have a seemingly endless appetite for data, with each model trained on more data and built with more parameters than the last. Our findings on the underrepresentation of women-authored content in the training data, and the resulting biased outputs, prompt us to urge a more intentional approach toward training data selection – one that places emphasis on data quality and representativeness. Training LLMs on a diverse and representative dataset – across gender, but also race and socioeconomic status – won’t resolve every instance of bias, but it would represent a meaningful step in debiasing these models. LLMs mirror the inputs they are fed; it is imperative that we select inputs reflecting the sentiments and diversity of perspectives that we wish to see reflected in the outputs.
Our experience also serves as a case study of what is lost when disclosure and documentation of AI training data is lacking. The lack of transparency surrounding training data makes analyses like ours exceedingly difficult, thus barring researchers, policymakers, and members of the public from fully understanding how these models work. Just as we cannot quantify with precision the percentage of ChatGPT training data authored by women, we also can’t know how much of the training data contains misogynistic, white supremacist, pornographic, or conspiratorial content. Developers assert that they filtered out problematic text, but there is no way of reviewing what got through their filters or assessing the impact of its inclusion. Thoughtful, intentional selection of representative training data is essential – but without greater transparency, we have no way to verify it.
Before she leaves Barbieland, Barbie asserts: “we fixed everything in the real world so all women are happy and powerful.” AI won’t deliver us to that utopia, but we can and should design it so that it doesn’t manifest and further entrench societal biases and gender stereotypes. Properly designed, AI models could be a tool to level the playing field and break down pernicious stereotypes. Representative training data – and transparency into that data – is integral to that effort.