OpenAI's Sora Is Here. There Is Still Time To Prepare For The Threat Such Technologies Pose
Sam Gregory / Dec 12, 2024
Sam Gregory is Executive Director of the global human rights organization WITNESS.
Earlier this year, OpenAI announced Sora, “an AI model that can create realistic and imaginative scenes from text instructions.” This week, the product officially launched. Paying users can generate videos in 1080p resolution, up to 20 seconds long. With a mixture of excitement (as someone deeply rooted in the expressive and documentary world of video) and fear (as someone grappling with the realities of realistic video deception), I've been reading the System Card and looking at initial technical specs and examples shared by early users.
Here are a few initial reflections, drawn from our experience with the ‘Prepare, Don’t Panic’ initiative at WITNESS focused on multimodal generative AI, on the risks of deception with generative video. In what ways could products like Sora – one of a range of text- and image-to-video tools emerging now – be used to undermine frontline journalism and human rights work if not designed, distributed, or used inclusively and responsibly?
1. Direct fakery and the ‘mis-contextualized real’ both matter.
The first observation is perhaps the most obvious. Realistic videos of fictitious events, and especially fictionalized context or additions to real events, align with existing patterns of sharing shallowfake videos and images (e.g., mis-contextualized or lightly edited videos transposed from one date, time, or place to another, or staged events), where the exact details don't matter as long as they are a convincing enough fit with assumptions. Although Sora doesn't yet do the physics of humans all that well (this would not be a tool to generate particularly believable protest footage, for instance) and doesn't yet allow most users to use real people as prompts, it's worth remembering that this is the worst such technology will ever be, and it's unclear whether the restriction on real people will last.
2. Context is key, and yet establishing the time, date, integrity, and surrounding context of media is hard even now.
Evaluation of the trustworthiness of content relies on contextual knowledge – of the genre, the maker, the material’s origins, and of other important context signals before and after the footage was filmed. It's why open source intelligence (OSINT) work in human rights and journalism looks for multiple sources and why media literacy approaches like SIFT encourage us to 'Investigate the Source', 'Find Alternative Coverage' and 'Trace the Original.'
We have tools for direct reverse image search (and, via frame search, for video), though notably not for audio, and we have emerging provenance approaches that help the public and analysts discover the context of a video, such as the C2PA standard for cryptographically-signed metadata that tracks the 'recipe' of AI and human input in a piece of content. Companies like Google have also been exploring how to better indicate context on content in search. In this light, it's good to see that OpenAI is investing in both provenance approaches – including C2PA signals in generated videos – and an internal-only reverse search engine for its own synthetic video content. While not complete, these efforts are an important start.
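To make the provenance and reverse-search ideas above more concrete, here is a minimal, illustrative sketch in Python of what an analyst-side check might look like. It assumes the open-source c2patool CLI (for reading C2PA manifests) and the opencv-python, Pillow, and imagehash packages are installed; it is not OpenAI's internal reverse search engine, and the file name is hypothetical.

```python
"""Illustrative sketch: check a video for C2PA provenance metadata and
compute per-frame perceptual hashes that a reverse 'frame search' index
could match against. Not OpenAI's internal tooling."""
import json
import shutil
import subprocess

import cv2
import imagehash
from PIL import Image


def read_c2pa_manifest(path: str):
    """Return the C2PA manifest report if c2patool finds one, else None.

    Assumes c2patool, which prints a JSON manifest report by default,
    is on the PATH.
    """
    if shutil.which("c2patool") is None:
        return None  # c2patool not installed
    result = subprocess.run(["c2patool", path], capture_output=True, text=True)
    if result.returncode != 0:
        return None  # no manifest, or unsupported file
    try:
        return json.loads(result.stdout)
    except json.JSONDecodeError:
        return None


def frame_phashes(path: str, every_n_frames: int = 30):
    """Yield (frame_index, perceptual hash) pairs for sampled frames."""
    cap = cv2.VideoCapture(path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n_frames == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            yield index, imagehash.phash(Image.fromarray(rgb))
        index += 1
    cap.release()


if __name__ == "__main__":
    video = "example.mp4"  # hypothetical input file
    manifest = read_c2pa_manifest(video)
    print("C2PA manifest found" if manifest else "No C2PA manifest (or tool missing)")
    for idx, h in frame_phashes(video):
        # In a real system these hashes would be looked up in an index of
        # known AI-generated or previously published footage.
        print(idx, str(h))
```

The sketch only shows how provenance signals can be read and how frame-level fingerprints are produced; matching those fingerprints against an index of known footage is the part that remains unavailable to most frontline investigators.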
However, other Sora functionalities directly compromise or confuse the broader ability to understand context: for example, Sora can extend video "forward" in time from an existing source image prompt (essentially out-painting for video), which raises new complexities around falsified context.
3. Not all real people are equally protected.
So far, Sora won’t generate video from seed images of real people except for a subset of users, but OpenAI indicates a likely roll-out beyond that. For the last eighteen months, WITNESS has been running a rapid response mechanism for suspected deceptive AI in the wild, the Deepfakes Rapid Response Force. One lesson from that experience, and from broader observation of deepfakery, is that depicting real people in synthetic media is complicated: it’s not just prominent public figures who are targeted by deceptive AI content and synthetic non-consensual or intimate images, but also private individuals and lower-level public figures who lack the prominence to be tagged for protection within a model. And there is no broad agreement on when, if ever, hyper-realistic satirical synthetic media targeting individuals is acceptable, or on how to moderate it.
4. Deceptive style matters.
WITNESS previously flagged to OpenAI (noting we were not invited to red-team) that one way deceptive content works is by tapping into style heuristics – e.g., shaky handheld footage is a heuristic for credible user-generated content (UGC). The System Card primarily focuses on red-teaming for violative content, which is important, but it says less about content that confuses or changes context. There is some interesting reference to addressing misuse through model and system mitigations and using "classifiers to flag style or filter techniques that could produce misleading videos in the context of elections, thus reducing the risk of real-world misuse," though elections are, of course, just one scenario for potential political mischief.
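OpenAI does not describe how these classifiers work. Purely as a hypothetical illustration of what "flagging style" could mean at the prompt level, the sketch below (in Python, with an invented cue list and function names) flags generation prompts that request stylistic markers commonly used to make synthetic footage read as credible eyewitness or official material.

```python
"""Hypothetical illustration of prompt-level style flagging; not OpenAI's
classifier. The cue list and thresholds are invented for illustration."""
import re

# Style cues that lend unearned credibility to generated video.
STYLE_CUES = [
    r"shaky (hand-?held|phone) (footage|video)",
    r"body-? ?cam(era)?",
    r"cctv|security camera|surveillance footage",
    r"breaking news|news broadcast|on-the-ground report",
    r"leaked (footage|video)",
]


def flag_style_cues(prompt: str) -> list[str]:
    """Return the credibility-style cues matched in a generation prompt."""
    lowered = prompt.lower()
    return [pattern for pattern in STYLE_CUES if re.search(pattern, lowered)]


if __name__ == "__main__":
    prompt = "shaky handheld footage of a crowd outside a polling station"
    matches = flag_style_cues(prompt)
    if matches:
        print("Prompt requests credibility-style cues:", matches)
        # A real system would combine this signal with context (e.g., an
        # election period, named people or places) before blocking or labeling.
```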
All these questions come against a backdrop of public uncertainty about how AI is being used in the content people engage with, and of fundamental global gaps in access to detection tools that inequitably disadvantage frontline journalists and civil society.
With Sora and other AI video generation tools already far advanced from their ‘Will Smith Eating Spaghetti’ stage, and with potentially significant further progress in realism, length, and ubiquity on the horizon, we need renewed and inclusive attention to safeguards. We need to ensure that companies developing these generative systems advance creativity and communication rather than deception and harm. That means establishing concrete measures to ensure we understand when these systems are being used, to prevent their harmful uses, and to equip journalists, civil society, and the public with tools to spot deceptive synthetic video in the wild.