Podcast: The Societal Impacts of Foundation Models, and Access to Data for Researchers

Justin Hendrix / Apr 14, 2024

Audio of this conversation is available via your favorite podcast service.

This episode features two conversations. Both relate to efforts to better understand the impact of technology on society.

In the first, we’ll hear from Sayash Kapoor, a PhD candidate at the Department of Computer Science and the Center for Information Technology Policy at Princeton University, and Rishi Bommasani, the society lead at Stanford Center for Research on Foundation Models. They are two of the authors of a recent paper titled On the Societal Impact of Open Foundation Models.

And in the second, we’ll hear from Politico Chief Technology Correspondent Mark Scott about the US-EU Trade and Technology Council (TTC) meeting, and what he’s learned about the question of access to social media platform data by interviewing over 50 stakeholders, including regulators, researchers, and platform executives.

What follows is a lightly edited transcript of the discussion.

Sayash Kapoor:

My name is Sayash Kapoor. I am a PhD candidate at the Department of Computer Science and the Center for Information Technology Policy at Princeton University.

Rishi Bommasani:

And I am Rishi Bommasani. I'm the Society Lead at Stanford Center for Research on Foundation Models.

Justin Hendrix:

You are the two chief authors of this paper "On the Societal Impact of Open Foundation Models." There are a lot of names on this paper, however. How'd you put this together?

Sayash Kapoor:

So the origin story of the paper is quite interesting. Back in September 2023, we hosted this workshop called the Workshop on Responsible and Open Foundation Models. And the overarching principle of the workshop was that the people who hosted and participated in the workshop had very differing opinions on open foundation models.

But nonetheless, all of us thought that open foundation models have a space in society, in policy, they can be useful for research and so on. And so we just came together to figure out how to release these models responsibly, what are the various policy considerations and so on.

And after the event, we realized that we had a lot to say. There were a lot of things that were unclear and so this paper was basically almost like a follow-up to the workshop, but in the sense that we got a lot of our thoughts together based on what we discussed at the workshop. And we got a lot of the people who were speakers at the workshop, organized the workshop, participated in it to come together and write this.

Justin Hendrix:

Okay. So the first thing I have to ask, of course, is how you define open in this context. Obviously, there's a gradient, as we all know, a spectrum of open versus closed when it comes to AI models. You've, I think, taken a reasonable shortcut here in presenting a binary between open and closed. What is an open foundation model for the purposes of this discussion?

Rishi Bommasani:

So as you said, there is this gradient, and I think it's useful to recognize that it's not just about the model weights, but that there are a number of assets that can be variably open or closed.

For the purposes of this paper, we defined an open foundation model as one whose weights are made widely available. This maybe not coincidentally aligns with the executive order's definition and the one that NTIA has been using on the US policy front.

And so, a side effect of that is really that we're tying it to the weights, so there's no sort of requirement, let's say, for the data or the code or anything else to be open or not open, though I'm sure that will come up. And this is not the same as open source, which is going through its own definition process, but which notably, traditionally for software, requires that there be no usage restrictions placed on open source software. In the case of open foundation models, these include models that sometimes have usage restrictions.

Justin Hendrix:

Let's talk about what you regard as the distinctive properties of open foundation models. So I think this plays into the definition somewhat.

Sayash Kapoor:

So, we thought about this quite a bit and figured out that there are five key things that drive discussions of openness. The first property is broader access. Just by definition, a foundation model being open means that its weights are widely available. And so that also means that, essentially, the weights are available for use by a wide variety of actors. You don't need to have an account or sign up for access to it; you can just download it.

And the counterpart of this definition is also that essentially it's very hard to enforce who can have access to these models. So notably, Meta with its release of the LLaMa 1 series of models tried to restrict the availability of the model weights. They had a researcher access form that people had to fill out before they could get access to the weights.

And very soon after Meta released the model, I think it was within a week of its release, the model weights were leaked on peer-to-peer torrent websites. And that also meant that Meta had essentially no ability to claw back the release of these models. So, that's another aspect of it: once these model weights are released, to some extent, the decision is irreversible.

That also leads to another of the distinctive properties, which is that you can't really monitor how people are using these models. This is in contrast to services like ChatGPT or Anthropic's Claude, where the developer has a lot of visibility into how downstream users are using these models, what they're doing with them, and where they can to some extent also enforce the usage policies very easily.

Compared to that, at least in some instances, open models are harder to enforce against. So for example, I can run Meta's base LLaMa 2 model, like the 7-billion-parameter version, on my laptop, and essentially no one has any visibility into how I'm using it. And so that also means that it's harder to restrict or moderate usage compared to closed models that might be behind a paywall or might need API access and so on.

This last point also means that we can run the models locally, and therefore we can make use of these models not just to avoid use restrictions but also to put them to beneficial ends. In a lot of cases, for example in the healthcare industry, you don't want your data to go out to OpenAI's servers.

You really do want to run all of your models and keep your data on local servers which you have strict access controls over. And open foundation models allow people to do that by default. It's not to say that people can't make deals with closed model providers to arrange restricted access, but it's just a lot simpler if it's an open model that you can simply download and run as it is, like a piece of regular software, on your local infrastructure.

And not just running these models but also customizing them is much easier as a result, because you have access to the model weights. So you can fine-tune the model, you can change how you're decoding it, you can change things in how the model is run. You can basically do all sorts of deeper customization simply because you have access to the model weights that are being used.

And so these, I think, were the five distinctive properties that we looked at in the paper, and they also give rise to a lot of the benefits and risks which we discussed in the paper. For instance, a lot of the misuse risks associated with open models arise, especially [inaudible 00:07:20], simply because you can't rescind access from problematic users.
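[Editor's note: as a brief illustration of the local-use and customization properties Kapoor describes, here is a minimal sketch of loading an open-weights model on your own hardware with the Hugging Face transformers library. The checkpoint name, prompt, and settings are illustrative assumptions, not part of the conversation or the paper.]

    # Illustrative sketch: running an open-weights model entirely on local hardware.
    # The checkpoint name and prompt below are assumptions for illustration.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "meta-llama/Llama-2-7b-hf"  # any locally available open-weights model

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    # Inference happens on the local machine; no prompts or outputs are visible
    # to the model developer, which is the monitoring point discussed above.
    inputs = tokenizer("Summarize this clinical note:", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))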

Justin Hendrix:

And there are a bunch of benefits that you go through here, everything from distributing who defines acceptable model behavior, so more people get to make that decision, to the obvious ones around increasing innovation and accelerating science. Those are a lot of the benefits that proponents of open models generally point to, along with the opportunity around transparency.

And then, potentially mitigating market concentration or the monoculture around artificial intelligence. I think a lot of folks hold out hope that perhaps there's some alternative to an AI ecosystem dominated only by a handful of companies. We'll see if that's possible.

You point to this risk assessment framework that you've put together here. I think this is good. You've got various types of misuse risks that you've categorized. How are you thinking about risk? What are the risks that are specific to open foundation models?

Rishi Bommasani:

Great question. So I think the first thing we did is we tried to survey the set of risks that people have associated with openness in the foundation model space. We identified seven categories or threat vectors that we focused on, which were those related to cybersecurity, biosecurity, disinformation, voice cloning scams, spear-phishing and scams more generally, and then finally non-consensual intimate imagery, or NCII, and child sexual abuse material, CSAM.

Those are the seven we identified. It's worth saying that there are other risks that people talk about maybe geopolitical kind of concerns or concerns about autonomous machines and so on. We didn't really consider them here and so we really focused on these seven categories.

But to do that, I think the thing we did first was to define this kind of risk assessment framework. Because in our own work in the space over the last year or year and a half, I think what has been clear is that there's plenty of debate and controversy about whether openness is good or bad, whether it will have benefits or risks, or how they compare. But on the risk side, it seems that because of the lack of a shared conceptual framework, we often talk past each other. And so that was what we wanted to do before coming back to these seven risk categories.

The framing we took is really centered on the marginal risk. And in part, this is a response to a lot of the discourse we saw over the past year, which I don't think really took into account this perspective. And so by the marginal risk, what I precisely mean is: what are the kinds of risks that open foundation models contribute relative to some baselines? And usually these baselines are either going to be existing technologies that have existed for a while, like search engines, or closed models, especially when we're trying to compare these two things.

So given that lens, then the question is how do you assess the marginal risk and how do you think about it? The first thing is we wanted to be pretty precise: let's really articulate the specific threat vector of interest and how we think harm is going to manifest, because I think sometimes this is presented in a bit of a nebulous way, and we wanted to be pretty clear that if we're going to talk about risk, we should have a clear model of what we think is harmful.

And then given that, we need to identify the two preconditions: what are the baselines absent open foundation models? So, what is the state of risk and what are the existing defenses to address that risk? And that really is where we are societally, even independent of whatever is going to happen with open models. Then, what do open models add on top of that? What kind of new risk is added? What is the evidence of that new risk? And how will society adapt to that risk, so how easy is it to defend against it?

I think this is an interesting thing because for some of the risk vectors, openness maybe contributes to new risk potentially but it's well addressed by existing defenses, whereas for others it's not really as well addressed. And then finally, across all of these things, I think one thing that's just fundamental about doing this rigorously is people are going to have different assumptions or reason about uncertainty in different ways, but we simply need to communicate about that to make sure that even if we disagree, it's clear what those sources of disagreement might be.

So that's the risk assessment framework: those kinds of steps, and then this principle of communicating uncertainty and assumptions along the way.

Justin Hendrix:

So let me ask you about one of the misuse risks that you address here. And maybe it could help us think through how to think about the specific risks of open foundation models.

One is child sexual abuse material. You look at that one across everything from threat identification to whether there are existing defenses, what the evidence of marginal risk is, etcetera.

And one of the things that I hear from other folks who are working on this issue particularly is that the real problem here is that the volume of synthetic child sexual abuse images has essentially overwhelmed the existing systems for identifying those images, hashing them, reporting them to law enforcement, investigating in some cases if there appears to be a need to find out what exactly is going on in a certain image.

How does this conceptual framework or paper that you've developed here help you think through that problem and the set of risks that come out of this particular misuse?

Sayash Kapoor:

Absolutely. So I think child sexual abuse material is one of the risks where we are already seeing evidence for the marginal risk of releasing models openly. I think models like Stable Diffusion which were openly released back in 2022 have been responsible for quite a bit of CSAM and especially AI-generated CSAM.

And unlike a lot of the other risks that we often think about or associate with AI, I think for CSAM, things are slightly different, as you mentioned. Consider the objectives of the platforms, both the social media platforms where these images might be shared, but also the entities that are investigating them. In the case of CSAM, that's the National Center for Missing and Exploited Children, or NCMEC.

So the main task assigned to both of these types of institutions is twofold. On the one hand, you want to take down this content because it's objectionable content. This violates the platform's terms of use and you probably don't want your users coming across AI-generated CSAM.

But on the other hand, they're also responsible for coordinating; NCMEC in particular is responsible for coordinating real-world investigations, because their concern is not just that objectionable content is posted online. Their concern is primarily about the well-being of children.

And so at that point, that's a very different type of concern compared to much of the discussion that often revolves around content moderation and so on, simply because they need to be able to have the operational capacity to interact with law enforcement, to help identify the children in the images, to help figure out where they live so that they can have wellness checks or whatever.

And we are already seeing this type of overwhelm come about in NCMEC's operational capacity, simply because they're basically unable to tell which of these children are fake. And so, there's the risk of misappropriating resources to children who don't even exist.

So in terms of the risk assessment framework that Rishi just shared, I think the threat identification step is very much about digitally altered CSAM. And I think digitally altered CSAM is not new. We've had to deal with it in the past, too. People had tools like Photoshop and so on.

But what's new is the ability to first scale these models rapidly. So the amount of AI-generated CSAM that exists out there can just increase at a rapid pace. But second, we also have the ability now to create fine-tuned models that can create CSAM about specific people and more generally create NCII about specific people.

And so, this is not something that's possible with Photoshop, right? There's this platform called Civitai where, late last year, it was found that the platform allows people to host bounties, and there were in fact bounties for creating deep fakes and pornographic deep fakes of real people. And so, that's a capability that never existed before. That's basically the entire source of the marginal risk.

And it's also why existing defenses, such as NCMEC's process for figuring out which children are being exploited in the real world, basically fall short. And there are ways we can take action; one of the points in the framework is the ease of defense against this marginal risk. We can take action at the downstream attack surface, things like the Civitai bounty platform, to make sure that people can't generate NCII of real people.

But at the same time, there are also ways for people to keep sharing these images despite all of these attempts. A lot of the deep fake NCII spread usually happens on end-to-end encrypted platforms. Telegram was responsible for a large amount of the spread of deep fake NCII.

In fact, a 2020 report showed that even before the use of diffusion models and generative AI, or I should say the latest wave of generative AI and diffusion models, over a hundred thousand NCII deep fakes were shared on these Telegram bots. So it does seem like we can do things to shore up our defenses, but they likely won't be enough.

Justin Hendrix:

And the two other examples that you go through in depth here, of course cybersecurity and nonconsensual intimate imagery, another place where there's just an explosion of material that is presenting a problem to those who wish to mitigate it and even those who wish to investigate it in certain contexts.

Let's talk about the recommendations. You have various calls to action here, things you want various entities to do, from developers on through to policymakers. What do you think are the most important ones?

Rishi Bommasani:

I think I would bundle these into two different categories. So one I think is about increasing information or the evidence base for reasoning about both the benefits and risks.

As you said in the beginning, there are many proponents of openness and many critics of openness. And I think at the moment, a lot of the argumentation on either side is conceptual or theoretical: we have these benefits we can point to, innovation or combating the concentration of power, and we have these misuse risks on the other side. And the evidence base, as our work shows, is maybe really not there, especially to make substantial policy decisions.

So, I think the first category is how do we improve the evidence base and how do we increase the amount of information and insight we have into what's going on? And so, on the benefits side, I think the key thing really is just understanding how open foundation models, but honestly all foundation models, are being used in the economy and in the market.

I think we know back in, say, 2020 when GPT-3 came out, of course that was early and we were just seeing the first sort of applications. But now it's what? Three, four years later. And our understanding of how the market is being impacted and where these models are being used has surprisingly not really advanced that much.

And so I think on the market surveillance side, we want to understand where the models are being used for generally beneficial applications, and then separately, do more work monitoring these kinds of malicious usage instances and just research into the marginal risk. So I would say that's one piece: it's not necessarily about taking any specific action, it's about just understanding what is even happening.

And then the second category is maybe what actions should we take and who should take them? So I think when we talk about open models, one thing that I am concerned by is that the relationship between different actors in the supply chain of who should mitigate risk or who should be responsible for things is not obvious. And it's not to say that there needs to be a particular arrangement but it should be clear how things are happening.

So for example, if a model developer releases their model openly and then some downstream application developer builds upon it, we might want at least one of these two entities, if not both, to implement some practices to make their AI applications responsible. And it's not necessarily to say it has to be the model developer or has to be the application developer. It can differ by instance. But we should be clear that somebody is doing this work, and it should be clear to those two entities what the other is going to do or has done and what they're doing.

Similarly, when we talk about governments, I think we're seeing plenty of active policy on foundation models across a variety of jurisdictions. One of the things we really want to be cognizant of is: what are the impacts on open foundation models? Because I think it is possible, and we wrote a policy brief about this last year, for policies not to name open foundation models specifically but still to have a significant negative, disproportionate impact on open models.

For example, there are some policy proposals for strict liability. That is, the liability for some kind of downstream harm that materializes propagates back up to the model developer, including through those intermediary links in the chain. And we can think through whether we think this is a good or bad policy proposal, but the thing that is clear is that this almost certainly will chill openness, because you cannot really control your liability in this regime, right? You have no way to control it, and it is unbounded.

That's a pretty strong disincentive to releasing things openly. So again, this is not to say that liability is a good or bad mechanism, but just that if we are considering this, we should understand it is going to have a disproportionate impact on open model developers and the open model ecosystem, and we should actively consider that when reasoning about this policy proposal.

Justin Hendrix:

You point in particular to a couple of different proposals that reference liability around AI, a bill from Senators Richard Blumenthal and Josh Hawley, the provisions around watermarking that are in the Biden AI executive order. You also point to provisions that have been put forward in Chinese AI policy and at the G7 Hiroshima Summit.

What do you look to out there at the moment that you think is an appropriate way to think about open foundation models in a policy context? Are there examples of proposals or legislation that you think get it right?

Sayash Kapoor:

I think we're still in the infancy when it comes to policy proposals about AI. I mean, at the risk of giving a circuitous answer, I think maybe we haven't really seen too many policies that deal with the question of openness and do it in a way that carefully considers the impact on openness.

I will say, though, that a lot of the policies aim at transparency, not really at enforcing liability or at specific legislative mechanisms to curtail openness, but merely at transparency with respect to various aspects of the foundation model development process. I think those might be much easier for open foundation model developers to comply with.

And they also serve this dual purpose of not just forcing model developers to have mechanisms for understanding the impact of their own models, but also serving the first of the two buckets that Rishi just talked about, which is improving society's understanding as a whole of what the impact of openness is, and also, more generally, what the impact of foundation models themselves is.

So here, I think a couple of proposals come to mind. The EU's AI Act has a fair number of transparency provisions, specifically when it comes to open source AI as well. Though again, it has its fair share of problems. As far as I know, the latest draft does not really define what open source means. Rishi can correct me if I'm wrong.

But also, I think the US Foundation Model Transparency Act goes a bit further in terms of what specific transparency requirements it imposes.

Rishi Bommasani:

Yeah. As I have said, I think we're definitely seeing an initial spate of policy proposals and even implemented policies, as will eventually be the case for the AI Act. But I think one of the things complicating the policy proposals, as relates to one of the points we made earlier, is that because we don't really understand how AI models are being used, whether those are open or closed or whatever, we are devising policies under an undue amount of uncertainty.

We're hypothetically reasoning about all of these different ways in which models could be used. Because the models have fairly general capabilities, in theory, they could be used for a variety of things. But the reality is they're only used in some specific ways. And if we understood those specific ways, I think we could draft and implement policies in a much sharper and more effective manner than we're currently reasoning about them.

I was pretty actively immersed in the AI Act negotiation last year, and I think it's just interesting that all of these legislators, who ultimately need to come to agreement to get the Act negotiated, are reasoning about the very same technology, often thinking about the very same companies, as having a fairly different impact.

Probably all of the legislators are wrong in the sense that none of them actually really know how the models are being used. And so, it's not that they're designing policy proposals that misalign with their intentions, it's just that they don't have the information to make these proposals. They're just very uncertain about what is happening.

And I think reducing that uncertainty will also maybe get us to proposals that are better aligned with the objectives we have for regulation or for other things in the space.

Justin Hendrix:

I assume that will happen with time. We'll see more empirical evidence. We'll see how the models are actually deployed. And we'll see what goes wrong in certain circumstances and maybe that will change the calculus that lawmakers have, but perhaps even the general calculus you've made here about the benefits versus the risks of open foundation models.

Let me ask you this. Does your paper point you towards any, I don't know, technical opportunities with regard to addressing the risks of open foundation models? Is there anything that you think that the folks who are working in that domain could do to address some of the risks with new technology?

Sayash Kapoor:

Some of the interventions that we've pointed out are definitely technical in nature. They're not always at the level of the foundation model developer. So there are some interventions that foundation model developers can take.

For example, as Rishi pointed out, one thing that's very unclear right now is how the burden of making these models responsible or whatever falls on the foundation model developer versus the downstream developer. And I think technical interventions can help in terms of maybe the developer can provide these default-aligned models that downstream users can use for certain tasks but not others.

But there are also technical interventions that fall outside the model developer themselves. One of these interventions, for instance, is at the downstream platforms or the downstream attack surface where misuse risks arise. An example is in biosecurity: a lot of the discussion around the risks of biosecurity has been about what information language models can provide to a certain malicious actor who wants to commit a bioterrorism attack or something.

And I think absent from this discussion is this analysis of how someone who has access to let's say a state-of-the-art LLM would even go about acquiring the resources, acquiring the capabilities, acquiring just the materials to create these pathogens. And I think there, we can do some more work, even technical work in order to make these types of things harder.

So, one thing which really stood out to us from last year's executive order on safe, secure and trustworthy AI was the fact that the executive order has biosecurity provisions for screening DNA synthesis tools. Basically, it has this provision, which is a technical provision in some sense, that people who manufacture this type of technology need to have screening mechanisms in place.

Now, this seems to me like a far more useful intervention, not at the level of the foundation model but at the downstream level, one that can actually lead to a huge reduction in biosecurity risk, whether from AI or not. And so, it seems like in some sense, the marginal risk analysis framework also offers this way to open up the space of interventions so they don't just fall on the foundation model developer but also on very different actors in society.

Rishi Bommasani:

Yeah. And just to build on that last point, I think these technical interventions downstream are not just underappreciated but something that in many cases the folks in those areas have been advocating for a while. For example, hardening defenses for bio. But similarly for cyber, there have been for a while many instances of identifying cases where we are vulnerable to cyber exploits, and we have not, in spite of these identifications, hardened the cyber defenses we have.

So I think one of the interesting things, maybe akin to the way the executive order has done it, is whether the discourse on foundation models and openness can be an opportunity to address some of these long-standing issues, to imbue this entire conversation with new energy that catalyzes action, even if the problem was already a problem notwithstanding the advent of AI or foundation models. And so I think there's a really interesting thing that the EO did, and I'm wondering if we can do something similar for bio and cyber in the same way.

Justin Hendrix:

What's next for the two of you? Which way are you headed with your research?

Rishi Bommasani:

One of the other things we're doing is a line of work on transparency. We've been running this kind of transparency index where we look at different foundation model developers and how transparent they are across a variety of different factors. We'll put out the next version of that next month expanding the set of developers we look at.

I think that will be pretty interesting because it has this kind of link to openness. And I think it's a link that often the two maybe get conflated as in open is the same as transparent. I think for some interpretation, that could maybe be true, but I think one of the things we really want to understand is the extent to which that is true.

For example, just because a model is released openly doesn't mean there's transparency into the data or the labor practices or the compute, which is all upstream, or into how it's being used downstream. And conversely, even if you had transparency into all of those things, the model doesn't need to be open. And so I think that's one of the interesting things: really grounding how these two things interplay with each other.

Sayash Kapoor:

And then more specifically on openness, the NTIA has this open comment process that's going on right now. So we are planning to write a response based on the paper about what the NTIA should do; I think it had a set of 50-odd questions. So that will be out very soon.

Justin Hendrix:

I appreciate the two of you taking the time to speak to me about this paper, and I appreciate all of the work that you're doing around these issues, around foundation models, around transparency. And I hope you'll come back and tell me about that next round of the transparency index when you get the opportunity.

Rishi Bommasani:

Amazing. Thank you so much for having us.

Sayash Kapoor:

Yes, absolutely. Thanks for having us.

Justin Hendrix:

Just to note that since we recorded this discussion, the NTIA has released the results of its public consultation.

Next up, I had the chance to speak last week with Mark Scott, POLITICO's Chief Technology Correspondent. We discussed his coverage of the most recent US-EU Trade and Technology Council meeting and his reporting on questions related to access to platform data for independent research, part of a project he completed as a fellow at Brown University's Information Futures Lab.

Mark Scott:

My name is Mark Scott. I'm POLITICO's Chief Tech Correspondent.

Justin Hendrix:

Mark, you're POLITICO's Chief Tech Correspondent, but you have also become Tech Policy Press's Chief TTC Correspondent, always coming on to tell us what's going on with these EU-US Trade and Technology Council meetings. Let's start there. What went on this time in Belgium?

Mark Scott:

If I am the correspondent, I'm waiting for my check, Justin, so I'll look out for it. Yeah. So, as you mentioned, this is the sixth iteration of the EU-US Trade and Technology Council, held in the mighty Leuven, which is on the outskirts of Brussels.

It was more a sort of cavalcade of patting on the back, to be honest. There was a variety of super techie and trade-focused policy announcements around quantum and sustainable green tech and all of these things we can get into. But mostly it was about Antony Blinken, Margrethe Vestager and the other officials getting together to say, "Look how far we've come under the Biden slash von der Leyen era. We have to go back to the end of the Trump administration, when the TTC first got going, and how bad things were transatlantically between Washington and Brussels."

The whole point of the Trade and Tech Council was to improve those relations. And to be fair, in the last three and a half years, they've done a pretty good job.

Justin Hendrix:

You report that there is a sort of gallows humor or perhaps the sort of looming sense that things might be really about to change.

Mark Scott:

I think you can't look past the US November election on anything. And this is not just a trade and tech issue. It goes beyond that, to climate change and the war in Ukraine, etcetera. Whoever takes over the White House next January really will set the beat of the drum, or whatever the metaphor is, in terms of transatlantic relations, global relations, western relations.

And where the polls sit, there is a possibility that Donald Trump may regain the White House, and that, at least from this side of the Atlantic where I sit, has raised many eyebrows, mostly because on trade and tech, the previous Trump administration didn't really see eye to eye with the Europeans.

Justin Hendrix:

So, what are the policymakers that gathered there going to do to try to bake in some of their agreements? What is available to them to do that?

Mark Scott:

I think a lot of this is about promises but also trying to bake in some institutionalization, if that's not too wonky a word, around what the Trade and Tech Council does. So even if the TTC disappears, the bilateral relationships, the WhatsApp groups, whatever it is between the mid-tier officials, continue.

So I think what we're going to see on, say, semiconductors is an ongoing three-year commitment to keep each other abreast of government subsidies on chips. So whatever the US does with its CHIPS Act, they'll let the Europeans know, and vice versa. Also on chips, there's going to be some sort of early warning system in case of a massive supply chain bottleneck, like we saw during COVID, so they can flag to each other what's going on.

The trade stuff, I think, is less tech-related. Although no one expects a new trade agreement between the US and EU, the idea of working on wonky things like mutual standards on electric vehicles does create a network effect where both sides can benefit. So I think that, economically, you'll see ongoing commitments from both sides.

Then the question becomes how willing Europe is to sign up to the DC view on China, the hawkish view promoted both by Biden and previously by Trump. I would suggest that at least Brussels and the European Commission are more hawkish on China than the EU member countries. So you've seen movement in that. The TTC document that was published name-checked China and Chinese foreign interference, things like skewing the hot global medical device market, which obviously is on everyone's lips.

So, a lot of this is about goodwill and hope that no matter what happens with the US November election, some of the issues, at least outside the US, can be relatively [inaudible 00:38:11].

Justin Hendrix:

Was there any discussion of TikTok in this context?

Mark Scott:

There was not. I think it's a bit ironic, and we can get into the machinations around TikTok in the US. But from the European perspective, if we're looking at foreign social media companies gathering data for potential national security reasons, you've got to look at some of the US ones and think, as much as they protest and say they don't do that, there are questions still open.

So, it's a little bit ironic, the US pushing the TikTok national security thing when, from the Europeans' perspective, there are questions around YouTube and Meta, who obviously deny any of those allegations.

Justin Hendrix:

Let me ask you about another subject, which I did have the pleasure of having you write about for Tech Policy Press this week, which was also discussed at the TTC and appears to have been the subject of at least one agreement released from the Trade and Technology Council: this question about independent researcher access to social media platform data.

You've written a piece for us titled Survey: New Laws Mandate Access to Social Media Data, But Obstacles Remain. You talked to a lot of folks out there across a lot of different domains concerned with the question of researcher access to platform data. What did you find?

Mark Scott:

Before we get into that, just on the nature of the TTC: what they agreed was an EU-US commitment to continue ongoing outside researcher access to social media. The reason why I brought this up and wrote it for you guys, under the guise of my recently finished fellowship at Brown University's Information Futures Lab, is that it's crucial and existential to what I do as a journalist.

I need to know what is going on, on TikTok, Facebook, Telegram, all these channels, because you can't hold companies to account, and frankly, from a policymaker's perspective, make good rules, if you don't know what you're doing or don't know what you're looking at.

So what we've seen, and that even goes to the European Union's Digital Services Act and other social media laws that are coming down the pike, is that many of these provisions and regulations are being made in a vacuum, because unfortunately the social media companies remain a black box.

So about three years ago, I got frustrated because Meta's CrowdTangle was becoming more difficult to use. The EU via its Digital Services Act has these mandatory data access provisions that are supposed to force companies to provide some sort of outside accountability via independent data access for researchers.

And I wanted to figure out what was going on. So, I was fortunate to sign up for Brown and its fellowship. And I spent 18 months interviewing, anonymously, regulators, policymakers, academics, public health authorities, anyone I could think of who has skin in this game, to figure out, A, how do you do it currently? And B, if I gave you a wish list, what would you want it to be?

And so, the hope, and we can get into it, is that there are some good things and there are some bad things, but I think everyone agrees that we are not in a good place right now in terms of boosting accountability and transparency. And that, for me, fundamentally begins with better access to data.

Justin Hendrix:

In this piece, you lay out the conversations you had and some of the key learnings based on the type of stakeholders. So, the first group that you look at here is regulators. Talking to regulators both in Europe and the US, what did you find?

Mark Scott:

One thing, let me just say, is that US regulators are very limited given the lack of movement on digital policy in the US. But I think if you look more broadly, not just in Europe but in other international regulatory settings, there is a growing movement to have mandated access for regulatory oversight. So they are now asking companies for specific data about how they are mitigating online hate, Russian interference, all these kinds of things that could be important.

What you find, though, is that many of these regulators have never done this before. They are very well-intentioned. They are incredibly smart, most of them. But they haven't actually got under the hood and looked at the data, the raw data, before.

And so, they don't know technically what to ask for. If you go to any regulator who's got these new powers and say, "Okay, give me the top five things you want," they go, "We think we want this, but we don't really know until we look." So, what you find is that they are making these requests to companies for specific regulatory oversight provisions, and they don't really know what they're doing yet.

And I think what we're finding, and what I found through my research, was that there are some people, what's the metaphor I'm looking for, fumbling around in the dark, thinking and hoping that they're asking for the right thing. But until we really know, it's still very much a shot in the dark.

Justin Hendrix:

Almost like an imagination problem. It's hard to even imagine what types of data or even inferences the social media platforms might have, and what you should ask for in order to get at those insights.

Mark Scott:

Totally. I think the best proxy I use is the variety of primarily Meta whistleblowers, including Frances Haugen, etcetera, who have shown what types of information and data can be collected and researched within the companies. And I think what you're now finding is regulators saying, "I kind of want a piece of that." But then, what do you ask for? How do you ask for it? Over what time period do you have jurisdiction? What about legitimate privacy and free speech implications?

All these things are a web of needs which are very difficult to meet. And I think what we're finding, at least within certain contexts, certain countries that have these new online safety regimes, is that they are still very much in first gear, if we're using a car metaphor, about what to ask for. And what they're finding is that the companies, for legitimate legal reasons, are saying no first. They don't want to give this information. And it leads to fines and potential regulatory remedies.

And so it's easy to roll out the lawyers and say no and then fight it in court. And then the regulators have to have this back and forth about, "Okay, can we have this? Can we have that? What about this?" And a lot of it is, frankly, not very quantifiable yet, and I don't think we're going to see much movement on that until we get the first iteration of these regulatory requests over the next, say, 12 to 18 months.

Justin Hendrix:

Not too many social media platforms ready to become social engineering as a service. That's not really what they're hoping to do.

Mark Scott:

I think the public policy people would say so. And to be fair, many of these companies are willing to play ball. There's a legitimate regulatory bind involved in many of these regimes, and therefore it's understandable when they go, "Do we have to? What is the limitation on this, and how can we limit our risk?"

Justin Hendrix:

One place where this is perhaps most urgent is another set of stakeholders you spoke to, public health authorities. What did they tell you?

Mark Scott:

Yeah. I think what was really interesting is that the fellowship I did was actually based in Brown's Public Health Department. So, I was embedded with a bunch of very smart public health officials and academics who gave me the sort of coalface view of what was going on.

Let's be very clear. The public health officials even now still need this information to do their work. We can suggest that COVID is gone, or hopefully it's waning in people's memory. It hasn't gone anywhere, and we might have another pandemic. And the ability to track this, the wave of information, and how to respond to it, particularly when it comes to anti-vax misinformation, is still very much front and center.

It was very clear from everyone I talked to, both the authorities and also the public health academics and experts, that they have no control or power or even insight into this, because they really haven't done it before at scale. And so they went into the COVID-19 pandemic with limited resources, limited know-how, limited expertise and, more importantly, limited connections to the companies to make the best of a bad situation.

And so, what you had was a lot of well-meaning public health authorities and academics trying to find any type of insight on social media via closed Facebook groups, crowdsourcing friends and family, things that frankly are not quantifiable or very scientific, because they didn't have the resources and they weren't thought about in terms of how they would access data via these new social media regimes.

And unfortunately, as the COVID-19 pandemic has waned, the limited resources they had are being cut back, and both the regulators with the power and the platforms with the data are really ignoring them because they're no longer seen as a crucial stakeholder in all of this.

Justin Hendrix:

That brings us back to the platforms. You've already mentioned that they of course play a big role in often putting up their hand and saying, "Wait just a minute there. Why exactly do you need this information and how might we go about putting up a roadblock to you getting it?" What do you make of them?

You spoke to some of them. It's not just legal concerns, it's also some engineering concerns. And where's the data? We don't exactly know. Our systems aren't engineered optimally to meet these types of requests.

Mark Scott:

I'll have to hold up my hand and say I'm a journalist, I'm not a data scientist, I'm not a computer scientist. I respect anyone who can code because I can't. But it was really interesting to talk to these individuals to realize that frankly inside these incredibly lucrative, apparently well-functioning organizations, the data was everywhere and it was not well-kept.

Many of these databases have been built on the fly. There were different levels of access depending on the department, and there were obviously inter-department rivalries, because there always are in any organization. And so what you had was limited. There wasn't really an index of all the data available in, say, a certain social media platform that was accessible to anybody. It was, "We need to know. Have you talked to this person? Have you talked to that person? We think this is available. What I don't have access to, this department has."

And so, it really was, to use a Briticism, a hodgepodge of databases that weren't aligned; there were no indices to understand what was going on. And more importantly, in some companies there was limited oversight about who could access that information. And in terms of data security and privacy, that obviously is a concern.

And so I think when we ask the companies, "Can you give us access," the first question we should be asking is, "Access to what and how?" I think that is a basic question that still hasn't been answered.

Justin Hendrix:

Finally, the independent researchers, the academics and civil society folks, I guess that includes journalists on some level, this kind of other category of independent experts that want to get access to these things. I feel like these are the lot that I talk to the most and the ones whose concerns are most familiar to me. But what did you learn talking to this group?

Mark Scott:

I think you're right that I fit in that category, a for-profit journalist as part of that cabal of civil society and academics. I think there are a lot of different interests out there. There are certain academics who are doing noble work, but they're going to publish reports and academic papers in three years' time based on the 2020 election cycle, which frankly is too late for immediate transparency and accountability.

There are civil society groups who know their communities, who speak the languages of their communities, who are trying to get access to direct, daily insight, because that is what's affecting their communities now. And people like me sit in between, journalists who do the accountability work, which apparently is what we're supposed to be doing. As the fourth estate, we need that too.

So there are a lot of moving parts to that. I think what you're finding is that there is an effort to corral those forces into one, mostly around, "Okay, if the Europeans are bringing this mandatory data access provision in, what does it look like? How do we access it? In what privacy-conscious way can it be done?" And I think the community is somewhat torn between: do we rely on the companies providing access, or do we do it ourselves via so-called scraping, which is going in and taking the information directly from the public-facing parts of these social media platforms and building it ourselves?

Both can work, both have limitations. But I think right now, we're in the place where that group of individuals are trying to figure out what the ask is, what are the top five priorities? And I think until we know that, it's difficult to move forward.

Justin Hendrix:

This question about accountability versus science, I've heard this come up before a little bit, how the motivations are different sometimes for say university researchers versus journalists. They think they have the same interests but often I think in practice, maybe they don't.

Part of that's timelines, part of that's just the general orientation of science. But I don't know. Did you learn anything about that tension there? I remember talking to one journalist recently. We were in a conversation between academics and journalists about how to get better insights on social media data. And the journalist said, "Can't you guys just do this stuff faster, get us some preliminary results? We want to hear the top line. We want the headline." The academic, of course, threw up their hands: "No, that's not how this goes."

Mark Scott:

The thing is, without trying to sit on the fence, both do good work. They're just different things. I am very much interested in reading the three-year deep dive of academic research into the machinations of data access for X, whatever the form may be.

But as a journalist, I want to know: why do I care now? And I think what we need to do is marry the legitimate mid- to long-term needs of the academic community to do the long-term work that is apolitical and unbiased and all the things that academic work really does, with the short-term firebreak, if you will, of the civil society groups knowing the communities and doing that work now.

And then, selfishly, journalists being able to go, "Okay, there's a bunch of elections coming up." We need to know what's going on within these social media platforms now so we can write about it, so we can hold both politicians and others to account, because that's our job. And without trying to be too high and mighty, that's what journalism is supposed to be.

So, I think there is a way to marry that short-term firebreak with the mid-term academic research, both of which are laudable, both of which need to happen. But they are different, and we need to acknowledge that they're different. And as much as these data access regimes are supposed to be universal, there is a tilt, and we can discuss why, towards academia and away from civil society.

That is a choice that is being made, and I understand why. But you've got to ask yourself, in a year of 50 elections across so many different countries, what is the immediate need? And I, again selfishly, would say I want data access now so I can figure out what's going on ahead of November, ahead of, say, the June elections in Europe. There's even an India election starting next week. These things are urgent and they can't wait three years. We need them now.

Justin Hendrix:

There are a lot of urgent questions, and yet on the other hand, I think sometimes that the science part of it is an important sort of circuit breaker on the potential of some of these online safety regimes like the DSA or the Online Safety Act in the UK. Because something else science can study is how the regulations have played out, whether they've had the intended effect, whether they have ultimately produced negative consequences or negative side effects. So, to some extent, data access should be a sort of self-correcting mechanism.

Mark Scott:

And again, you set the hypothesis and you can try to prove it or disprove it. I think you're right. Again, I don't think this is either-or. It needs to be both in terms of data access. It sounds super wonky, and I hope none of your listeners are turning off, but data access isn't about me or an academic having access so we can do fun things. It's about the fact that we need to know, as a society, what goes on within social media platforms, because they do still affect everyday life online and offline.

We don't know that right now. And with certain data access tools about to be shut down, such as Meta's CrowdTangle in August, we have a limited ability to just see what's going on. And so, the academics can provide a long-term view on whether these regimes are any good. I'm all for that, because they need to be, so I can do my work and regulators can make good policy and not just pass knee-jerk rules because they think they need to be doing something based on a limited amount of knowledge and understanding of how these platforms work.

Justin Hendrix:

Mark, the last time we spoke, you covered a Trade and Technology Council meeting. There was some mention of the very nice pralines that were off to the side of the snacks. Were there any good snacks this time around?

Mark Scott:

I mean, Belgium is known for its food. Again, any French listeners will hate this but I think Belgian food is better than French food. And again, that's a very niche topic that maybe your listeners can write in about. I didn't see any fancy snacks. There was a lot of pretty good wine being drunk I think mostly to celebrate the end of the sixth iteration of this. It wasn't as fancy as it was back in Sweden two iterations ago.

Justin Hendrix:

I'm disappointed to hear that. Maybe next time. Mark Scott, what's next for you? What's the next big project post this fellowship?

Mark Scott:

Well, I mean, if I have to drop something: I am publishing a three-part series on AI, disinformation and elections, and the first chapter drops next Tuesday. So look out for that. It's taking me to Seattle, and I'm off to Moldova in a couple of weeks, so it's going to be a fun series. And the first three stories in that sort of nine-part series drop next Tuesday.

Justin Hendrix:

Something to look forward to. Mark, thank you so much.

Mark Scott:

Thank you.
