Jonathan Stray, Senior Scientist at The Center for Human-Compatible Artificial Intelligence (CHAI), Berkeley and Brandie Nonnecke, founding director of the CITRIS Policy Lab at UC Berkeley, provide comment to the European Commission on researcher access to platform data under the Digital Services Act.
The recently passed EU Digital Services Act (DSA) includes a provision for external researchers to request access to internal platform data, for the purpose of evaluating certain systemic risks of very large online platforms (including illegal content, threats to elections, effects on mental health, and more). The Act says that user privacy, trade secrets, and data security must be respected, but it doesn’t say how.
The European Commission invited public comment to determine how best to administer researchers’ access. This comment builds upon our UC Berkeley submission, further detailing an approach to enable researcher data access which is simple and powerful, yet protects the rights of users and platforms. It is based on a straightforward idea: send the researcher’s analysis code to the platform data, rather than sending platform data to researchers. The process would work like this:
- Platforms publish synthetic data sets — fake data with the same format as the real data
- Researchers develop their query and analysis code on this synthetic data, then submit their code to the platform for execution
- The query can perform arbitrarily complex analysis but returns only aggregated results to the researcher
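The three-step process above can be sketched in a few lines. This is a minimal illustration, not a real platform API: the names `run_remote_query` and `PLATFORM_DATA` are hypothetical, and the "aggregation check" is deliberately crude.

```python
# Illustrative sketch of remote query execution: researcher code runs on the
# platform side, and only an aggregate leaves. All names are hypothetical.

PLATFORM_DATA = [  # stands in for the platform's private store
    {"user": "u1", "minutes_online": 120},
    {"user": "u2", "minutes_online": 45},
    {"user": "u3", "minutes_online": 300},
]

def run_remote_query(query_fn):
    """Execute researcher-supplied code on the platform; only the
    (aggregated) return value is sent back to the researcher."""
    result = query_fn(PLATFORM_DATA)
    if not isinstance(result, (int, float)):  # crude aggregation check
        raise ValueError("query must return an aggregate, not raw records")
    return result

# Researcher's analysis code, developed and debugged against synthetic data:
def mean_minutes(rows):
    return sum(r["minutes_online"] for r in rows) / len(rows)

print(run_remote_query(mean_minutes))  # 155.0
```

The key property is that `query_fn` may be arbitrarily complex internally, but the platform controls what crosses the boundary back to the researcher.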
There is no standard name for this data access strategy, even though it has been used in many contexts. In data science there is a general preference to “move the code to the data” rather than vice versa because it is difficult and expensive to move large data sets. Aggregation is a standard method to protect user privacy — though it is imperfect, it is well-studied. The Hathi Trust, which oversees a huge collection of copyrighted texts, has for years supported a remote aggregation approach under the name of “non-consumptive research.” We will just call this paradigm remote query execution.
API Access vs Remote Queries
We are not suggesting that this should be the only way researchers can access data. The DSA Article 40.12 also requires platforms to make publicly accessible data available through an API (the so-called "Crowdtangle provision"). API access is also a good way to provide certain kinds of non-public data to vetted researchers. But such APIs will necessarily be limited in the analyses they support; for anything beyond that, it will be necessary to download the entire dataset and run the analysis locally. This creates privacy, security, and cost challenges.
As an alternative, remote query execution is a good choice when:
- The research requires data that raises serious privacy and security concerns, such as detailed personal information on many users; or
- The analysis is non-standard and not supported by a general purpose API.
It’s a bad choice when:
- The research involves data that can be provided without serious privacy or security concerns, such as already public data;
- The analyses to be performed are standard (e.g. averages and other descriptive statistics) and might be implemented in a general purpose API; or
- The entire dataset must be processed locally for some other reason.
For research that goes beyond the capabilities of API access — either in terms of security concerns or analytic complexity — we advocate for supporting remote query execution as the default method. This will require capacity building among researchers, platforms, and regulators. Such an investment would enable remarkably flexible "self-serve" platform data analysis. Remote query execution sits in a sweet spot between security, privacy, cost, and capability.
Remote Query Processes and Infrastructure
The influential European Digital Media Observatory working group report on researcher data access (EDMO report) was not optimistic about remote query access. We believe this may be based, in part, on misunderstandings (see endnote). Here's why remote query execution is valuable, and what infrastructure would be required to make it a viable mechanism for researcher access to platform data under the DSA.
The platform policy community generally underestimates the security challenges of transferring bulk platform data to academic institutions. Such datasets, especially granular data on individual users, are immensely valuable for both commercial and political purposes. If such data were to be held by a university it would be an enticing target in a well-known location. In short, it will attract the most sophisticated global attackers. Platforms spend huge sums on security measures; academic institutions generally have neither the experience nor the resources to counter advanced persistent threats. The best defense is simply not to store the data.
Synthetic data provides a simple and secure way for researchers to test their queries. One of the major objections to the remote query approach is that it is nearly impossible to write a correct database query without testing it. The EDMO report recommends that platforms provide data “codebooks,” human-readable documents that describe the fields and format of available data. Synthetic data — widely used in medical research — takes this a step further by publishing fake data for each field, matching the format and perhaps the distribution of the real data. Researchers can validate their queries by running them on this structurally-equivalent (and sometimes statistically equivalent) data.
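One way to picture this: a platform publishes a schema alongside a generator for fake rows that match it. The sketch below is purely illustrative — the field names and distributions are hypothetical, not any real platform's schema.

```python
# Hedged sketch of synthetic-data generation: fake rows matching the format
# (and roughly the distribution) of a hypothetical platform schema.
import random
import string

SCHEMA = {  # illustrative field generators, not a real platform schema
    "user_id": lambda: "u_" + "".join(random.choices(string.digits, k=8)),
    "age": lambda: random.randint(13, 80),
    "country": lambda: random.choice(["DE", "FR", "IT", "ES", "PL"]),
    "minutes_online": lambda: round(random.lognormvariate(3.5, 1.0), 1),
}

def synthetic_rows(n, seed=0):
    """Generate n fake rows with the same fields and types as the real data."""
    random.seed(seed)
    return [{field: gen() for field, gen in SCHEMA.items()} for _ in range(n)]

rows = synthetic_rows(1000)
# Researchers debug their queries against these structurally equivalent rows
# before submitting the same code to run on the real data.
```

A query that runs correctly on `rows` will at least parse and type-check against the real data, which removes the main practical objection to writing queries blind.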
Virtual machines can enable full-featured query development. Researchers can log into a terminal or remote desktop and use a machine connected to platform data within a secure environment. Although it will never be possible to completely prevent the export of private or confidential information, that isn't necessary as long as access is limited to vetted researchers. In any case, the output can be limited in size to prevent bulk copying, and both commands and output are auditable. With appropriate virtual machine configuration, researchers can develop queries using a fully-featured development environment, including installing their own tools and uploading additional public data, and then request manual review for final queries which may exceed export limits. The Hathi Trust digital library already uses this approach to enable external full-text analysis.
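Two of the controls mentioned above — capping output size to prevent bulk copying, and logging commands and outputs for audit — can be sketched as a thin wrapper around query execution. The limit, names, and API here are all assumptions for illustration.

```python
# Illustrative sketch of export limits and audit logging inside a secure
# environment. MAX_OUTPUT_BYTES and guarded_execute are hypothetical.

MAX_OUTPUT_BYTES = 10_000
audit_log = []

def guarded_execute(command, fn):
    """Run a query inside the secure environment, recording both the
    command and the size of its output for later audit."""
    output = fn()
    record = {"command": command, "output_bytes": len(repr(output).encode())}
    audit_log.append(record)  # both command and output metadata are auditable
    if record["output_bytes"] > MAX_OUTPUT_BYTES:
        # Oversized results are held back pending manual review, not exported.
        raise PermissionError("output exceeds export limit; request manual review")
    return output

summary = guarded_execute("count_users", lambda: {"n_users": 1_234})
```

Small aggregates pass through automatically; anything larger is held for the kind of manual review described above.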
Data aggregation provides built-in privacy. Certain public-interest research questions are best answered by running extremely granular analyses of private data. For example, a mental health study might compare user outcomes with exposure to content and subsequent interactions, over the span of months or years. This would require transferring essentially the entire on-platform history of all involved users to an external researcher, creating massive privacy and security concerns (to say nothing of the cost of physically moving and storing such a dataset). Yet what the researcher probably really wants is just overall correlations or effect sizes, hence the relevant queries need only produce summary results. These aggregated results may or may not be adequately privacy protecting (as studied in differential privacy) and may still need secure handling, but they certainly contain far less sensitive information than the original data.
A query provides a single checkpoint to evaluate privacy, security, and confidentiality. The DSA Article 40.5 says that platforms can deny a researcher data request due to security or confidentiality concerns, and the EDMO report therefore recommends the creation of an independent body to review requests with respect to privacy, security, and confidentiality. But what exactly will such a body review? A human-language description of the data requested must eventually be translated into a dataset by platform engineers. Instead, an independent body could review a) the actual query to be executed, which is both complete and precise, and then b) the output from this query given to researchers, which (because it is aggregated) will be short, archivable, and generally amenable to oversight.
Remote query execution supports arbitrarily complex analyses. Some analyses will depend on data that must be extracted or inferred from what platforms actually record. For example, platforms typically try to avoid collecting or inferring race or political opinions because of privacy concerns. To perform a study of discrimination these characteristics must be inferred, and researchers would presumably prefer to control the inference method by supplying the exact statistical code to be executed. Similarly, a study of the propagation of “political” content requires an operationalizable definition of “political” which may in practice be a machine learning classifier. There is nothing in principle preventing the execution of a complex researcher-provided classifier on platform computers.
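As a toy version of that idea, the sketch below stands a trivial keyword matcher in for a researcher-supplied machine learning classifier. It runs over every post platform-side but exports only the aggregate prevalence; the keyword list and function names are invented for illustration.

```python
# Sketch: a researcher-supplied classifier (here a trivial keyword model,
# standing in for a real ML model) executed platform-side. Only the
# aggregate prevalence is returned; the posts themselves never leave.

POLITICAL_KEYWORDS = {"election", "parliament", "vote", "minister"}

def is_political(text):
    """Researcher's operationalized definition of 'political' content."""
    return bool(set(text.lower().split()) & POLITICAL_KEYWORDS)

def prevalence(posts, classifier):
    """Run the classifier over every post but export only a single fraction."""
    return sum(classifier(p) for p in posts) / len(posts)

posts = [  # stands in for the platform's private content store
    "Cats are great",
    "Vote in the election on Sunday",
    "New parliament session today",
    "Lunch recipes",
]
print(prevalence(posts, is_political))  # 0.5
```

Swapping `is_political` for a full statistical model changes nothing about the access pattern: the classifier's per-item judgments stay inside the platform, and only the summary crosses the boundary.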
A Flexible, Feasible, and Secure Approach
Remote query execution, enabled with synthetic datasets, secure virtual machines, and query auditing, provides a comprehensive and secure set of tools for external researchers to develop and test their platform data analyses. It is likely also far simpler, cheaper, and safer than arranging for the transfer and secure storage of private user information. It should be developed as the default non-API mode of external researcher access to platform data contemplated under the EU Digital Services Act.
Endnote: the EDMO working group report was not optimistic about remote query access, concluding that a “non-consumptive approach would seriously jeopardize both the independence and integrity of the scientific research envisioned.” We believe they based this conclusion, in part, on a misunderstanding of the Hathi Trust’s capabilities. EDMO wrote that Hathi “does not permit the researchers to work within the secure environment to conduct and review the analyses” whereas Hathi provides each user with a Linux virtual machine which can be configured with “any other data or tools the researcher plans to use.” EDMO also wrote, “nor can researchers view the underlying data” while Hathi says that “researchers can request for their Research Capsules to have full-corpus access.” The working group may have been concerned to avoid an outcome where “non-consumptive” access ended up being the only kind available. Certainly without the synthetic data, virtual machine, and query auditing approaches above, remote query execution would be a much less useful method of platform data research.