Non-Public Data Access for Researchers: Challenges in the Draft Delegated Act of the Digital Services Act
Lukas Seiling, Ulrike Klinger, Jakob Ohme / Dec 5, 2024

On October 29, 2024, the European Commission released the draft of the delegated regulation on data access provided for in the Digital Services Act (DSA), a document long-awaited by the research community but relevant to anyone concerned with big content platforms’ transparency and accountability. Article 40.4 of the DSA, to which this Delegated Act refers, grants researchers a legal right to access non-public platform data to study systemic risks in the European Union. Public platform data, such as social media postings by political parties during election campaigns, must already be made available in real time under Article 40.12 – not that any of the platforms has so far fully complied with this provision. This Delegated Act, however, focuses solely on clarifying how non-public data access will work, with what data, and for whom. The current deadline for the submission of responses is December 10, 2024.
Numerous researchers have welcomed many of the points included in the draft. Among them are provisions that platforms must provide a data inventory, opening the black box of digital platforms; that researchers based outside the EU can apply for access; that data access is not explicitly associated with any cost; and that a broad range of data types can be requested. The draft, condensed into 16 articles and 28 recitals, meets many demands from platform researchers. However, researchers have also spotted a number of difficulties, such as the absence of options for researchers to amend access requests or to engage in mediation as a dispute settlement mechanism.
A data access one-stop-shop
Notably, the draft delegated act (DDA) introduces the “DSA data access portal” to be implemented and maintained by the Commission – digital infrastructure for the management and publication of ongoing and completed non-public data access processes. However, the fact that the Commission will be responsible for the portal’s rollout and maintenance raised a few eyebrows. After all, the Commission has yet to provide a search API for its Transparency Database. (While there seems to be ongoing work on tools for the database, the accessibility of one of the core transparency measures in the DSA remains limited for the time being.)
Putting these minor concerns aside and assuming the access portal functions as intended, let’s take a very brief look at the fundamentals of non-public data access envisioned in the DDA. The draft clearly states that researchers will be able to register on the portal to submit data access applications to the relevant Digital Services Coordinator (DSC). The DSC then has 5 days to check whether the application is complete. Afterward, the DSC must prepare a reasoned request that is sent to the data providers within 21 days. Key information about these reasoned requests will be published in the DSA data access portal. Dedicated dashboards will display all relevant reasoned requests to data providers, allowing them to submit amendment requests in case they cannot provide the data, or requests for mediation if they are unhappy with decisions taken by the DSC.
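The sequence of deadlines above can be sketched as a simple timeline calculation. The step names and durations come from the draft as summarized here; the function itself and the treatment of the deadlines as calendar days are illustrative assumptions, not part of the draft text.

```python
from datetime import date, timedelta

# Deadlines named in the draft delegated act (as summarized above).
# Treating them as plain calendar days is a simplifying assumption.
COMPLETENESS_CHECK_DAYS = 5   # DSC checks whether the application is complete
REASONED_REQUEST_DAYS = 21    # DSC prepares and sends the reasoned request

def access_timeline(submitted: date) -> dict:
    """Return the latest date for each step of the application process."""
    check_due = submitted + timedelta(days=COMPLETENESS_CHECK_DAYS)
    request_due = check_due + timedelta(days=REASONED_REQUEST_DAYS)
    return {
        "application_submitted": submitted,
        "completeness_check_due": check_due,
        "reasoned_request_due": request_due,
    }

# Example: an application submitted on January 10, 2025
timeline = access_timeline(date(2025, 1, 10))
print(timeline["reasoned_request_due"])  # 2025-02-05
```

Even this toy calculation makes one ambiguity visible: whether the 21-day clock starts at submission or only after the completeness check. The sketch follows the reading suggested by the draft’s sequencing, with the reasoned request due after the check.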
Considering further intricacies that are not discussed here but are shown in the Figure below, the attempt to streamline the overall flow through a central platform is commendable.
To lay out a number of concerns about the draft delegated act in the interest of improving its final form, let’s take a closer look at the publication of reasoned requests in the DSA data access portal. Two categories of information are to be published in order to inform other researchers about what research using non-public data could look like:
- a summary of the data access application, including a research abstract, the platform from which the data are requested, and the type of data to be shared
- the access modalities, meaning “legal, organizational and technical conditions determining access to the data requested”
The first category basically describes what data is to be shared and by whom for which research project, and the second specifies how the data is accessed.
1. The security-accessibility tradeoff
And this “how” is where things get interesting. Right now, the DDA specifies two modes of non-public data access: one in which the requested data is directly transferred to the researchers for analysis, and one in which researchers can only access and analyze the data in secure processing environments (SPEs). We can think of data transfers and SPEs as two poles of a spectrum of data access safeguards: depending on the sensitivity of the data, researchers will have to put ever more security measures in place until the data is no longer transferred at all and can only be accessed through an SPE. However, with ever-increasing safeguards, data analyses become more complex, to the point of being unfeasible.
Therefore, SPEs should be understood as a last resort for data access, as they allow for the least flexibility in research. While providers of SPEs carry the burden of guaranteeing safeguards, these measures often impose strict limitations on the research process. Meta’s Researcher Platform, for example, not only requires a VPN connection but also implements a “cleanroom” that regularly deletes all data within it, making replicable research untenable. TikTok’s Virtual Compute Environment enables the provider to closely surveil researchers, as all results from queries within the environment are reviewed.
If data is directly transferred, researchers also have to implement technical and organizational measures in addition to basic access and non-disclosure agreements. However, determining which combination of security measures is appropriate is far from straightforward since data of different sensitivity will require different safeguards. The situation is further complicated because data providers can request amendments to the reasoned requests, potentially influencing what kind of data will be available, requiring researchers and DSCs to aim for a potentially moving target when proposing the details of access modalities. Common guidance on requirements for data transfers with different degrees of sensitivity, as currently under development by the Board for Digital Services, can serve to remedy some of the existing uncertainty but will not alleviate concerns that high-security thresholds may exclude researchers from institutions without ample data engineering resources from non-public data access. Thus, an accessible regime for non-public data access requires the creation, provision, and sustained funding of shared services or infrastructures that are capable of fulfilling the necessary safeguards.
2. Data processing, not data collection
Notably, Recital 16 of the DDA mentions “other access modalities to be set up or facilitated by the data provider.” To understand what these additional modalities could be, we need to consider the foundations of empirical research. In real life, researchers are not simply granted access to data that they can then analyze but have to systematically collect data. The specific methods used for data collection vary with research questions and the analysis methods of choice.
By only considering SPEs or data transfers, the DDA currently conceptualizes the researcher as a passive recipient of data, neglecting the fact that empirical research is often predicated on the active involvement of researchers during data collection. Researchers are rarely just handed some data to analyze; they actively construct research questions and the means of testing them through data collection and analysis. This means that “other access modalities” are required to enable classical research practice, for instance researcher sandboxes that allow for forms of data collection that platforms might not engage in themselves. These should include the means to combine survey and platform data, as recently piloted by Meta, or functionality to conduct A/B tests, as proposed by the Global Partnership on Artificial Intelligence, to test the effects of different recommendation algorithms, platform functionalities, or user interface designs.
The meaning of data access thus must not be limited to data created by the platforms. The final version of the delegated act should therefore make explicit reference to access modalities that enable active research and not just passive analysis – in short: it should enable not just data processing but also, and especially, data collection.
3. Blind trust, not validity
In the case of active data collection, researchers are responsible for ensuring that the inferences they draw are valid. This form of quality assurance can be achieved through a variety of means, which in most cases include controlling or measuring factors that might influence the results and would thus be essential to consider for any subsequent analysis.
In the case of passive data reception or data transfer, researchers have limited options to assess the quality of the data and ensure the validity of their analysis. In the case of public data, the data received can be validated against what is publicly available through other means, such as scraping data (TikTok’s API was shown to provide inaccurate metrics for views, likes, comments, and shares until recently). But quality assurance for non-public data is a whole different ball game as there exists no external standard for validation. This leaves but one option: blindly trusting that the platforms deliver or give access to adequate and complete data and that the conditions under which the data is collected do not change without the researcher's knowledge.
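For public data, the kind of validation described above can be as simple as comparing metrics reported through an official API against independently collected values. The sketch below is purely illustrative: the metric names, the example figures, and the 5% tolerance threshold are all hypothetical choices for the example, not values from any platform or from the draft.

```python
def flag_metric_discrepancies(reported: dict, scraped: dict,
                              tolerance: float = 0.05) -> dict:
    """Compare platform-reported metrics against independently scraped values.

    Returns the relative deviation for every metric that differs from its
    external reference by more than `tolerance` (5% by default, chosen
    purely for illustration).
    """
    flagged = {}
    for metric, reported_value in reported.items():
        scraped_value = scraped.get(metric)
        if scraped_value is None or scraped_value == 0:
            continue  # no external reference available for this metric
        deviation = abs(reported_value - scraped_value) / scraped_value
        if deviation > tolerance:
            flagged[metric] = deviation
    return flagged

# Hypothetical post metrics: API-reported vs. independently scraped
api_metrics = {"views": 9000, "likes": 120, "comments": 35}
scraped_metrics = {"views": 12000, "likes": 118, "comments": 35}
print(flag_metric_discrepancies(api_metrics, scraped_metrics))
# {'views': 0.25}
```

The point of the sketch is exactly the one the paragraph makes: this check only works because an external reference (here, scraped values) exists. For non-public data, `scraped` is empty by definition, and no discrepancy can ever be flagged.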
However, trust can be abused. During a study that used data provided by Facebook to examine how the platform’s recommender algorithms affected users’ exposure to misinformation and political information, Facebook failed to disclose that it had introduced changes to said algorithms that likely affected the study’s results. This tampering was revealed more than a year after the original study was published, through an analysis of a public data set that Facebook no longer updates but which, luckily, still covered the period of data collection. Could we have found out otherwise? We simply cannot know.
The current draft of the delegated act makes such tampering easy for the platforms, as it requires a summary with a research abstract to be sent to the platform, unveiling the researcher’s intent and giving the platforms ample opportunity to adjust accordingly – without having to disclose any changes to data or algorithms made in response to the access request. Most research projects requesting data access will study the systemic risks these very platforms (and activities on them) pose to the European Union. At the same time, the DDA gives no consideration to the case in which the data provided is not fit for use in the research project because it fails to meet quality expectations.
4. The open question of quality control
While proclaiming scientific work a key part of risk governance, the DSA does not seem particularly interested in ensuring that the research produced for that purpose is valid and reliable. Admittedly, the European Commission has started multiple investigations into platform compliance, all of which make reference to researcher data access – but the Commission needs an initial suspicion to open proceedings. As it currently stands, this suspicion seems to rely exclusively on researchers conducting quality checks of the access process itself and of the data received.
Currently, there is no indication of plans for a structural approach to quality assurance for research data access, which shifts all responsibility onto individual researchers. As long as no institution volunteers to establish procedures and metrics that hold all relevant data providers to a minimal standard of data access quality, researchers will likely have to take the data they receive at face value. Individual researchers cannot be expected to validate all accessed data on their own.
For now, this means that we cannot confidently claim that research based on the current state of platform data access will be reliable, which does not bode well for the DSA’s risk governance scheme.
Effective risk governance: Under construction
To conclude: The draft delegated act introduces many commendable clarifications and structures that are likely to have a beneficial effect on researcher data access. However, it suffers from a limited conception of empirical research, which construes researchers as passive recipients of data. It also lacks awareness of the fact that data can be inaccurately reported or tampered with.
We hope this article has not just outlined the need for changes to the draft text (which can still be proposed as part of the ongoing feedback process) but also shown that the creation of shared services or infrastructures is necessary to ensure (1) that researchers without ample resources can access non-public data and (2) that the data accessed conforms to at least a minimal standard of quality.
A working data access regime will require funding, collaboration, and a lot of effort. Once it is established and up and running, however, it promises not just to lay the foundation for a deep understanding of the effects of digital technologies on societies and democracy but to enable effective governance based on empirically sound evidence. Regulators should realize that these goals can only be met through accessible and reliable data access and should thus ensure the necessary conditions.