Data Sharing and the Delegated Act of Europe's DSA

Luca Belli / Dec 11, 2024

While these days most of the focus in Brussels is on the EU GPAI Code of Practice, something else also recently happened that is going to shape our digital future. The European Commission released a draft for the Delegated Act of Article 40 of the Digital Services Act (DSA) for public comment. I believe that this is going to be one of the most impactful transparency requirements for online service providers, and we should pay attention to it. Of course, I am also biased since I have worked with the Commission as an external expert on this project.

The Digital Services Act

The Digital Services Act is the first comprehensive regulation for online platforms, and it became law in 2022 together with the Digital Markets Act (DMA). Unlike GDPR—which impacts every organization in the same way, regardless of size or scope—the DSA has special regulatory requirements for what are called Very Large Online Platforms (VLOPs) and Very Large Online Search Engines (VLOSEs). Those are defined as services with more than 45 million average monthly active users in the European Union, and they are designated by the Commission itself. The first list of VLOPs and VLOSEs was released in April 2023, and in the latest update, 25 services are listed.

While there are many transparency requirements under the DSA for VLOPs and VLOSEs, I am going to focus on Article 40. This article mandates data sharing with external vetted researchers (roughly speaking, those free of commercial interests and other forms of conflict of interest) to study the systemic risks to the EU. Established by Article 40.13 of the DSA, this Delegated Act lays down some concrete details on how such sharing will work in practice.

Data Sharing and Why It Matters

Before looking at the draft of the Delegated Act itself, a quick review of the two kinds of data sharing requirements. Under Article 40.12, VLOPs and VLOSEs have to share—for free—publicly accessible data with third parties such as non-profits and civil society organizations. If a researcher is also considered “vetted” (as defined by Article 40.8), they can request special data access that also includes non-public data.

I want to reflect for a moment on the magnitude of these requirements, especially the latter.

Until today, there was no third-party accountability for online platforms or search engines. In most cases, research and audits were done externally, without access to internal data, with some small exceptions, like the internal audits that my team and I did while at Twitter. Other times, platforms would pledge transparency and open datasets to external researchers, only to shut the efforts down or ban the researchers from their platforms outright. That’s why it is so important to safeguard the independence of researchers from platforms’ mood swings, and the DSA can be a big part of that. With online platforms shaping a big portion of our online lives, it’s vital that external entities can measure their impact.

Under the DSA, the level of access is unprecedented. Researchers will have a chance to get meaningful data of the most precious kind: unaggregated. Even when datasets are released today, they tend to be released in aggregate form to (rightfully) protect the privacy of each user. Even processes that are very transparent and scientific (e.g., the approval of new drugs) rely on aggregated results: scientists present the conclusion of their study, arguing that the drug they developed is safe and beneficial. However, they are not required to share the effect on each individual patient; they report results at the population level.

Of course, as we’ll see in a moment, the Commission knows that with great data comes great responsibility.

The Delegated Act

So what are some of the main points of this Delegated Act?

  • Access Portal. To simplify and centralize things, a DSA Access Portal will be created, which will be interoperable with already existing sharing infrastructure (specifically the AGORA platform), so there will be no need to create another account.
  • Inventory. Data providers should produce a data inventory! This is in recognition of the fact that finding the right data is sometimes 90% of the work. And yes, this is a difficult task even when you are inside the company.
  • Timing. Things will move fast. After submitting an application, the researcher will be notified within 5 days whether it is complete or supporting information is missing. After that, it will take up to 21 working days to transmit the request to the data provider, which will have just 24 hours to respond.
  • DSCs. A lot of power (and work!) is given to the Digital Services Coordinators (DSCs). First, they are the ones who must verify that a data access request is appropriate for the study of systemic risks. As researchers, we are always tempted to ask for all the data available, but for privacy and practical reasons, requests need to be scoped to the research question at hand. Similarly, DSCs can ask the researcher to explain how the data was selected and why other available data sources (e.g., public data) are not enough for the study (Article 8(6)).
    • Of course, data sharing might expose platform vulnerabilities, and the DSCs are the ones who check for an appropriate balance between the data requested and the vulnerabilities it might expose.
    • They must also verify that the researchers have “access to the available legal, organizational and technical means to meet the requirements of confidentiality, data security and protection of personal data.” Finally, if there is a disagreement between researchers and platforms over a request, the platforms can ask for mediation with an impartial third party. Note that the platforms will bear the cost of the mediation.
  • Software. Platforms can’t restrict the tools and software libraries used by researchers unless it’s specified in the data request.
  • Public requests. Not only will data requests be standardized, but a public overview of them will also be available. It will be interesting to see what people study, what organizations they belong to, and how they formulate their research questions.
  • Data Privacy. The Commission takes data privacy seriously. Not only is GDPR mentioned many times, but researchers have to demonstrate they have and can maintain a secure processing environment that will, among other things, log access and have enough compute for the purpose of the research.
  • A/B Tests. Last but definitely not least, my favorite part is buried in Recital 12, where A/B tests are explicitly mentioned. This means platforms will have to share data on the experiments they have been running! A/B tests are currently the gold standard for assessing the impact of a specific change and are usually how platforms make their ship/no-ship decisions. This will uncover many interesting insights into platforms’ internal decision-making. I am sure platforms will fight back, claiming privacy or IP reasons.
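To make concrete what kind of analysis sits behind those ship/no-ship decisions, here is a minimal sketch of how an A/B test on a click-through metric is commonly evaluated, using a standard two-proportion z-test. Nothing here comes from the Delegated Act itself; the function and the traffic numbers are hypothetical illustrations.

```python
import math

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided two-proportion z-test: control (A) vs. treatment (B).

    Returns the z statistic and the two-sided p-value.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via math.erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: 10,000 users per arm, 5.0% vs. 5.6% click-through.
z, p = two_proportion_ztest(500, 10_000, 560, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With unaggregated experiment data, vetted researchers could rerun exactly this kind of analysis themselves, rather than taking a platform’s aggregate summary at face value.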

What’s Next

This Delegated Act adds many missing details on the implementation of the important Article 40. However, it’s a starting point, not the full story. Many open questions remain that will be decided at the national level by the Digital Services Coordinators, including:

  • Discoverability. While platforms have to provide a list of their data, how easy it is to discover what data is available is a very important detail. As mentioned, a surprisingly large chunk of time is spent finding the right data for the job and then making sure that the data actually represents what it seems to represent.
  • Secure processing. Multiple mentions of secure processing environments have been made, but the details are still lacking. What standards must be satisfied to be classified as a secure processing environment? Can researchers use third-party servers for that purpose? Or should everything be done in-house?
  • Scraping. What is the relationship between Article 40 and scraping? Researchers see scraping as a fundamental tool to preserve their autonomy and help with their research. Platforms see it as a violation of their terms of service and are known to use such infringements to shut researchers out of their platforms.
  • Mediation. If there is no resolution during the mediation, the mediator can declare it closed. What happens then? Will the researcher not get the data and have to submit a different application? Does an unsuccessful mediation prevent them from submitting an application that is too similar?

The Commission closed the feedback period for the public draft on December 10. It will incorporate the appropriate feedback, and an updated Delegated Act is expected to be released—and to be in full force—sometime in the first quarter of 2025. In the meantime, DSCs (especially the one in Ireland, where most of the designated platforms are based) will have to staff up and prepare for all the vetted researchers’ applications.

Authors

Luca Belli
Luca Belli is a data scientist with a policy twist. He recently spent over a year as a UC Berkeley Tech Policy Fellow and as a Visiting AI Fellow at the National Institute of Standards and Technology (NIST) where his work included leading the Red Teaming effort for the Generative AI Public Working G...