From Ambiguity to Accountability: Analyzing Recommender System Audits under the DSA
Peter Chapman / Feb 28, 2025

Imagine if financial auditors had no standard processes or definitions to follow – just their own insights and discretion to decide what and how to audit. The results would not be very meaningful. This is where we find ourselves with the first round of audits under the EU's Digital Services Act (DSA). Without clear standards, the effectiveness of these audits depends on which firm conducts them and how key terms and processes are defined.
Last month, I wrote about how DSA risk assessments and audits are undermined by two gaps: (1) the failure to adequately assess the role of platform design in relation to risk; and (2) a lack of reporting about the data, metrics, and methods platforms and auditors used to evaluate risk and compliance.
This piece analyzes the different approaches platforms and independent auditors took in the first round of DSA audits related to recommender systems (DSA Articles 27 and 38).
Without consistency and comparability, these audits cannot meaningfully assess how effectively platforms are mitigating risk. By clarifying core definitions and identifying effective audit approaches now, the DSA audits can become a more effective accountability tool. To begin to make progress, let’s look at where we are today.
Auditing under the DSA regime
As many others have noted, the DSA introduces new terms and concepts for understanding platform risks, sometimes without a clear definition. This is expected given the novel nature of the DSA regulatory regime. In these early years of the DSA, a range of stakeholders – online platforms, civil society, the European Commission (EC), and national Digital Services Coordinators (DSCs) – must experiment, identify good practices, and share lessons learned. Such iteration is important to ensure an adaptive DSA regime that spurs innovation and responds to shifting technologies, risks, and mitigation strategies.
The need for iteration and flexibility, however, should not mean the audits fail to deliver on their potential as vehicles for transparency and accountability. The first round of independent audits of recommender systems reveals clear areas for immediate improvement.
Because the core definitions and methodologies were developed independently by platforms and auditors, significant inconsistencies exist in both risk assessment and audit processes. When evaluating the same requirements, platform auditors have differing expectations and employ different terminology. Reviewing audit findings related to recommender systems leaves us comparing apples to artichokes.
What do the DSA recommender systems audits assess?
Auditors assessed articles related to recommender systems across the DSA, including Articles 27 and 38. These articles require platforms to:
- Describe, in “plain and intelligible language,” the “main parameters” of their recommender systems, including the “most significant” criteria for recommending information to users and the “reasons for the relative importance of those parameters.”
- Enable users “to select and to modify at any time their preferred” recommender system option; this option should be “directly and easily accessible” on the platform.
- In the case of very large platforms, also offer at least one recommender system that is “not based on profiling,” as defined by the EU’s General Data Protection Regulation.
Several core terms – like “plain and intelligible language” or “most significant” recommender system criteria – are not defined in the DSA. In auditing Facebook, for example, Ernst & Young noted that “many of the obligations needed to be supplemented by the audited provider’s own legal determination, benchmark and/or definition of ambiguous terms.” This is a common approach and is an important way for platforms and auditors to clarify expectations. It is also an opportunity for stakeholders to align around a shared understanding of core expectations.
What are the key definitions of recommender systems?
Before your eyes glaze over reading a thousand words on audit definitions, these nuances really matter! There is significant variation in how platforms approach recommender system definitions. Some platforms have defined DSA-related terms, whereas others have chosen not to. Table 1 below summarizes some of the definitions given by platforms.
In many instances, definitions will be where the DSA’s substantive requirements actually become meaningful for users. A requirement for “directly and easily accessible,” for example, will only be meaningful if it is operationalized in a way that empowers platform users to shape their recommender systems.
The audits demonstrate important variation across foundational definitions:
- Main parameters of a recommender system: What exactly are the main parameters of a recommender system? As others have noted, recommender systems rely on hundreds of components that are constantly refined (not to mention the potential for “millions or billions of learned neural network weights”). The text of the DSA says main parameters are “at least: (a) the criteria which are most significant in determining the information suggested to the recipient of the service; (b) the reasons for the relative importance of those parameters.” Platforms have further defined this as “broad categories of signals” (TikTok), those “most significant in determining” recommendations (Pinterest, closely aligned to the DSA), and the “primary factors determining output” of the recommender system (Snap). Other platforms have simply left the term open to interpretation, as is the case with Google/YouTube and X. Without understanding how a platform interprets and communicates “main parameters,” it is very difficult to understand how these components connect to system outputs (see the sketch after this list for one way to make the idea concrete). Yet there is work on which to build. The Knight-Georgetown Institute (KGI), where I work, has a forthcoming report offering concrete recommendations for how to publicly disclose information about the specific input data and weights used in the design of recommender systems.
- Plain and intelligible language: The DSA requires the main parameters of recommender systems to be spelled out in plain and intelligible language. What does this concretely mean in the recommender system context? Is it free of “acronyms or complex/technical terminology” (Pinterest), “straightforward vocabulary and easy to perceive, understand, or interpret” (Snap), or “written for a general audience with varying technical skill levels, inclusive of all users” (TikTok)? There's a subtle difference in expectations associated with each framing. These terms don’t need to be defined in a vacuum. Platforms, auditors, and the EC should build on important research into effective online disclosures, for example from the CyLab Usable Privacy and Security Laboratory (CUPS) at Carnegie Mellon University or the OECD.
- Direct and easily accessible: The DSA requires that recommender system selection and modification be “directly and easily accessible” for users. Again, this could mean different things to platforms and auditors. Is it “intuitive, reliable, and easy-to-find entry-points” (Facebook) or merely tools available to “all users” (Snap)? Or must the option be “initially surfaced as a pop-up … where the information is being ‘prioritized,’ and is easily accessible with one click” (Pinterest, emphasis added). Or might it be whatever TikTok’s redacted definition is? Digital platforms have a long history of conducting user testing to understand concepts like direct and easily accessible. The UK's Competition & Markets Authority’s evidence review of online choice architecture (OCA) and consumer and competition harm, for example, taxonomizes OCA and summarizes literature related to effective practices. A recent Centre on Regulation in Europe (CERRE) report maps the EU choice architecture regulatory environment and articulates principles to address risks. Platforms and auditors need to incorporate these and other benchmarks into the definition of “directly and easily accessible” modification.
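To make the stakes of the “main parameters” definition concrete, consider a deliberately simplified sketch of how a ranking step might combine a handful of weighted signals, and how a non-profiling option could alter that combination. Everything here is hypothetical and illustrative (the signal names, the weights, and the scoring function), not a description of any platform’s actual system; real recommender systems combine far more components, many of them learned rather than hand-set.

```python
from dataclasses import dataclass

# Hypothetical "main parameters": every signal name and weight below is invented
# for illustration. A real system would have far more signals, most of them learned.
MAIN_PARAMETERS = {
    "predicted_watch_time": 0.5,  # engagement prediction
    "topic_similarity": 0.3,      # match to the user's inferred interests (profiling-based)
    "recency": 0.2,               # freshness of the item
}

@dataclass
class Candidate:
    item_id: str
    predicted_watch_time: float
    topic_similarity: float
    recency: float

def score(candidate: Candidate, profiling_enabled: bool = True) -> float:
    """Weighted sum of signals; skips the profiling-based signal when the user
    has selected a non-profiling option (cf. DSA Article 38)."""
    total = 0.0
    for name, weight in MAIN_PARAMETERS.items():
        if name == "topic_similarity" and not profiling_enabled:
            continue  # the non-profiling option ignores inferred interests
        total += weight * getattr(candidate, name)
    return total
```

A disclosure oriented around this level of detail, naming which signals feed the ranking, their relative weights, and why they matter, is the kind of specificity the forthcoming KGI recommendations point toward, and it is very different from “broad categories of signals.”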
Each of these examples demonstrates consequential decisions by platforms and auditors that significantly affect the outcome of the audits. Not all elements of the DSA recommender system audits need standardized definitions. However, more standardized definitions of key terms, including the three described above, are needed to make the DSA audits meaningful. The EC, DSCs, civil society groups, and platforms should take the opportunity to identify which terms necessitate more specificity and guidance.
What methods were used to audit recommender systems?
Auditors used varied methodologies. Table 2 below summarizes the methodologies auditors used in assessing select recommender system compliance.
Given that the EC has provided limited guidance, variation in audit methodologies was foreseeable. As with definitions, different approaches early in the DSA compliance regime can allow us to take stock of effective practices and emerging gaps. So, what methodological lessons does this first round of audits offer? There is certainly room for improvement.
- Outcomes: Audits appear split on whether they confirm that user modification of the recommender system actually changes system outputs. Some audits appear to be desk-based, such as Facebook’s, where the auditor looks at the system card and “sample changes.” Pinterest’s auditor examined “model documentation and code and ascertained that the main parameters used in recommender systems were impacted by [the] user's decision on opting out from profiling.” Auditors for Google/YouTube and Snap, however, appear to describe a more involved process. With Snap, the auditor selected a sample of recommender systems from an inventory and assessed whether “the algorithmic systems were tested and approved consistent with the audited provider's policies and processes.” For Google/YouTube, the auditor appeared to actually confirm recommender system outcomes: the auditor inspected “the changes to the recommender system outputs before and after modifying the options and determined that the user’s selected options influence the main parameter.” Meanwhile, X’s audit was not able to proceed with planned “substantive testing” due to a lack of necessary audit resources. Looking across these approaches, YouTube’s audit appears to assess outcomes in the most robust way, and this approach could be incorporated across other audits (see the sketch after this list for what such an outcome-level check might look like).
- User experience: The audits also differ in how the auditor assessed the user interface for modifying recommender system parameters. The Facebook audit notes that user tools to modify recommender systems were “easily accessible from the specific section of the online platform’s interface where the information was being prioritized.” But the auditor doesn’t describe how it came to this determination or whether any data was used to inform this assessment. Pinterest’s auditor confirmed modification was “direct and easily accessible,” informed by the definition Pinterest provided to the auditor, described above. Google/YouTube’s auditor describes reviewing and assessing user journey processes (in the form of screenshots). Platforms regularly conduct user testing of design changes to assess impact. Were such user testing studies incorporated into the audit process when assessing accessibility for users? Mozilla has found, for example, that subtle design choices influence users' choices. To meaningfully assess the user interface, auditors surely must consider the metrics and testing that platforms use to assess their interfaces. Research suggests that small changes can have big impacts.
- Data: One of the most striking gaps in the audits is the lack of discussion of how audits were informed by actual platform data and assessments. When assessing the sufficiency of the inclusion of main parameters, what benchmarks did auditors use? Did they consider input data (i.e., sources of raw information), predictions, scores, or weights when assessing the most significant criteria in design? When assessing options for users to shape recommender system design, did the auditors assess a combination of granular controls (e.g., over individual pieces of content) and coarser controls over the inclusion of specific topics (e.g., political content)? Did the auditors consider the platform's use of item- or user-level surveys? Research suggests these are important tools to effectively shape the design of recommender systems. Moving forward, audits can be much clearer about their sources of data and evidence.
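As a thought experiment, the sketch below shows what an outcome-level check in the spirit of the Google/YouTube approach could look like: compare the top-ranked results before and after the user switches off profiling and report how much actually changes. The get_recommendations callable, the test setup, and the overlap threshold are all assumptions made for illustration; the evidence an auditor can actually gather depends on the access a platform provides.

```python
def audit_modification_changes_outputs(get_recommendations, user_id: str, k: int = 50) -> dict:
    """Compare the top-k recommendations with profiling on vs. off.

    `get_recommendations(user_id, profiling_enabled)` is a hypothetical hook an
    auditor might be given; it should return an ordered list of item IDs.
    """
    with_profiling = get_recommendations(user_id, profiling_enabled=True)[:k]
    without_profiling = get_recommendations(user_id, profiling_enabled=False)[:k]

    overlap = len(set(with_profiling) & set(without_profiling)) / k
    return {
        "overlap_at_k": overlap,
        # Nearly identical lists suggest the user-facing option may not meaningfully
        # change outputs; the 0.9 threshold is an illustrative placeholder, not a standard.
        "option_appears_effective": overlap < 0.9,
    }
```

A check like this would not replace documentation review, but it would give auditors, and readers of audit reports, evidence that user-facing options actually change what the system recommends.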
The Way Forward
For audits to deliver on their potential, we need (1) a common understanding of key terms and expectations, and (2) minimum standards of data, evidence, and documentation that should be incorporated into the audit (and risk assessment) process.
- Definitions: Key terms can be further defined, and there is concrete work on which to build.
- The definition and operationalization of “plain and intelligible language” should leverage behavioral insights and be grounded in existing research into effective online disclosures, including by CUPS, the OECD, and other relevant sources.
- A definition of “main parameters” should explicitly spell out what is expected, including in relation to input data, values, and weights. This would allow stakeholders to more effectively understand and compare how recommender system design may contribute to risk and advance effective mitigation.
- The definition and operationalization of “directly and easily accessible” should account for research and evidence related to OCA and consider platform measures of accessibility of tools to influence recommender systems.
- Data, Evidence, and Documentation: There is a need to clarify that the audit process (and risk assessments) should further incorporate a range of relevant platform data sources. The recommender system audits did not appear to consider the wide range of existing platform metrics related to specific user behaviors or user experiences. Much of this data already exists within platforms and there are existing efforts to clarify what minimum levels of data should be expected to assess recommender systems, as well as DSA requirements more broadly.
Clear standards would benefit all involved. Platforms would gain a more predictable and coherent policy environment. European users would have confidence that meaningful standards are in place. Auditors would have a more objective playbook to follow, improving the efficiency of the audit process.
Now is the time to engage to improve the next round of assessments and audits.