Naomi Shiffman is the Head of Data and Implementation for the Meta Oversight Board, where she leads a team in assessing the Board’s impact on Meta’s content ecosystem.
Social media platforms guard their data closely for a variety of reasons, chief among them user privacy, concern about misinterpretation, and protecting business interests. This has led to a frequently shifting landscape for researchers trying to understand the social impacts of products that have grown central to our society. The privacy concern has led platforms to share data at scale primarily in two ways: either the data is already public (see Twitter academic API, CrowdTangle, YouTube and TikTok’s researcher programs), or it is released via overly burdensome privacy mechanisms, such as Meta’s differentially private URL Shares dataset. Such mechanisms often make it very challenging to extract any meaning from the data at all.
Europe’s General Data Protection Regulation (GDPR) has been a key driver of fierce company protection of user data, in addition to the backlash following the Cambridge Analytica scandal and other major data leaks. However, under Article 40 of the EU Digital Services Act (DSA), platforms will be required to create researcher data-sharing programs and face this long-standing challenge head-on. The European Commission is currently inviting comment on how exactly this should work.
Luckily, platforms and regulators have a strong precedent in medical research. While it’s true that social media data has many privacy-sensitive elements, there’s as much, if not more, privacy-sensitive data held by the medical industry. The proven research data access framework developed under the US Health Insurance Portability and Accountability Act (HIPAA) demonstrates that independent research with sensitive data is possible. This framework and the many well-tested tools and programs in the medical research industry should serve as a model for implementation of the DSA, and for future US regulation.
Solutions from Medical Research
Medical research offers several solutions for working with privacy-sensitive data: synthetic data, full de-identification, statistical de-identification, and sharing data that retains partial user information. Research partnerships with providers and insurers are also common. Each solution has its strengths and limitations.
One solution that is increasingly popular in the medical research industry, but that no social media platform has yet publicly tested for independent research purposes, is synthetic data. This solution aims to address privacy concerns: by recreating data distributions without implicating actual user data, researchers may be able to study platform dynamics and impacts without a high risk of privacy breaches.
In the medical industry, synthetic data is used for several purposes, including simulation and prediction research; hypothesis, methods, and algorithm testing; epidemiology/public health research; health technology development; education and training; public release of datasets; and linking data. For example, a fully synthetic dataset was used in an epidemiological study to demonstrate that in the context of measles outbreaks, interrupting contact patterns via voluntary isolation and home quarantine was particularly important in reducing the number of secondary infections and the probability of uncontrolled outbreaks.
In the platform research context, while many or all of the medical research use cases may be relevant, synthetic data is often discussed in relation to two intersecting use cases.
Writing queries to run on real data: Synthetic datasets are sometimes suggested as a solution for helping researchers understand variables, including data format, type, and structure, in order to write queries without ever seeing real data. Researchers could then send those queries “over the wall” to platforms to be run on actual data, and get the results back without ever directly handling user data themselves. In the medical research context, the UK’s Clinical Practice Research Datalink (CPRD), a subset of the Medicines and Healthcare products Regulatory Agency, provides synthetic datasets for research alongside anonymized patient data — its Aurum sample dataset serves a similar query-writing use case. This dataset is a “medium-fidelity synthetic dataset that resembles the real world CPRD Aurum [a dataset of actual patient information] with respect to the data types, data values, data formats, data structure and table relationships.” Note that per CPRD, this can’t be used for in-depth analytical purposes.
The query-writing use case works if researchers already have robust research questions in mind, and are mainly using the synthetic data to write queries with the correct variables in order to send them over the wall to platforms. While synthetic data is one way to do this (as shown in the Aurum dataset example), it can also be addressed with a data structure map and data dictionary. To be clear, having a menu of options for data is truly a valuable asset to researchers, and one that few platforms provide currently, but synthetic data may be an over-engineered solution to this problem.
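To make the data dictionary alternative concrete, here is a minimal sketch in Python. The table names, field names, and the `unknown_fields` helper are all hypothetical and are not drawn from any platform's actual schema. The sketch only illustrates how a published dictionary could let a researcher check a query's variables before sending it over the wall.

```python
# Hypothetical data dictionary: table -> {field name: type}.
# None of these names come from a real platform; they are placeholders.
DATA_DICTIONARY = {
    "posts": {
        "post_id": "string",
        "created_week": "date",
        "content_category": "string",
        "share_count": "integer",
    },
    "accounts": {
        "account_id": "string",
        "country": "string",
        "follower_band": "string",
    },
}


def unknown_fields(table, fields, dictionary=DATA_DICTIONARY):
    """Return the requested fields that the dictionary does not define."""
    if table not in dictionary:
        raise KeyError(f"unknown table: {table}")
    return [f for f in fields if f not in dictionary[table]]


# A researcher drafts a query and checks its variables before submitting it.
print(unknown_fields("posts", ["content_category", "share_count", "likes"]))
```

A check like this needs no synthetic records at all, which is the sense in which a dictionary may be the lighter-weight option for pure query writing.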
Understanding correlational relationships: A second use case for synthetic datasets is to help researchers understand and analyze correlational relationships between variables. In addition to informing the query writing described in the first use case, synthetic datasets can more broadly allow researchers to do exploratory research to develop a research question, or even conduct entire studies without ever needing to access real user data. In the medical research context, CPRD also provides two high-fidelity synthetic datasets that “replicate the complex clinical relationships in real primary care patient data while protecting patient privacy as they are wholly synthetic.” They can be used for complex statistical analyses in addition to other applications.
Executing a system that can support an analytical use case is complex. It’s relatively straightforward to create synthetic datasets that accurately replicate a data distribution along one axis (e.g. engagement on a set of content over time). However, once you grow beyond a single distribution (e.g. adding in content categories, user demographics, user behavior), the dataset will lose fidelity quickly if it isn’t anchored in the true population dataset and regularly resampled. This leads to one of the core challenges for synthetic data as a tool for platform research — platforms own all the data required to build a synthetic dataset, and the sampling required to create a valid synthetic dataset requires platforms to have clean, organized data and systems. As is fairly common knowledge at this point, platform data is generally messy, inconsistent, and not tracked well. It is wildly expensive to reset data flows, tracking and storage in a years-old system with significant technical debt — a description that applies to most (if not all) platforms.
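The fidelity problem can be illustrated with a toy sketch in Python. Everything here is an assumption made for illustration: the two variables, the strength of their relationship, and the deliberately naive synthesis method that resamples each column on its own. Real synthetic data generators are more sophisticated, but the sketch shows why matching single distributions is not enough.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def make_population(n, rng):
    """Hypothetical 'true' data: engagement depends on audience size."""
    audience = [rng.gauss(0, 1) for _ in range(n)]
    engagement = [a + rng.gauss(0, 0.5) for a in audience]
    return audience, engagement


def synthesize_independently(audience, engagement, rng):
    """Naive synthesis: resample each column separately, ignoring the link."""
    n = len(audience)
    return rng.choices(audience, k=n), rng.choices(engagement, k=n)


rng = random.Random(0)
audience, engagement = make_population(5000, rng)
syn_audience, syn_engagement = synthesize_independently(audience, engagement, rng)

# Each synthetic column still looks like its real counterpart on its own,
# but the relationship between the two columns is gone.
print("real correlation:     ", round(pearson(audience, engagement), 2))
print("synthetic correlation:", round(pearson(syn_audience, syn_engagement), 2))
```

Preserving the relationship requires modeling the joint distribution from the true population data, which is the part only the platform can do.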
Synthetic data is also not a bulletproof privacy solution. As the number of distributions and statistical properties that need to be preserved in a synthetic dataset grows, the re-identification risk of the dataset grows as well. Inherently, this means that any synthetic dataset that does not have additional privacy protection (such as the statistical de-identification discussed later in this paper) has to balance between usefulness for answering a large diversity of research questions, and actually serving the purpose of maintaining user privacy.
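A rough intuition for this trade-off can be sketched in Python. The records and attribute names below are invented, and the share of unique attribute combinations is only a crude proxy for re-identification risk, not a formal privacy measure. The point is that each additional attribute preserved jointly leaves more records standing alone.

```python
from collections import Counter


def unique_fraction(records, attributes):
    """Share of records whose combination of `attributes` appears only once."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


# Invented toy records; attribute names are placeholders.
records = [
    {"region": "north", "age_band": "18-24", "topic": "sports"},
    {"region": "north", "age_band": "18-24", "topic": "news"},
    {"region": "north", "age_band": "25-34", "topic": "sports"},
    {"region": "south", "age_band": "18-24", "topic": "sports"},
    {"region": "south", "age_band": "18-24", "topic": "sports"},
    {"region": "south", "age_band": "25-34", "topic": "news"},
]

for attrs in (["region"], ["region", "age_band"], ["region", "age_band", "topic"]):
    print(attrs, round(unique_fraction(records, attrs), 2))
```

In this toy set, no record is unique by region alone, a third are unique by region and age band, and two thirds are unique once topic is added.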
There’s also a trust problem. For the research community to trust a synthetic dataset enough to actually rely on it, researchers would require a very robust body of literature confirming that synthetic distributions match distributions in the ground truth data. This literature would need to be built through partnerships between internal and external researchers in the first year of a dataset’s existence.
Challenges notwithstanding, the conditions for platforms to offer researchers synthetic datasets will soon be within reach. Article 37 of the DSA forces much of the necessary data clean-up as a precondition for mandated risk assessment audits, because external auditors can’t audit illegible data and systems. As a result, now might be the perfect time to establish a synthetic data norm in platform data research — despite their limitations, synthetic datasets would go a long way towards addressing platform privacy concerns and making data that otherwise implicates user privacy more broadly accessible to the research community.
Fully de-identified data
HIPAA outlines the ways in which medical data should be de-identified for research. The first method (known as “safe harbor”) is to remove any identifying variables, including geographic location smaller than a state, which runs the risk of making the data useless for research, particularly when the research is on differential effects across demographics.
The benefit of using fully de-identified data for research is that there is virtually zero risk of re-identification, which lowers the risk of sharing this data at scale. However, under HIPAA, dates must be stripped as part of de-identification. The lack of dates makes meaningful research possible only in specific contexts and about limited research questions. The same challenge applies to fully de-identified platform data.
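As a rough illustration of what a safe-harbor-style approach would mean for platform data, consider the Python sketch below. The field names are hypothetical, and the three field groups are a simplified stand-in for the identifier categories HIPAA actually enumerates. This is not a compliant implementation, only a picture of what survives the stripping.

```python
# Hypothetical field groups; a simplified stand-in for HIPAA's identifier list.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "user_id", "ip_address"}
SUB_STATE_GEOGRAPHY = {"city", "zip_code", "street_address"}
DATE_FIELDS = {"birth_date", "post_date"}


def strip_identifiers(record):
    """Drop direct identifiers, geography below state level, and dates."""
    drop = DIRECT_IDENTIFIERS | SUB_STATE_GEOGRAPHY | DATE_FIELDS
    return {k: v for k, v in record.items() if k not in drop}


# Invented example record.
record = {
    "user_id": "u-123",
    "email": "person@example.com",
    "city": "Oakland",
    "state": "CA",
    "post_date": "2023-05-01",
    "content_category": "health",
    "likes": 12,
}

print(strip_identifiers(record))
```

What remains is the state, the content category, and the like count. Questions about trends over time or local effects can no longer be asked of this record, which is the usefulness cost described above.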
Platforms already share fully de-identified data in some contexts (see Meta’s Data for Good mobility, population, connectivity, economic, social and forecast data). However, these solutions alone are insufficient for the scope and scale of data that robust platform research requires, and there are many other solutions that HIPAA enables which platforms could replicate under the right regulatory regime.
Partial Personally Identifiable Information (PII)
Despite the very tight protections HIPAA puts on medical information, it does still allow for partial PII to be shared in specific research contexts.
The Centers for Medicare & Medicaid Services (CMS), the federal agency that administers Medicare and Medicaid, provides researchers with Medicare data, including some PII, in the Chronic Conditions Warehouse (CCW), which is refreshed monthly. The CCW was launched in response to the Medicare Modernization Act of 2003. It is administered by an academic consortium called ResDAC, among other subcontractors, and data privacy is maintained through strict agreements between CMS, ResDAC and the research entities, which include commitments not to attempt to re-identify individuals, to store data securely, to only publish results with data above a certain granularity, etc.
CCW data is shared via two routes: first, by shipping data to research entities on a disk — in this scenario, researchers must follow stringent protocols, such as only accessing data onsite at a university in a locked office. The second access method is the Virtual Research Data Center (VRDC). In this system, many data security controls are centrally managed, and researchers’ primary concern is not exporting data in violation of their research agreement. Note that both these data-sharing mechanisms entail a much higher level of trust between researchers and data owners than currently exists with platforms — it is CMS that takes the leap and trusts the researchers to abide by agreements, rather than adopting such stringent privacy protections that research becomes nearly impossible.
It’s hard to see platforms ever being willing to adopt this approach. One clear argument against this type of solution in the platform research context is that CMS data is owned by the federal government and not a private company, and that such a solution for platforms implicates another concern outlined in the introduction to this paper: protecting business interests. Despite this difference, the solution is worth considering because of the high sensitivity and heavy regulation of data involved in both contexts. It’s difficult to imagine that platform data is so much more sensitive than medical data that a similar solution could not be implemented, or that research contracts could not be written in such a way that business interests were protected.
Statistically de-identified data
Even if platforms never wanted to enter the world of sharing PII-linked data, health insurance companies have forged a regulation-approved pathway to sharing privately-owned data.
A second method of data de-identification outlined under HIPAA is statistical de-identification via expert determination. The law allows for “A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable… (i) Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and (ii) Documents the methods and results of the analysis that justify such determination.” It is this second method that unlocks a range of more flexible options to support medical research.
Under this legislation, healthcare providers and insurers have a path to provide researchers with de-identified data directly. For example, UnitedHealth’s OptumLabs has provided researchers with around a dozen preset data views pulling from its claims data. If researchers want to link UnitedHealth data to another dataset, or request a unique view that is not among the preset options, they must work with statistical experts as described above to assess what information must be stripped from the UnitedHealth data view they’ve accessed or want to request in order to make the combined dataset privacy-safe. This system allows medical researchers to undertake exploratory research with the preset categories, and then leverage additional datasets to expand the research scope.
The business of data sharing has led to a flourishing industry of its own. Data vendors such as Datavant provide a HIPAA-protected clearinghouse for data from both insurers and providers, including privacy-protected medical claims, patient health records, lab data, pharmacy data, genomics, and more.
There’s also precedent for a cross-company data-sharing tool. The non-profit Health Care Cost Institute (HCCI) was founded in 2011 by a group of large health insurers and combines de-identified data from Aetna, Humana, Kaiser, and Blue Shield, with over 1 billion claims annually covering over one third of the US employer-sponsored insurance population. HCCI has normalized data across insurers to make the dataset legible. While this would be a challenge given the varying content and action types across social media platforms, it’s not impossible, and regulation normalizing reporting categories could help pave the way. Recent platform cross-company best practice initiatives, such as the Digital Trust and Safety Partnership, suggest that this type of data-sharing partnership may not be as far away as we think.
Medical research offers a solution to concerns of re-identification should data be shared alongside publication in the name of reproducibility. In this industry, the norm is to only publish results, not the underlying data. Even journal submissions, which may include data to support peer review that will not ultimately be published, include manual data review for risk of re-identification and restrictions on data granularity in case the data accidentally leaks. Reproducibility concerns are addressed by the relatively wide availability of de-identified data, as well as accommodations for dataset expansion as outlined above. This norm and path for dataset reproduction could be adopted for platform research as well.
Overall, the well-tested solutions forged by medical research, and protected under HIPAA, could be low-hanging fruit for replication in US platform regulation. Still, this solution is also only part of the larger puzzle.
As outlined in the first section of this article, many of the social science questions researchers want to examine can’t be answered with the solutions described above. This is in part because much of the data researchers require can only be collected in active experiments on the platforms, especially if researchers are trying to establish causal relationships between certain platform features and user behaviors or beliefs.
Medical research has a long history of partnerships between data owners, whether it be healthcare providers or insurance companies, and researchers. In one recent example, an external company tested interventions with Sutter Health diabetes patients. In another, researchers partnered with a health insurer to test whether wellness coaching made an impact on future medical claims.
The historical lack of trust between researchers and platforms has made similar research initiatives relatively few and far between, but examples like the US 2020 election research project and the “recommendations for social support” experiment establish some groundwork to advance these types of projects in the future.
There are many opportunities for platforms to model privacy-protecting research practices on best practices from the medical research industry, and the DSA is creating the conditions to do so. It’s also important to note that platforms have strong precedent for sharing publicly available data as described in the beginning of this paper. The DSA, as well as any US regulation passed to enable platform research, should not be used as an excuse to clamp down on these initiatives. The speed at which platform data changes makes live data pipelines like the Twitter firehose or CrowdTangle necessary, particularly for non-academic researchers and journalists who work on shorter timescales and whose work may inform civil society responses to political events, natural disasters, and crises of all types.
It will take a combination of approaches — synthetic data, live pipelines, de-identified data warehouses, collaborative projects, and more — to support truly robust social media platform research. It will also take serious trust-building between researchers and platforms. With DSA Article 40 mandating researcher access to data, platforms and regulators should look to US medical research as a model, and leverage every mechanism at their disposal to support multi-faceted insights into the impact of online platforms on individuals and society.
Thanks to Cameron Schilling for his input on this piece.