Betsy Masiello and Derek Slater are the founding partners of Proteus Strategies, a boutique tech policy strategy and advocacy firm.
As we approach the year’s midpoint, it would be the understatement of 2023 to say that there is a lot of buzz around AI, particularly large language models and other powerful, ‘foundation’ models. While somewhat cacophonous, this dialogue is on the whole a good thing – AI may impact every aspect of our lives in the not too distant future, and now is the time to design policy interventions that will influence that future for the better.
One area that is increasingly salient in the policy discourse is the role of open source software, and openness more generally, in the development of the AI market. There is a gradient from closed to open approaches to the development and release of AI products, the precise bounds of what constitutes ‘open source’ in this context are still contested, and there is a range of policy takes on the open source end of that gradient. Points of debate include who should have liability for open source vulnerabilities, and how open source may help address or exacerbate safety concerns. We focus here on the role of openness in preserving a competitive marketplace where new entrants, small companies, nonprofits and others will have access to the tools of innovation.
Many of the perspectives on this issue land at one of two extremes. Some, pointing to the significant costs of building powerful AI models, conclude that concentration in the AI market is inevitable, and suggest it should be the primary point for policy interventions. This perspective tends to relegate the role of open source to the sideline of the debate. At the other extreme are those who see open source AI as a means to confront existing concentrations of power in the technology industry and lower barriers to development of powerful AI models.
We think the answer is more uncertain than either side admits, and that AI policy interventions can and should be designed to leverage open source solutions without also expecting them to be policy panaceas. Open source AI will undoubtedly have an impact on how the market evolves, and policymakers should take this into consideration. But the shape and extent of that impact are contingent on a number of different factors. Our aim in this piece is to tease some of those contingencies out and consider what sorts of interventions might tip the scale toward enabling competitive market access and development of this technology.
Open source is already driving innovation in powerful AI models
Open source is important in the development of AI for reasons that mirror the role it has played in the development of the Internet ecosystem as a whole. The ability to share code and learn from other developers speeds up and broadens adoption of technology. Open access to the code and ingredients of an AI model can improve auditability and trustworthiness of that model, which could help third parties to evaluate models and identify ways to minimize bias and other harms.
Commercial investment in open source AI is already significant, and points to the critical role the approach will play in this ecosystem. Some of the biggest headlines have come from Hugging Face, a platform that supports and hosts myriad third-party AI projects and a company that itself has a multi-billion dollar valuation; Stability AI, best known for commercializing the open source Stable Diffusion text-to-image generative AI tool; and the release of Meta’s LLaMA model, which has subsequently been tuned to provide even greater performance and run on a wide variety of devices, including low-cost devices like phones and the Raspberry Pi. There are countless other open source large language models as well as start-ups, like Chroma and Together, that are using open source to support AI development.
As with open source generally, it’s not just the province of companies, but rather a community that includes nonprofits, researchers, and other contributors. Consider EleutherAI, a nonprofit collective that grew out of a Discord channel and has contributed a number of language models, datasets, and other resources to open source AI development.
Moreover, governments and other institutions are investing. The United Arab Emirates’ Advanced Technology Research Council released Falcon 40B, an open source model with performance comparable to many current closed source large language models.
Cumulatively, open source may already be driving commoditization of large language models. This can be a very good thing from an innovation and competition perspective – these powerful models could effectively become available for everyone to build on, creating no durable advantage for any one model or its creator.
Open source AI is important, though not a panacea for market concentration
But the availability of open source AI alone doesn’t dictate how markets will evolve. The arc of open source development in software has taught us this much: the existence of open alternatives helps shape and constrain the power of market players, but doesn’t necessarily dictate a more diverse, competitive market structure. Because incumbents and big businesses can build commercial offerings around open source as readily as anyone else, open source won’t necessarily, by default, shift power away from them.
On the contrary, a small number of big companies may still win, constraining consumer choice and to some extent market dynamism. Look to Mozilla’s Firefox as one example of a valuable open source competitor in a highly concentrated market; Firefox has certainly driven the browser market forward and provided additional choice for consumers, but that market is still highly concentrated in terms of actual user adoption. Meanwhile, the Android operating system is open source, but the mobile market is still concentrated and subject to anti-competitive practices in a variety of ways. Google retains significant control over the development of the software, and governments around the world have pursued antitrust claims related to abuse of its dominant position in the markets for operating systems, as well as mobile communication applications and services.
As with open source software that has come before, we should expect that with AI there will be businesses built around complementary products and services, for example tuning an AI model for particular use cases or providing services on top of underlying open source AI models, and markets may concentrate around those complementary services.
With large language models and emerging forms of powerful AI, it really is too early to know how this market will shake out. As entrepreneur Elad Gil convincingly argues, we could see a market with one leading winner, an oligopoly, development of multiple niche markets, or a stack where value accrues to marketing and distribution more than the underlying technology. Value may accrue at different levels of the stack, to platforms and apps in differential ways, and at any layer of the stack it may accrue in ways that shift market structure over time.
Barriers to model development are large, but not necessarily insurmountable
What too often gets lost in the discussion around openness in AI is a key distinction between open models that others can build on and traditional open source software code. With traditional open source software, you can edit and modify the code itself; everything can be changed and improved. However, open source AI model characteristics are in some sense fixed; while having access to and using a pretrained model can dramatically lower barriers to deploying AI, it also means that one is locked into the design decisions that went into the model’s pretraining. This is where the argument that AI will necessarily inherit the market concentration of ‘Big Tech’ (or at least large, highly resourced companies) carries some weight. For instance, as we mentioned above, the open source LLaMA model allows many people to fine-tune and customize their own AI tools. But Meta built and trained the LLaMA model, and everyone who builds on top of LLaMA is fundamentally constrained by the original training and design choices Meta made. Reworking open source code to suit a given purpose can certainly take substantial resources, but developers typically do not face the same fundamental design constraints when they modify open code.
The barriers to entry in building these underlying AI models remain significant. As some have suggested, “eye-watering” sums of money can be required to develop and train the powerful ‘foundation’ models (including large language models) that have captured so much interest recently. Vast data sets, access to compute resources, and highly trained talent to develop the algorithms to train these models are all critical inputs for a new entrant. When it began work, OpenAI was a new entrant taking on established Big Tech companies, but required significant resources to succeed. GPT-3 reportedly cost over $4m to train, and GPT-4 reportedly cost more than $100m. And operating ChatGPT has been estimated to have cost $700,000 per day, just in compute cost.
While significant, these barriers to entry are not necessarily insurmountable. Take BLOOM as one example: with a research grant for compute resources, BLOOM can now generate text in 46 natural languages and 13 programming languages. After a year of collaborative efforts from over 1,000 researchers, the model was trained over a run of 117 days on a grant estimated to be worth ~€3M, using a supercomputer operated by French research institutions. Another example is GPT-NeoX, built by EleutherAI and trained on a public dataset it created called The Pile. It appears that the costs of developing these models are declining, and, while it is generally presumed that larger and larger datasets will drive performance gains, people are working on getting comparable performance with smaller datasets as well as using synthetic datasets. In terms of operating the models, we’re already seeing how models like LLaMA are being tuned to run on low-cost devices, as noted above.
To be clear, examples like BLOOM and GPT-NeoX are still far from the proverbial “start-up in a garage,” and were not developed for deployments comparable to other commercial models and their benchmarks. Big Tech, and large, well-capitalized companies more generally, still have advantages.
But the extent of that advantage depends on a key question: even if larger companies can build the highest performing models, will a variety of entities still be able to create models that are good enough for the vast majority of deployed use cases? Bigger might always be better; however, it’s also possible that the models that smaller entities can develop will suit the needs of consumers (whether individuals or companies) well enough, and be more affordable. Segments of the market and different groups of users may operate differently. There may be some sets of use cases in which competition is strongly a function of relative model quality, while in other instances competition depends on reaching some threshold level of model quality, and then differentiation occurs through other non-AI factors (like marketing and sales). Users might in many cases need outputs that reach a given quality threshold, without necessarily being best in class; or, a model might serve a subset of users at very high quality levels and thus be sufficient even if it doesn’t hit performance benchmarks that matter to others.
Policy interventions to consider
Several policy interventions warrant consideration to preserve competitiveness in the AI market and ensure new entrants, as well as smaller organizations, are able to not only fine-tune the models developed by Big Tech and large entities, but also develop their own models from scratch.
Open datasets will be critical, in particular for mitigating the risk of market failures that unevenly distribute AI solutions — for example, by building non-English language data sets. Common Crawl is an open collection of ~4.5 billion web pages that provides an important training data set for many AI models in existence today. Support for its ongoing development, as well as for other types of datasets, can be helpful. Beyond preserving competitiveness, there is value in building high-quality, open data sets for use cases specifically in the public interest (for example: data sets to measure bias, data sets to support macroeconomic and environmental forecasting, data sets to catalyze public planning in strategic areas).
Exploration of “public options” holds tremendous potential. One proposal uses a three-pronged approach for public investment in AI, including support for creation of a “public data pool” and public compute resources, like the supercomputer used to build BLOOM. Stability AI supports creation of a “public foundation model” to support academic, small business and public sector applications. The UK has already committed £100m toward such an effort, and the US’s development of a National Artificial Intelligence Research Resource is working in a similar direction.
If supporting entry from new and smaller entities is a policy goal, then regulation of AI must also be proportionate and tailored so that it doesn’t create an undue barrier to entry. While that can mean exempting entities below a certain threshold from some regulations, it can also mean specifically supporting all entities in compliance. Solutions that ensure smaller-scale AI providers and open source models can incorporate trustworthy and safe design principles into their development, and also comply with regulatory requirements such that they are subject to adequate oversight, will help new entrants overcome initial regulatory hurdles as they get started.
These are just a few overall directions, and there are many other policy prescriptions that are possible. As we stated at the outset, we don’t have a firm prediction of where the market will go and what will be most effective (in fact, given how fast moving this space is, any prediction seems to sit on shifting ground). Rather, we hope that by outlining some of this uncertainty, we can move the conversation about the relationship between open source AI and the power of large firms away from fatalistic, binary points of view, and help highlight the many dependencies and possible futures ahead.