Research reveals that incumbent big tech firms have dominant advantages in Gen AI similar to those in cloud computing, which limits opportunities for venture investors due to the usual issues with scale, cost and data quality.
That said, there appear to be growth possibilities at the end user interface and for those with specific domain expertise and data. But this is a market already consolidating so investors need to pick their shots with care. JL
Kartik Hosanagar and Ramayya Krishnan report in MIT Sloan Management Review:
Since the launch of ChatGPT, venture capital firms have plowed money into generative AI startups. But who will capture the value of this market, and what are the determinants of value capture? The market for models is consolidating in the same way that most of the market share (and value) for cloud services was captured by Amazon, Google, and Microsoft. Applications that boast a large, loyal user base stand to capture the most value from generative AI by leveraging their distributional advantage. In the absence of model- or data-based differentiation, companies will need to distinguish themselves at the user interface. Entrepreneurs building startups without access to proprietary data or a large installed base will have to build their initial products and services on top of larger general-purpose models.
In the months since the public launch of ChatGPT, massive investments have been made in the form of venture capital firms plowing money into generative AI startups, and corporations ramping up spending on the technology in hopes of automating elements of their workflows. The excitement is merited. Early studies have shown that generative AI can deliver significant increases in productivity. Some of those increases will come from augmenting human effort, and some from substituting for it.
But the questions that remain are, who will capture the value of this exploding market, and what are the determinants of value capture? To answer these questions, we analyzed the generative AI stack — broadly categorized as computing infrastructure, data, foundation models, fine-tuned models, and applications — to identify points ripe for differentiation. While there are generative AI models for text, images, audio, and video, we use text (large language models, or LLMs) as an illustrative context for our discussion throughout.
Computing infrastructure. At the base of the generative AI stack is specialized computing infrastructure powered by high-performance graphics processing units (GPUs) on which machine learning models are trained and run. In order to build a new generative AI model or service, a company might consider purchasing GPUs and related hardware to set up the infrastructure required to train and run an LLM locally. This would likely be cost-prohibitive and impractical, however, given that this infrastructure is commonly available through major cloud vendors, including Amazon Web Services (AWS), Google Cloud, and Microsoft Azure.
Data. Generative AI models are trained on massive internet-scale data. For example, training data for OpenAI’s GPT-3 included Common Crawl, a publicly available repository of web crawl data, as well as Wikipedia, online books, and other sources. The use of data sets like Common Crawl implies that data from many websites such as those of the New York Times and Reddit was ingested during the training process. In addition, foundation models include domain-specific data that is crawled from the web, licensed from partners, or purchased from data marketplaces such as Snowflake Marketplace. While AI model developers release information on how the model was trained, they do not provide detailed information about the provenance of their training data sources. Still, researchers have been able to use techniques like prompt injection attacks to reveal the different data sources used to train the AI models.
Foundation models. Foundation models are neural networks broadly trained on massive data sets without being optimized for specific domains or downstream tasks such as drafting legal contracts or answering technical questions about a product. Foundation language models include closed-source models like OpenAI’s GPT-4 and Google’s Gemini as well as open-source models like Llama-2 from Meta and Falcon 40B from the United Arab Emirates’ Technology Innovation Institute. All of these models are based on the transformer architecture outlined in a seminal 2017 paper by Vaswani et al. While one could attempt to enter the generative AI stack by building a new foundation model, the data, computing resources, and technical expertise required to create and train high-performing models form a significant barrier to entry that has resulted in a small number of high-quality large foundation models.
RAGs and fine-tuned models. Foundation models are versatile and have good performance for a wide range of language tasks, but they might not have the best performance for specific contexts and tasks. To perform well at context-specific tasks, one might have to bring in domain-specific data. A service that aims to use an LLM for a specific purpose, such as helping users troubleshoot technical issues with a product, could take one of two approaches. The first involves building a service that retrieves snippets of information relevant to the end user’s question and appends the information to the instruction (prompt) sent to the foundation model. In our example of helping users troubleshoot problems, this entails writing code that extracts the pertinent material from the product manual that is most closely related to the end user’s question and instructing the LLM to answer the user’s question based on that information snippet. This approach is called retrieval-augmented generation (RAG). Foundation models have limits on the size of the prompts they can accept, but they can be as large as roughly 100,000 words. The expenses incurred in this approach include the foundation model’s API costs, which increase based on the size of the input prompt and the size of the output from the LLM. As a result, the more information from product manuals that is sent to the LLM, the greater the cost of using it.
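The retrieval step described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the manual snippets are invented, relevance is scored by plain word overlap rather than the embedding-based search a production service would more likely use, and the function names (tokenize, retrieve, build_prompt) are hypothetical. The call to the foundation model's API is left out; the sketch stops at the prompt that would be sent.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, snippets, top_k=1):
    """Rank snippets by the number of words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(snippets, key=lambda s: len(q_words & tokenize(s)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, context):
    """Append the retrieved snippets to the instruction sent to the model."""
    joined = "\n".join(context)
    return (
        "Answer the question using only the manual excerpts below.\n\n"
        f"Excerpts:\n{joined}\n\n"
        f"Question: {question}"
    )

# Hypothetical product-manual snippets, invented for illustration.
MANUAL = [
    "To reset the router, hold the reset button for ten seconds.",
    "The warranty covers hardware defects for two years.",
    "A blinking red light means the router has no internet connection.",
]

question = "Why is the light on my router blinking red?"
context = retrieve(question, MANUAL, top_k=1)
prompt = build_prompt(question, context)
print(prompt)
```

Raising top_k sends more of the manual to the model, which may improve answers but lengthens the prompt and so raises the per-query API cost.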
An alternative approach that is more expensive in terms of upfront computational costs is to fine-tune the model. In contrast to providing information snippets from a product manual (aka context) through prompts, this approach further retrains the foundation model’s neural network using domain-specific data. ChatGPT is fine-tuned to accept instructions and have a conversational interaction with people. Fine-tuning will involve retraining a pretrained foundation model like Llama or GPT-4 on a domain-specific data set.
While the RAG approach might involve high API costs because of the need to provide the model with long prompts, it is easier to implement than a fine-tuning approach and does not incur the compute costs of retraining the neural network on a new data set. As a result, the RAG approach has lower setup costs but higher variable fees from sending information to the foundation model each time an end user asks a question. However, the fine-tuning approach, while more expensive upfront, has the potential to produce better outcomes and, once complete, is available for future use without the need to share context for each query, as is the case with RAG. Companies with technical expertise might find that there is value to be captured in building out the tooling layer of generative AI, which enables approaches such as fine-tuning and RAGs, either for their own products or as a service to other companies.
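The trade-off between low setup costs with high variable fees (RAG) and high setup costs with lower variable fees (fine-tuning) can be made concrete with a rough break-even calculation. Every figure below is hypothetical and chosen only for illustration: the per-token prices, the prompt and output lengths, and the fine-tuning setup cost are assumptions, not vendor quotes. The sketch also assumes, for simplicity, that the fine-tuned model is billed at the same per-token rates as the base model.

```python
import math

def cost_per_query(prompt_tokens, output_tokens, price_in, price_out):
    """API cost of one query, with prices quoted per 1,000 tokens."""
    return prompt_tokens / 1000 * price_in + output_tokens / 1000 * price_out

def break_even_queries(setup_cost, rag_query_cost, tuned_query_cost):
    """Number of queries after which fine-tuning is cheaper overall.

    Returns None if the fine-tuned model does not save money per query.
    """
    saving = rag_query_cost - tuned_query_cost
    if saving <= 0:
        return None
    return math.ceil(setup_cost / saving)

# Hypothetical prices in dollars per 1,000 tokens.
PRICE_IN, PRICE_OUT = 0.01, 0.03

# RAG sends a long prompt (question plus retrieved context) on every query.
rag = cost_per_query(4000, 500, PRICE_IN, PRICE_OUT)
# The fine-tuned model needs only the question, so its prompt is short.
tuned = cost_per_query(200, 500, PRICE_IN, PRICE_OUT)

# Hypothetical one-time cost of fine-tuning, in dollars.
n = break_even_queries(5000, rag, tuned)
print(f"RAG: ${rag:.3f}/query, fine-tuned: ${tuned:.3f}/query, break-even after {n} queries")
```

Under these assumed numbers, fine-tuning pays for itself only at high query volumes; a service with few users may never reach the break-even point.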
LLM applications. The final layer of the stack consists of applications that can be built on top of either the foundation or fine-tuned model to serve a specific use case. Startups have built applications to draft legal contracts (Evisort), summarize books and movie screenplays (Jumpcut), or even respond to technical troubleshooting questions (Alltius). These applications are priced like traditional software-as-a-service applications (with monthly usage fees), and their marginal costs are mostly tied to application hosting fees on the cloud and API fees from foundation models.
In the past few months, tech giants and venture capitalists have made massive investments in each layer of this stack. Dozens of new foundation models have been launched. Similarly, companies have created task-specific models fine-tuned on proprietary data in the hope that they will give them an edge over competitors. And thousands of startup ventures are building applications on top of various foundation or fine-tuned models.
Lessons From the Cloud
Which players stand to gain the most value from these investments? Dozens of foundation models have been launched in the past several months, many of which can deliver performance comparable to the more popular foundation models. However, we believe that the market for foundation models could well consolidate among a few players in the same way that most of the market share (and value) for cloud services has been captured by the likes of Amazon, Google, and Microsoft.
There are three reasons most cloud infrastructure startups failed (or were bought by bigger rivals), and those reasons apply to generative AI as well. The first is the cost and capability required to create, sustain, and improve high-quality technical infrastructure. This is further compounded in the case of LLMs, given that the cost of the data and compute resources required to train these models is very high. This includes the cost of GPUs, which are in low supply relative to their demand. The computational cost of the final training run of Google’s 540 billion parameter PaLM model alone was somewhere between $9 million and $23 million, based on an outside estimate (and the total training cost was likely several multiples of that). Similarly, Meta’s investments in GPUs in 2023 and 2024 are estimated to be over $9 billion. Additionally, building a foundation model requires access to massive amounts of data, as well as significant experimentation and expertise, all of which can be very costly.
The second is demand-side network effects. As the ecosystem around these infrastructures grows, so too will barriers to new market entrants. Take PromptBase, an online marketplace for prompts to feed into LLMs. The greatest number of prompts on offer are for ChatGPT and the popular image generators Dall-E, Midjourney, and Stable Diffusion. A new LLM might be compelling from a technology standpoint, but if users have a trove of prompts that work well in ChatGPT and no proven prompts for the new LLM, they are likely to stick with ChatGPT. While underlying architectures are similar across LLMs, they are heavily engineered, and prompting strategies that work well with one LLM might not work as well with others. Meanwhile, user numbers are the gift that keeps on giving: ChatGPT, with its large user base, attracts developers to build plug-ins on top of it. Further, data from LLM usage creates a feedback loop that allows the models to improve. In general, the most popular LLMs will improve faster than smaller ones because they have more user data to work with, and those improvements bring in even more users.
A third factor is economies of scale. The supply-side benefits of resource pooling and demand aggregation will allow LLMs with large customer bases to have a lower cost per query than startup LLMs. These benefits include being able to negotiate better rates from GPU vendors and cloud service providers.
For the reasons above, even though new foundation models are being released on a regular basis, we expect that the foundation model market will consolidate around a few major players.
To Build or Borrow?
Companies looking to break into the market for generative AI services will have to decide whether to build applications on top of third-party foundation models like GPT-4 or build and host their own LLMs (either building on top of open-source alternatives or training them from scratch). Building on third-party models can come with security risks, such as the potential exposure of proprietary data. This risk is partly mitigated by using trusted cloud and LLM providers that can guarantee that customer data remains confidential and is not used to train and improve their models.
An alternative is to leverage open-source LLMs, such as Llama 2 and Falcon 40B, without relying on third-party providers like OpenAI. The appeal of open-source models is that they provide companies with complete and transparent access to the model, are often cheaper, and can be hosted on private clouds. However, open-source models are currently lagging GPT-4 in terms of performance on complex tasks such as code generation and mathematical reasoning. Further, hosting such models requires internal technical skills and knowledge whereas using an LLM hosted by a third party can be as simple as signing up for a service and using the provider’s APIs to access the functionality. Cloud providers have increasingly started to host open-source models and offer access to them via an API to address the concern.
A final alternative is for companies to build their own private models from scratch. The private-versus-public LLM question hinges on whether a company has the resources to deploy, manage, maintain, and continually improve that technology. Vendors are arising to help with these nontrivial tasks, but the hassle-free, one-stop shopping offered by the big cloud infrastructure providers, which host ready-to-go foundation models on their platforms, will be appealing.
The performance of an LLM is determined by the architecture of the neural network (the model) and the quantity and quality of the data it is trained on. Transformer models require large amounts of data. High-performing transformer models (ones that generate accurate, relevant, coherent, and unbiased content and are less likely to hallucinate) operate on a scale of over a trillion tokens (a basic unit of text for an LLM, often a word or sub-word) of internet data and billions of parameters (variables of a machine learning model that can be adjusted through training). The biggest LLMs have the best performance on a wide variety of tasks. But data quality and distinctiveness can be equally important to the effectiveness of LLMs, and models trained or fine-tuned on domain-specific data can outperform larger general-purpose models on specialized tasks in specific domains. For this reason, organizations that have access to large volumes of high-quality data in specific domains might have an advantage over other players in creating specialized models for their sectors.
For example, Bloomberg leveraged its access to financial data to build a model specialized for financial tasks. BloombergGPT is a 50 billion parameter model, compared with ChatGPT-3.5’s roughly 475 billion parameters, and yet early research showed it outperforming ChatGPT-3.5 on a series of benchmark financial tasks. Both models are trained on large data sets: BloombergGPT is trained on 700 billion tokens, and ChatGPT-3.5 on 500 billion. But more than 52% of BloombergGPT’s training data set consists of curated financial sources, which gives it an advantage on domain-specific tasks. That said, a recent study showed GPT-4, with enhanced prompts, outperforming BloombergGPT for simple financial tasks, such as sentiment analysis. We attribute this to the greater model size of GPT-4 (which has been reported to have over a trillion parameters). In short, while models built on domain-specific data can deliver strong performance with smaller and cheaper models, training them should not be a onetime exercise and will require continual investment, given that generalist models are constantly growing and improving.
Incumbents in other sectors with domain-specific data, such as insurance, media, and health care, are likely to benefit from specialized LLMs as well.
The Importance of User Interface
Companies building applications on top of foundation models (often referred to as GPT wrappers) face the conundrum that competitors can easily replicate the functionality of their applications by building on top of the same or superior foundation models. In the absence of model- or data-based differentiation, companies will need to distinguish themselves at the end of the pipeline — the interface where machine intelligence meets the user.
We believe that the advantage here lies with apps with an established audience. Take the example of GitHub Copilot, a generative AI-powered tool that writes code. It runs on OpenAI’s Codex code generator (and, more recently, GPT-4) and is distributed through GitHub, a software development platform owned by Microsoft. The 100 million developers who use GitHub provide it with a massive distributional advantage over startups working on similar code-generation products. Analytics from such a large user base also give GitHub a distinct advantage in terms of improving the model and integrating it into its software development platform. (However, companies will face a challenge in balancing AI model enhancement with user privacy concerns. A case in point is the public outrage that prompted Zoom to reverse a change to its terms of service that would have allowed it to use customer content for training AI models.)
Tech incumbents will have a natural inclination toward vertical integration, where the LLM creator also owns the app. Google is already integrating its LLM capabilities into Google Docs and Gmail, just as Microsoft is doing with its suite of products through its partnership with OpenAI.
At the same time, incumbents in specific domains that do not have their own LLMs might find success building applications tailored for their existing user base on top of third-party LLMs. They can leverage their last-mile access to easily bundle new AI-enabled capabilities into their existing offerings and get them into the hands of customers faster than competitors can. In other words, if there is a large and competitive market of LLMs that provide roughly equivalent features, applications that boast a large and loyal user base stand to capture the most value from generative AI by leveraging their distributional advantage at the top of the generative AI stack. This advantage, coupled with the data advantages that incumbents already possess, poses challenges for new entrants.
What are the implications for managers who are formulating their generative AI strategies? If their company is an incumbent in its industry, they need to think hard about what complex domain-specific tasks can be better addressed using proprietary data. This proprietary edge will allow a business to deliver unique value to its customers. When the functionality of a company’s AI-enabled services and products is easily replicated by competitors (either because they have similar data or because generalist LLMs can achieve similar capabilities), its ability to immediately roll out these applications to a large installed base and iterate based on massive amounts of customer data will have to be the source of its competitive edge. Entrepreneurs building startups without access to proprietary data or a large installed base will have to recognize and build around their disadvantages. They will have to use larger general-purpose models that match fine-tuned models, at least on simpler domain-specific tasks, to build their initial products and services. They will have to rely on their agility and count on incumbent inertia to capitalize on an earlier start than powerful but slower incumbents.
Copyright Issues
A wide variety of content creators have raised concerns about copyright and intellectual property used to train LLMs. The New York Times has filed a lawsuit against OpenAI claiming that it used the Times’ content to train its models and to create substitutive products. Other companies, authors, and programmers have also filed lawsuits against owners of LLMs for similar reasons. While OpenAI has argued that training models on copyrighted content falls under fair use provisions of copyright law, it has also recognized the need to establish new content use agreement models and signed a deal with media company Axel Springer to use its content to train OpenAI models.
Regardless of how these lawsuits are resolved, the underlying concerns around training models on intellectual property created by other parties are likely to result in a greater advantage for established players in generative AI. Unlike smaller or newer entities, companies like Google, Microsoft, and Meta have the resources to battle copyright claims in court, sign licensing agreements with content creators, and indemnify users from any copyright claims on content created using their models.
Hang Together or Hang Alone
What does this mean for companies offering generative AI products and services? Those that are looking to create a new foundation model might struggle to compete on model performance with existing players. The way to compete beyond model performance is to build out the ecosystem as well as the tools for each layer of the stack, such as tools that make it particularly easy for an application developer to fine-tune or apply RAG on top of a foundation model. If an organization has large-scale domain-specific data, a domain-specific LLM will allow it to differentiate from the general-purpose models.
Companies seeking to experiment with and eventually integrate generative AI into their workflows or products need to quickly develop clarity on the use cases and tasks that are good candidates for a proof of concept. Some of the relevant criteria are based on the answers to three questions:
1. Is the use case unregulated? Companies in highly regulated industries, such as health care and financial services, might be subject to significant compliance and audit burdens that make it infeasible for them to plan product development and launch in short cycles. Given the rapid progress and changes in generative AI, it is important to iterate fast, so multiyear product cycles are not ideal test beds for generative AI today.
2. Are errors manageable? Erroneous or biased outputs are inevitable with generative AI. There are many examples of LLM hallucinations as well as biases in AI-generated text and images. Similarly, there are examples of data confidentiality issues from enterprise use of third-party generative AI models. Developing internal competency in detecting and correcting AI hallucinations and ensuring LLM security and data privacy will be crucial.
3. Do you have unique data and domain knowledge to enable and govern fine-tuning or RAG? If an application is merely a GPT wrapper, the functionality is easily replicable and the application is likely to neither provide any competitive advantage nor solve unique industry- or company-specific challenges. Leveraging proprietary, first-party data will be critical to delivering value and building differentiation.
Once a compelling use case has been identified, the next decision is whether to make or buy a model and/or applications on top of it, taking into account vendor strategies and the organization’s own technology maturity and strategy. Building an internal team with awareness of relevant technologies and the metrics to assess ROI is essential to making an informed decision. Given how quickly this technology is developing, a company’s decision to build its own models should account for its ability not just to build a model but also to continue advancing it to keep pace with the market.