A Blog by Jonathan Low

 

Dec 6, 2024

A New Service Ranks AI Systems By Forcing Their Bots To Compete

Measures that matter. As AI use becomes more widespread, the cost of the investment demands that organizations deploying it get a clearer sense of the utility - and economic return - it provides. 

Rather than relying on the claims of the big players like OpenAI, Anthropic and their big tech backers, a new service gives users the ability to compare AI responses to a set of questions in order to determine which is better for their needs. Demand for the service has grown exponentially and the biggest players are following it obsessively. The only problem from a Silicon Valley standpoint is that it introduces a factor that tech has worked very hard to eliminate: competition. JL

Miles Kruppa reports in the Wall Street Journal:

Traditionally, AI technologies have been assessed through advanced math, science and law tests. Chatbot Arena lets users ask a question, get answers from two anonymous AI models and rate which one is better. Chatbot Arena now ranks more than 170 models that have garnered a total of two million votes. It has expanded to include separate rankings for categories such as creative writing, coding and instruction-following. 

Record labels have the Billboard Hot 100. College football has its playoff rankings. Artificial intelligence has a website, run by two university students, called Chatbot Arena. 

Roommates Anastasios Angelopoulos and Wei-Lin Chiang never imagined the graduate school project they developed last year would quickly become the most-watched ranking of the world’s best AI systems.

Traditionally, AI technologies have been assessed through advanced math, science and law tests. Chatbot Arena lets users ask a question, get answers from two anonymous AI models and rate which one is better. 

“Everyone is striving to be at the top of this leaderboard,” said Joseph Spisak, a director of product management at Meta Platforms working on AI. “It’s amazing to have a few students get together and be able to create that level of impact.”

The Chatbot Arena leaderboard on Dec. 4.

Chatbot Arena has taken off as tech companies spend billions on a bet that AI will be the defining technology of the coming decades. Any perceived advantage over the competition can make a big difference in attracting customers and talent, which is why so many tech executives and engineers follow Chatbot Arena the way Wall Street traders watch the markets.

University of California, Berkeley researchers launched Chatbot Arena in April 2023 to compare AI technology they had developed against other open-source chatbots by using a scoring system similar to one used by professional chess rankings. Within a week, the site had received 4,700 votes. 
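
To make the chess-style scoring concrete, here is a minimal Python sketch of an Elo-style update applied to head-to-head votes. It is an illustration under assumptions, not Chatbot Arena's code: the model names, starting ratings and K-factor are made up for the example.

```python
# Illustrative Elo-style update for pairwise chatbot votes.
# Assumptions: starting ratings of 1000 and K = 32 are arbitrary choices
# for this sketch, not values used by Chatbot Arena.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    return (rating_a + k * (score_a - exp_a),
            rating_b + k * ((1.0 - score_a) - (1.0 - exp_a)))

# Hypothetical votes between two anonymous models.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
for winner, loser in [("model_x", "model_y"), ("model_x", "model_y"),
                      ("model_y", "model_x")]:
    ratings[winner], ratings[loser] = update(ratings[winner], ratings[loser], True)

print(ratings)  # a higher rating means more votes won against stronger opponents
```

Aggregated over thousands of such votes, each model's rating reflects how often users preferred it, weighted by the strength of the models it was compared against.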

The project quickly caught the attention of big AI companies, which began asking the Chatbot Arena leaders to include their technologies in the rankings. OpenAI surged to the top of the leaderboard, only to be dethroned in March this year by its rival Anthropic.

After launching with nine AI models, Chatbot Arena now ranks more than 170 models that have garnered a total of two million votes. It has expanded to include separate rankings for categories such as creative writing, coding and instruction-following. 

Angelopoulos and Chiang are still trying to complete their doctorates in computer science. It’s slow going, though, as running the leaderboard, which they do without compensation, takes most of their time.

“My girlfriend is hearing about Chatbot Arena all day and all night,” said Angelopoulos.

Scoring on vibes

Researchers say academic benchmarks have become less useful over time because their questions have made it into the large language models, or LLMs, that underpin AI applications—essentially letting them learn the answers in advance. 

A look at how voting on Chatbot Arena works.

Google and OpenAI have claimed scores above 90% on a commonly used benchmark, known as Measuring Massive Multitask Language Understanding, released four years ago. One of its creators, Dan Hendrycks, recently began crowdsourcing the hardest possible questions for a new benchmark he has named “Humanity’s Last Exam.”

“A benchmark might be very challenging for LLMs when it’s first released, but then the next generation of LLMs comes and they reach near-perfect performance,” said Colin White, head of research at Abacus.AI, which has developed a benchmark called LiveBench that releases new questions monthly.

While Chatbot Arena’s head-to-head format can’t be aced like a test, it doesn’t always measure objective criteria or whether chatbots stick to verified facts. That is why some researchers call the approach “vibes-based evaluations.”

Chatbot Arena’s leaders said they have been transparent about the site’s limitations. They allow visitors to strip style-based variables, such as response length and formatting, out of the rankings.

“Human preference is a critical signal,” Angelopoulos said. “There are subjective aspects to these queries.”
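
The idea of stripping style effects out of the rankings can be illustrated with a small sketch. The code below fits a Bradley-Terry-style logistic regression on synthetic votes that include a response-length covariate, then reads model strength from the model coefficients while setting the style coefficient aside. The data, the library choice (scikit-learn) and the modeling details are assumptions made for the illustration; this is one plausible way to control for style, not a description of Chatbot Arena's actual method.

```python
# Sketch: separating model strength from a style confounder (response length)
# in pairwise votes, using a Bradley-Terry-style logistic regression.
# All data here is synthetic and the approach is illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

models = ["model_a", "model_b", "model_c"]
rng = np.random.default_rng(0)

X, y = [], []
for _ in range(2000):
    i, j = rng.choice(len(models), size=2, replace=False)
    row = np.zeros(len(models) + 1)
    row[i], row[j] = 1.0, -1.0          # +1 for the first model shown, -1 for the second
    length_diff = rng.normal()          # synthetic style signal: length difference
    row[-1] = length_diff
    # Synthetic "truth": both model identity and longer answers sway the vote.
    logit = (i - j) * 0.5 + 0.8 * length_diff
    y.append(int(rng.random() < 1.0 / (1.0 + np.exp(-logit))))
    X.append(row)

clf = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))
strengths = clf.coef_[0][:len(models)]  # the style coefficient is held out of the ranking
for name, s in sorted(zip(models, strengths), key=lambda t: -t[1]):
    print(f"{name}: {s:+.2f}")
```

Ranking by the model coefficients rather than raw win rates gives an ordering that is less influenced by how long or heavily formatted the answers were.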

Mystery AIs

Angelopoulos and Chiang have enlisted more than a dozen other contributors to the project, which they hope will grow into something akin to a Wikipedia for AI. They said they aren’t considering making it a for-profit venture.

“The good thing is that there’s multiple possibilities,” Angelopoulos said.

As Chatbot Arena has grown, AI enthusiasts have scrutinized new entrants in hopes of identifying technologies that haven’t been released to the public. A mysterious model called “im-also-a-good-gpt2-chatbot,” released on Chatbot Arena in May, turned out to be GPT-4o, the technology that currently powers ChatGPT.

Elon Musk’s xAI, Meta and Google have also tested technologies on the site before releasing them to the wider public, according to Chatbot Arena. “We get company requests literally every day,” Chiang said.

Members of the Chatbot Arena project team, a mix of undergraduates and doctoral students, talk after a team meeting in their lab space at Soda Hall at UC Berkeley. Photo: Laura Morton for WSJ
Justin Wong whiteboards an idea in the lab housing Chatbot Arena. Photo: Laura Morton for WSJ

In October, an AI model from a Chinese company called 01.AI suddenly appeared in sixth place on the leaderboard, drawing attention to that country’s progress in the field. 01.AI’s chief executive, former Google executive Kai-Fu Lee, promoted the achievement with an X post noting the model had surpassed the original GPT-4o technology released by OpenAI. 

The user feedback Chatbot Arena collects has become a valuable data source for developers. The site has periodically released 20% of the data it collects—just enough to be useful without making it possible for companies to game the system, Angelopoulos and Chiang said. 

Google uses the data to look for patterns in how tech-savvy users are interacting with chatbots, said Kate Olszewska, a Google product manager working on AI. 

The search giant in November tied with OpenAI for first place on Chatbot Arena after listing an experimental version of its Gemini technology on the site. OpenAI a few days later leapt ahead of Google with an updated version of GPT-4o—only to have Google catch up with another model release soon after.

Oriol Vinyals, a Google executive who helps oversee the development of Gemini, promoted the news with a post on X pointing to the leaderboard results and topped with three popcorn emojis.
