Benj Edwards reports in Ars Technica:
Researchers from Stanford and Berkeley published a paper that fuels a common-but-unproven belief that GPT-4 has grown worse at coding and compositional tasks over the past few months. GPT-4's ability to identify prime numbers plunged from an accuracy of 97.6% in March to 2.4% in June. Theories about why include OpenAI "distilling" models to reduce their computational overhead to speed output and save GPU resources, fine-tuning to reduce harmful outputs, and conspiracy theories such as OpenAI reducing GPT-4's capabilities so more people will pay for GitHub Copilot. But the lack of transparency may be the biggest story. "How are we to build dependable software on a platform that changes in undocumented and mysterious ways every few months?"
On Tuesday, researchers from Stanford University and University of California, Berkeley published a research paper that purports to show changes in GPT-4's outputs over time. The paper fuels a common-but-unproven belief that the AI language model has grown worse at coding and compositional tasks over the past few months. Some experts aren't convinced by the results, but they say that the lack of certainty points to a larger problem with how OpenAI handles its model releases.
In a study titled "How Is ChatGPT’s Behavior Changing over Time?" published on arXiv, Lingjiao Chen, Matei Zaharia, and James Zou cast doubt on the consistent performance of OpenAI's large language models (LLMs), specifically GPT-3.5 and GPT-4. Using API access, they tested the March and June 2023 versions of these models on tasks like math problem-solving, answering sensitive questions, code generation, and visual reasoning. Most notably, GPT-4's ability to identify prime numbers reportedly plunged dramatically from an accuracy of 97.6 percent in March to just 2.4 percent in June. Strangely, GPT-3.5 showed improved performance in the same period.
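To make the accuracy figures concrete, here is a minimal sketch of how a yes/no primality benchmark could be scored. It is not the study's own evaluation code: the helper names (`is_prime`, `parse_yes_no`, `accuracy`), the reply-parsing rule, and the sample replies are all hypothetical, chosen only to illustrate the idea.

```python
import re
from math import isqrt
from typing import Optional


def is_prime(n: int) -> bool:
    """Ground-truth primality check by trial division (fine for small n)."""
    if n < 2:
        return False
    for d in range(2, isqrt(n) + 1):
        if n % d == 0:
            return False
    return True


def parse_yes_no(reply: str) -> Optional[bool]:
    """Map a free-text reply to True/False; None if the answer is unclear.

    Assumption: a reply counts as an answer only if it contains exactly
    one of the words "yes" or "no".
    """
    words = re.findall(r"[a-z]+", reply.lower())
    if "yes" in words and "no" not in words:
        return True
    if "no" in words and "yes" not in words:
        return False
    return None


def accuracy(samples: list) -> float:
    """Fraction of (number, reply) pairs whose reply matches the ground truth.

    Unparseable replies compare unequal to both True and False,
    so they are scored as wrong.
    """
    correct = sum(parse_yes_no(reply) == is_prime(n) for n, reply in samples)
    return correct / len(samples)


# Made-up replies for illustration only; not data from the paper.
samples = [(7, "Yes"), (9, "Yes, it is prime."), (11, "No"), (15, "No, 15 = 3 * 5.")]
print(accuracy(samples))
```

Note that a score computed this way depends heavily on how replies are parsed and on which numbers are in the test set, so a large swing between model versions does not by itself show where the change came from.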
This study comes on the heels of people frequently complaining that GPT-4 has subjectively declined in performance over the past few months. Popular theories about why include OpenAI "distilling" models to reduce their computational overhead in a quest to speed up the output and save GPU resources, fine-tuning (additional training) to reduce harmful outputs that may have unintended effects, and a smattering of unsupported conspiracy theories such as OpenAI reducing GPT-4's coding capabilities so more people will pay for GitHub Copilot.
Meanwhile, OpenAI has consistently denied any claims that GPT-4 has decreased in capability. As recently as last Thursday, OpenAI VP of Product Peter Welinder tweeted, "No, we haven't made GPT-4 dumber. Quite the opposite: we make each new version smarter than the previous one. Current hypothesis: When you use it more heavily, you start noticing issues you didn't see before."
While this new study may look like a smoking gun that confirms the hunches of GPT-4's critics, others say not so fast. Princeton computer science professor Arvind Narayanan thinks that its findings don't conclusively prove a decline in GPT-4's performance and are potentially consistent with fine-tuning adjustments made by OpenAI. For example, in measuring code generation capabilities, he criticized the study for checking whether the generated code could be executed immediately rather than whether it was correct.
"The change they report is that the newer GPT-4 adds non-code text to its output. They don't evaluate the correctness of the code (strange)," he tweeted. "They merely check if the code is directly executable. So the newer model's attempt to be more helpful counted against it."
AI researcher Simon Willison also challenges the paper's conclusions. "I don't find it very convincing," he told Ars. "A decent portion of their criticism involves whether or not code output is wrapped in Markdown backticks or not." He also finds other problems with the paper's methodology. "It looks to me like they ran temperature 0.1 for everything," he said. "It makes the results slightly more deterministic, but very few real-world prompts are run at that temperature, so I don't think it tells us much about real-world use cases for the models."
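The formatting objection raised by Narayanan and Willison can be illustrated with a short sketch. It assumes, purely for illustration, that "directly executable" is approximated by whether the raw output parses as Python; the paper's actual check may differ. The names `is_directly_executable` and `extract_code` and the sample output are hypothetical.

```python
import re

FENCE = "`" * 3  # the Markdown code-fence delimiter (three backticks)

# Matches the first fenced block: opening fence, optional language tag,
# newline, then the code up to the closing fence.
FENCE_RE = re.compile(r"`{3}[\w+-]*\n(.*?)`{3}", re.DOTALL)


def is_directly_executable(output: str) -> bool:
    """Return True if the raw model output parses as Python source.

    Assumption: a syntax check stands in for actually running the code.
    """
    try:
        compile(output, "<model-output>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def extract_code(output: str) -> str:
    """Return the contents of the first Markdown fence, or the text unchanged."""
    match = FENCE_RE.search(output)
    return match.group(1) if match else output


# A made-up model reply that wraps correct code in a Markdown fence.
fenced = FENCE + "python\nprint(sum(range(5)))\n" + FENCE

print(is_directly_executable(fenced))                # raw output, fence included
print(is_directly_executable(extract_code(fenced)))  # fence stripped first
```

Under this assumption, the same code fails the check when the fence is left in place and passes once it is stripped, which is why a formatting change alone could register as a drop in "executable" output.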
So far, Willison thinks that any perceived change in GPT-4's capabilities comes from the novelty of LLMs wearing off. After all, GPT-4 sparked a wave of AGI panic shortly after launch and was once tested to see if it could take over the world. Now that the technology has become more mundane, its faults seem glaring.
"When GPT-4 came out, we were still all in a place where anything LLMs could do felt miraculous," Willison told Ars. "That's worn off now and people are trying to do actual work with them, so their flaws become more obvious, which makes them seem less capable than they appeared at first."
For now, OpenAI is aware of the new research and says it is monitoring reports of declining GPT-4 capabilities. "The team is aware of the reported regressions and looking into it," tweeted Logan Kilpatrick, OpenAI's head of developer relations, on Wednesday.
While the paper by Chen, Zaharia, and Zou may not be perfect, Willison sympathizes with the difficulty of measuring language models accurately and objectively. Time and again, critics point to OpenAI's currently closed approach to AI, which for GPT-4 did not reveal the source of training materials, source code, neural network weights, or even a paper describing its architecture.
With a closed black box model like GPT-4, researchers are left stumbling in the dark trying to define the properties of a system that may have additional unknown components, such as safety filters, or the recently rumored eight "mixture of experts" models working in concert under GPT-4's hood. Additionally, the model may change at any time without warning.
"AI model providers are trailing traditional software infrastructure best practices," says writer and futurist Daniel Jeffries, who thinks that AI vendors need to continue long-term support for older versions of models when they roll out changes "so that software developers can build on top of a dependable artifact, not one that is going to change overnight on them without warning."
One solution to this developer instability and researcher uncertainty may be open source or source-available models such as Meta's Llama. With widely distributed weights files (the core of the model's neural network data), these models can allow researchers to work from the same baseline and provide repeatable results over time without a company (like OpenAI) suddenly swapping models or revoking access through an API.
Along these lines, AI researcher Sasha Luccioni of Hugging Face also thinks OpenAI's opacity is problematic. "Any results on closed-source models are not reproducible and not verifiable, and therefore, from a scientific perspective, we are comparing raccoons and squirrels," she told Ars. "It's not on scientists to continually monitor deployed LLMs. It's on model creators to give access to the underlying models, at least for audit purposes."
Luccioni noted the lack of standardized benchmarks in the field that would make comparing different versions of the same model easier. She says that with every model release, AI model developers should include results from common benchmarks like SuperGLUE and WikiText, and also from bias benchmarks like BOLD and HONEST. "They should actually provide raw results, not only high-level metrics, so we can look at where they do well and how they fail," she says.
Willison agrees. "Honestly, the lack of release notes and transparency may be the biggest story here," he told Ars. "How are we meant to build dependable software on top of a platform that changes in completely undocumented and mysterious ways every few months?"