A Blog by Jonathan Low

Aug 11, 2024

Is OpenAI's Growth Plan Being Impacted By Access-Blocked Training Data?

As more sources take active steps to block OpenAI's access to their data, the answer appears to be yes.

The days of Silicon Valley's high-margin "your data is my data" approach may be waning. JL

Kyle Wiggers reports in TechCrunch:

More than 35% of the world’s top 1,000 websites now block OpenAI’s web crawler, and around 25% of data from “high-quality” sources has been restricted from the major datasets used to train AI models. (As a result) OpenAI has in recent months taken more incremental steps than leaps in gen AI, opting to fine-tune its tools as it trains the successor to its current leading models. Should the current access-blocking trend continue, the research group Epoch AI predicts that developers will run out of data to train gen AI models between 2026 and 2032. That, along with fear of copyright lawsuits, has forced OpenAI to enter costly licensing agreements with publishers and data brokers.

Last year, OpenAI held a splashy press event in San Francisco during which the company announced a bevy of new products and tools, including the ill-fated App Store-like GPT Store.

This year will be a quieter affair, however. On Monday, OpenAI said it’s changing the format of its DevDay conference from a tentpole event into a series of on-the-road developer engagement sessions. The company also confirmed that it won’t release its next major flagship model during DevDay, instead focusing on updates to its APIs and developer services.

“We’re not planning to announce our next model at DevDay,” an OpenAI spokesperson told TechCrunch. “We’ll be focused more on educating developers about what’s available and showcasing dev community stories.”

OpenAI’s DevDay events this year will take place in San Francisco on October 1, London on October 30, and Singapore on November 21. All will feature workshops, breakout sessions, demos with OpenAI product and engineering staff, and developer spotlights. Registration will cost $450 (or $0 through scholarships available for eligible attendees), with applications closing on August 15.

OpenAI has in recent months taken more incremental steps than monumental leaps in generative AI, opting to hone and fine-tune its tools as it trains the successor to its current leading models, GPT-4o and GPT-4o mini. The company has refined its approaches to improving the overall performance of its models and to keeping them from going off the rails as often as they once did, but OpenAI appears to have lost its technical lead in the generative AI race, at least according to some benchmarks.

One of the reasons could be the increasing challenge of finding high-quality training data.

OpenAI’s models, like most generative AI models, are trained on massive collections of web data — web data that many creators are choosing to gate over fears that their data will be plagiarized or that they won’t receive credit or pay. More than 35% of the world’s top 1,000 websites now block OpenAI’s web crawler, according to data from Originality.AI. And around 25% of data from “high-quality” sources has been restricted from the major datasets used to train AI models, a study by MIT’s Data Provenance Initiative found.

Should the current access-blocking trend continue, the research group Epoch AI predicts that developers will run out of data to train generative AI models between 2026 and 2032. That — and fear of copyright lawsuits — has forced OpenAI to enter costly licensing agreements with publishers and various data brokers.

OpenAI is said to have developed a reasoning technique that could improve its models’ responses on certain questions, particularly math questions, and the company’s CTO Mira Murati has promised a future model with “Ph.D.-level” intelligence. (OpenAI revealed in a blog post in May that it had begun training its next “frontier” model.) That’s pledging a lot — and there’s high pressure to deliver. OpenAI’s reportedly hemorrhaging billions of dollars training its models and hiring top-paid research staff.

OpenAI still faces many controversies, such as using copyrighted data for training, restrictive employee NDAs, and effectively pushing out safety researchers. The slower product cycle might have the beneficial side effect of countering the narrative that OpenAI has deprioritized work on AI safety in the pursuit of more capable, powerful generative AI technologies.
