Generative AI tools are quickly 'running out of text' to train themselves on, UC Berkeley professor warns

Photo of a phone displaying the OpenAI logo.
OpenAI's ChatGPT is among many chatbots trained on large language models that may be "running out of text" to train on, said Stuart Russell, a computer science professor at the University of California, Berkeley. Beata Zawrzel/NurPhoto via Getty Images
  • A Berkeley professor said AI developers are "running out of text" to train chatbots at a UN summit.

  • He added that the strategy behind training large language models is "starting to hit a brick wall."

  • It's the latest concern raised regarding OpenAI and other AI developers' data-collection practices.

ChatGPT and other AI-powered bots may soon be "running out of text in the universe" that trains them to know what to say, an artificial intelligence expert and professor at the University of California, Berkeley says.

Stuart Russell said that the technology that hoovers up mountains of text to train artificial intelligence bots like ChatGPT is "starting to hit a brick wall." In other words, there's only so much digital text for these bots to ingest, he told an interviewer from the International Telecommunication Union, a UN communications agency, last week.

This may impact the way generative AI developers collect data and train their technologies in the coming years, but Russell still thinks AI will replace humans in many jobs that he characterized in the interview as "language in, language out."

Russell's predictions sharpen the spotlight trained in recent weeks on the data harvesting that OpenAI and other generative AI developers conduct to train large language models, or LLMs.

The data-collection practices integral to ChatGPT and other chatbots are facing increased scrutiny, including from creatives concerned about their work being replicated without their consent and from social media executives disgruntled that their platforms' data is being used freely. But Russell's insights point toward another potential vulnerability: a shortage of text with which to build these training datasets.

A study conducted last November by Epoch, a group of AI researchers, estimated that machine learning datasets will likely deplete all "high-quality language data" before 2026. Language data in "high-quality" sets comes from sources such as "books, news articles, scientific papers, Wikipedia, and filtered web content," according to the study.

The LLMs powering today's most popular generative AI tools were trained on massive amounts of published text culled from public online sources, including digital news outlets and social media sites. Elon Musk has said that "data scraping" of the latter is what drove him to limit how many tweets users can view each day.

In an email to Insider, Russell said a number of unconfirmed reports indicate that OpenAI, the company behind ChatGPT, has purchased text datasets from private sources. Russell added that while there are possible explanations for such a purchase, "the natural inference is that there isn't enough high-quality public data left."

OpenAI did not immediately respond to a request for comment ahead of publication.

Russell said in the interview that OpenAI, in particular, had to have "supplemented" its public language data with "private archive sources" to create GPT-4, the company's strongest and most advanced AI model to date. But he acknowledged in the email to Insider that OpenAI has yet to detail GPT-4's exact training datasets.

Several lawsuits filed against OpenAI in the past few weeks allege the company used datasets containing personal data and copyrighted materials to train ChatGPT. Among the biggest was a 157-page lawsuit filed by 16 unnamed plaintiffs, who claim OpenAI used sensitive data such as private conversations and medical records.

The latest legal challenge, brought by lawyers for the comedian Sarah Silverman and two other authors, accused OpenAI of copyright infringement over ChatGPT's ability to produce accurate summaries of their work. Separately, the authors Mona Awad and Paul Tremblay filed a lawsuit against OpenAI in late June making similar allegations.

OpenAI has not made any public comments on the slate of lawsuits filed against it. Its CEO Sam Altman has also refrained from discussing the allegations, but in the past has expressed a desire to avoid legal troubles.

At a June tech conference in Abu Dhabi, Altman told the audience he had no plans to take OpenAI public through an IPO, reasoning that the company's unorthodox structure and decision-making could lead to clashes with investors.

"I don't really want to be like sued by a bunch of like public market, Wall Street whatevers," Altman said.

 
