Generative AI tools are quickly 'running out of text' to train themselves on, UC Berkeley professor warns

Photo of a phone displaying the OpenAI logo.
OpenAI's ChatGPT is among many chatbots trained on large language models that may be "running out of text" to train on, said Stuart Russell, a computer science professor at the University of California, Berkeley. Beata Zawrzel/NurPhoto via Getty Images
  • A Berkeley professor said AI developers are "running out of text" to train chatbots at a UN summit.

  • He added that the strategy behind training large language models is "starting to hit a brick wall."

  • It's the latest concern raised regarding OpenAI and other AI developers' data-collection practices.

ChatGPT and other AI-powered bots may soon be "running out of text in the universe" that trains them to know what to say, an artificial intelligence expert and professor at the University of California, Berkeley says.

Stuart Russell said that the technology that hoovers up mountains of text to train artificial intelligence bots like ChatGPT is "starting to hit a brick wall." In other words, there's only so much digital text for these bots to ingest, he told an interviewer from the International Telecommunication Union, a UN communications agency, last week.

This may impact the way generative AI developers collect data and train their technologies in the coming years, but Russell still thinks AI will replace humans in many jobs that he characterized in the interview as "language in, language out."

Russell's predictions sharpen the spotlight trained in recent weeks on the data harvesting that OpenAI and other generative AI developers conduct to train large language models, or LLMs.

The data-collection practices integral to ChatGPT and other chatbots are facing increased scrutiny, including from creatives concerned about their work being replicated without their consent and from social media executives disgruntled that their platforms' data is being used freely. But Russell's insights point toward another potential vulnerability: a shortage of text with which to build these training datasets.

A study conducted last November by Epoch, a group of AI researchers, estimated that machine learning datasets will likely deplete all "high-quality language data" before 2026. Language data in "high-quality" sets comes from sources such as "books, news articles, scientific papers, Wikipedia, and filtered web content," according to the study.

The LLMs powering today's most popular generative AI tools were trained on massive amounts of published text culled from public online sources, including digital news outlets and social media sites. Elon Musk has said that "data scraping" of the latter is what drove him to limit how many tweets users can view each day.

In an email to Insider, Russell said a number of unconfirmed reports indicate that OpenAI, the company behind ChatGPT, has purchased text datasets from private sources. Russell added that while there are possible explanations for such a purchase, "the natural inference is that there isn't enough high-quality public data left."

OpenAI did not immediately respond to a request for comment ahead of publication.

Russell said in the interview that OpenAI, in particular, had to have "supplemented" its public language data with "private archive sources" to create GPT-4, the company's strongest and most advanced AI model to date. But he acknowledged in the email to Insider that OpenAI has yet to detail GPT-4's exact training datasets.

Several lawsuits filed against OpenAI in the past few weeks allege the company used datasets containing personal data and copyrighted materials to train ChatGPT. Among the biggest was a 157-page lawsuit filed by 16 unnamed plaintiffs, who claim OpenAI used sensitive data such as private conversations and medical records.

The latest legal challenge, brought by lawyers for the comedian Sarah Silverman and two other authors, accused OpenAI of copyright infringement over ChatGPT's ability to produce accurate summaries of their work. Separately, the authors Mona Awad and Paul Tremblay filed a lawsuit against OpenAI in late June making similar allegations.

OpenAI has not made any public comments on the slate of lawsuits filed against it. Its CEO Sam Altman has also refrained from discussing the allegations, but in the past has expressed a desire to avoid legal troubles.

At a June tech conference in Abu Dhabi, Altman told the audience he had no plans to take OpenAI public through an IPO, reasoning that the company's unorthodox structure and decision-making could lead to clashes with investors.

"I don't really want to be like sued by a bunch of like public market, Wall Street whatevers," Altman said.

 
