ChatGPT-like AI models running out of text to train on, claims UC Berkeley professor

AFP

Stuart Russell, an artificial intelligence expert and professor at the University of California, Berkeley, has raised concerns that AI-powered language models, such as ChatGPT, are potentially "running out of text in the universe" on which to train.

He explained that the technology behind AI bots, which rely on vast amounts of text data, is "starting to hit a brick wall".

Russell shared this insight last week during an interview with the International Telecommunication Union, the UN's communications agency. He emphasised that there is a finite amount of digital text available for these language models to consume.

The implications of this text scarcity may influence the future practices of generative AI developers as they collect data and train their technologies.

Even so, he maintained his belief that AI will increasingly replace humans in language-dependent jobs, which he described during the interview as "language in, language out" tasks. His comments added to the ongoing discussion about the data acquisition practices of OpenAI and other developers of generative AI models.

Concerns have been raised by creators worried about their work being replicated without consent, as well as by social media executives dissatisfied with the unrestricted use of their platforms' data. Russell's observations drew attention to another potential vulnerability: the scarcity of text available for building these training datasets.

A study conducted in November by Epoch, a group of AI researchers, found that machine learning datasets are likely to exhaust all "high-quality language data" before 2026. The study defined "high-quality" language data as coming from sources such as "books, news articles, scientific papers, Wikipedia, and filtered web content".

Today's most popular generative AI tools, powered by large language models (LLMs), were trained on massive amounts of published text scraped from public online sources, including digital news platforms and social media websites. Such "data scraping" from the latter was, Elon Musk previously said, a contributing factor behind his decision to limit daily tweet views.

Russell highlighted in the interview that OpenAI, in particular, had to supplement its public language data with "private archive sources" to develop GPT-4, the company's most advanced AI model to date. In an email to Insider, however, he acknowledged that OpenAI has yet to disclose the exact training datasets used for GPT-4. Recent lawsuits filed against OpenAI allege that datasets containing personal data and copyrighted material were used to train ChatGPT. Notably, one prominent lawsuit, filed by 16 unnamed plaintiffs, asserts that OpenAI used sensitive data such as private conversations and medical records.

Another lawsuit, brought by comedian Sarah Silverman and two other authors, accuses OpenAI of copyright infringement because ChatGPT can generate accurate summaries of their work. Authors Mona Awad and Paul Tremblay filed a similar lawsuit against OpenAI in late June.
