ChatGPT-like AI models running out of text to train on, claims UC Berkeley professor

AFP

Stuart Russell, an artificial intelligence expert and professor at the University of California, Berkeley, has raised concerns that AI-powered language models, such as ChatGPT, are potentially "running out of text in the universe" on which to train.

He explained that the technology behind AI bots, which rely on vast amounts of text data, is "starting to hit a brick wall".

Russell shared this insight last week during an interview with the International Telecommunication Union, the UN's communications agency. He emphasised that there is a finite amount of digital text available for these language models to consume.

The implications of this text scarcity may influence the future practices of generative AI developers as they collect data and train their technologies.

Even so, he maintained his belief that AI will increasingly replace humans in language-dependent jobs, which he described during the interview as "language in, language out" tasks. His comments added to the ongoing discussion about the data acquisition practices of OpenAI and other developers of generative AI models.

Concerns have been raised by creators worried about their work being replicated without consent, as well as by social media executives dissatisfied with the unrestricted use of their platforms' data. Russell's observations drew attention to another potential vulnerability: the scarcity of text available for building these training datasets.

A study conducted in November by Epoch, a group of AI researchers, found that machine learning datasets are likely to exhaust all "high-quality language data" before 2026. The study defined "high-quality" language data as coming from sources such as "books, news articles, scientific papers, Wikipedia, and filtered web content".

Today's most popular generative AI tools, powered by large language models (LLMs), were trained on massive amounts of published text scraped from public online sources, including digital news platforms and social media websites. Such "data scraping" from the latter was, Elon Musk previously said, a contributing factor behind his decision to limit daily tweet views.

Russell highlighted in the interview that OpenAI, in particular, had to supplement its public language data with "private archive sources" to develop GPT-4, the company's most advanced AI model to date. In an email to Insider, however, he acknowledged that OpenAI has yet to disclose the exact training datasets used for GPT-4. Recent lawsuits filed against OpenAI allege that datasets containing personal data and copyrighted material were used to train ChatGPT. Notably, one prominent lawsuit, filed by 16 unnamed plaintiffs, asserts that OpenAI used sensitive data such as private conversations and medical records.

Another lawsuit, brought by comedian Sarah Silverman and two other authors, accuses OpenAI of copyright infringement because ChatGPT can generate accurate summaries of their work. Authors Mona Awad and Paul Tremblay filed a similar lawsuit against OpenAI in late June.
