Apple, NVIDIA and Anthropic reportedly used YouTube transcripts without permission to train AI models


Some of the world’s biggest tech companies trained AI models without permission on a dataset containing the transcripts of more than 173,000 YouTube videos, new research from Proof News found. Created by a nonprofit called EleutherAI, the dataset contains transcripts of YouTube videos from more than 48,000 channels and has been used by Apple, NVIDIA, and Anthropic, among other companies. The findings bring into focus a disturbing reality of artificial intelligence: the technology is largely built on the back of data taken from creators without their consent or compensation.

The dataset does not include any videos or images from YouTube, but it does include video transcripts from the platform’s biggest creators, including Marques Brownlee and MrBeast, as well as major news outlets such as The New York Times, the BBC, and ABC News. Subtitles from Engadget videos are also part of the dataset.

“Apple has acquired data for artificial intelligence from several companies,” Brownlee wrote in a post on X. “One of them scraped tons of data/transcripts from YouTube videos, including mine,” he said. “It’s going to be an evolving problem for a long time.”

YouTube, Apple, NVIDIA, Anthropic and EleutherAI did not respond to Engadget’s request for comment.

So far, AI companies have not been transparent about the data used to train their models. Earlier this month, artists and photographers criticized Apple for not disclosing the source of the training data for Apple Intelligence, the company’s generative artificial intelligence features coming to millions of Apple devices this year.

YouTube in particular, the world’s largest video repository, is a goldmine of not only transcripts but also audio, video and images, making it an attractive source for training AI models. Earlier this year, OpenAI’s chief technology officer, Mira Murati, dodged questions from The Wall Street Journal about whether the company used YouTube videos to train Sora, OpenAI’s upcoming AI video creation tool. “I’m not going to go into the details of the data that was used, but it was publicly available or licensed data,” Murati said. Both YouTube CEO Neal Mohan and Alphabet CEO Sundar Pichai have said that companies using YouTube data to train their AI models is a violation of the platform’s terms of service.

If you want to see if subtitles from your YouTube videos or favorite channels are part of the dataset, go to Proof News’ search tool.




