YouTube Subtitles Included in Large Data Set Used to Train Notable A.I. Models (proofnews.org)

Annie Gilbertson and Alex Reisner, Proof:

AI companies are generally secretive about their sources of training data, but an investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube’s rules against harvesting materials from the platform without permission.

Our investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple, and Salesforce.

According to Gilbertson and Reisner, the data set is called, appropriately enough, “YouTube Subtitles”. It is part of a larger set called the “Pile”, which is distributed by EleutherAI. Apple used the “Pile” to train OpenELM.

Chance Miller, 9to5Mac:

Apple has now confirmed to 9to5Mac, however, that OpenELM doesn’t power any of its AI or machine learning features – including Apple Intelligence.

Lance Ulanoff, TechRadar:

While not speaking directly to the issue of YouTube data, Apple reiterated its commitment to the rights of creators and publishers and added that it does offer websites the ability to opt out of their data being used to train Apple Intelligence, which Apple unveiled during WWDC 2024 and is expected to arrive in iOS 18.

The company also confirmed that it trains its models, including those for its upcoming Apple Intelligence, using high-quality data that includes licensed data from publishers, stock images, and some publicly available data from the web. YouTube’s transcription data is not intended to be a public resource but it’s not clear if it’s fully hidden from view.

Even if you set aside the timing of allowing people to opt out, the opt-out itself scarcely matters in this case. If YouTube captions were part of the data set used to train any part of Apple Intelligence, it would be impossible for channel operators to opt out. A robots.txt file applies to a whole site and is controlled by whoever runs it, so only Google can write the rules for youtube.com; individual channels cannot set their own instructions.
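
As a rough illustration of that constraint, here is a minimal Python sketch using only the standard library. The channel URLs and the crawler name `ExampleTrainingBot` are hypothetical, made up for this example, and the `robots_url` helper is mine rather than anything from the reporting. It assumes the usual convention that a crawler looks for robots.txt at the root of a host.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Return the robots.txt location that governs page_url (root of its host)."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# Hypothetical channel URLs, used only for illustration.
channel_a = "https://www.youtube.com/@example-channel-a"
channel_b = "https://www.youtube.com/@example-channel-b"

# Both channels are governed by the same file on the same host.
print(robots_url(channel_a))
print(robots_url(channel_b))

# What an opt-out looks like for someone who does control a site's robots.txt.
# "ExampleTrainingBot" is a made-up crawler name, not a real user agent.
rules = RobotFileParser()
rules.parse([
    "User-agent: ExampleTrainingBot",
    "Disallow: /",
])
blocked = rules.can_fetch("ExampleTrainingBot", "https://example.com/post")
print(blocked)
```

Both channel URLs resolve to the one file at the root of www.youtube.com, which is why a channel operator has nowhere to put a rule like the one in the second half of the sketch.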

Five New York Times reporters wrote in April about the tension OpenAI created after it began transcribing YouTube videos:

Some Google employees were aware that OpenAI had harvested YouTube videos for data, two people with knowledge of the companies said. But they didn’t stop OpenAI because Google had also used transcripts of YouTube videos to train its A.I. models, the people said. That practice may have violated the copyrights of YouTube creators. So if Google made a fuss about OpenAI, there might be a public outcry against its own methods, the people said.

I could not find any mechanism to opt one’s own YouTube videos out of A.I. training. This is one of the problems of YouTube being a singular destination for general-purpose online video: it has all the power and, by extension, so does Google.

By the way, I am still waiting for someone in Cupertino to check the Applebot inbox.