Meta Allegedly Trained A.I. On a Hundred Terabytes of Pirated Books ⇥ arstechnica.com
Even though a 2023 class action suit filed by authors against Meta has been shaky so far, some of the details in what is left of the suit are stunning. According to recently unsealed documents, Meta downloaded a hundred terabytes of pirated books.
Ashley Belanger, Ars Technica:
Supposedly, Meta tried to conceal the seeding by not using Facebook servers while downloading the dataset to “avoid” the “risk” of anyone “tracing back the seeder/downloader” from Facebook servers, an internal message from Meta researcher Frank Zhang said, while describing the work as in “stealth mode.” Meta also allegedly modified settings “so that the smallest amount of seeding possible could occur,” a Meta executive in charge of project management, Michael Clark, said in a deposition.
Now that new information has come to light, authors claim that Meta staff involved in the decision to torrent LibGen must be deposed again, because allegedly the new facts “contradict prior deposition testimony.”
Mark Zuckerberg, for example, claimed to have no involvement in decisions to use LibGen to train AI models. But unredacted messages show the “decision to use LibGen occurred” after “a prior escalation to MZ,” authors alleged.
It should surprise nobody that A.I. is trained on illicit material. Even if you believe A.I. training through bulk web scraping is a perfectly legitimate expression of fair use, it is obviously going to run across things that were posted illegally. There are entire blockbuster movies on video platforms; photos and books get reshared without permission constantly.
If Meta or any other A.I. company had bothered to license this data from its copyright holders, it would be less likely to ingest pirated material. That would, of course, be expensive and slow. But, as of writing, Meta posted the world’s seventh-highest earnings in its 2024 fiscal year: over $71 billion. I think it can afford to pay for the data it harvests.