OpenAI’s House Counsel to Be Deposed Over Deleted Pirated Material hollywoodreporter.com

Winston Cho, the Hollywood Reporter:

To rewind, authors and publishers have gained access to Slack messages between OpenAI’s employees discussing the erasure of the datasets, named “books 1 and books 2.” But the court held off on whether plaintiffs should get other communications that the company argued were protected by attorney-client privilege.

In a controversial decision that was appealed by OpenAI on Wednesday, U.S. District Judge Ona Wang found that OpenAI must hand over documents revealing the company’s motivations for deleting the datasets. OpenAI’s in-house legal team will be deposed.

Wang’s decision (PDF), to the extent I can read it as a layperson, examines OpenAI’s shifting story about why it erased the “books 1” and “books 2” datasets, apparently the only time possible training materials were deleted.

I am not sure it has yet been proven that OpenAI trained its models on pirated books. Anthropic settled a similar suit in September, and Meta and Apple are facing similar accusations. For practical purposes, however, it is trivial to find evidence suggesting it used pirated data in general: if you have access to its Sora app, enter any prompt followed by the word “camrip”.

“What is a camrip?”, a strictly law-abiding person might ask. It is a label added to a movie pirated in the old-fashioned way: by pointing a video camera at the screen in a theatre. As a result, these videos have a distinctive look and sound, which Sora reproduces perfectly. It is very difficult for me to see how OpenAI could have trained this model to understand what a camrip is without feeding it a bunch of them, and I do not know of a legitimate source for such videos.