A Popular A.I. Training Set Includes Data from OpenSubtitles theatlantic.com

Alex Reisner, the Atlantic:

I can now say with absolute confidence that many AI systems have been trained on TV and film writers’ work. […]

The files within this data set are not scripts, exactly. Rather, they are subtitles taken from a website called OpenSubtitles.org. Users of the site typically extract subtitles from DVDs, Blu-ray discs, and internet streams using optical-character-recognition (OCR) software. Then they upload the results to OpenSubtitles.org, which now hosts more than 9 million subtitle files in more than 100 languages and dialects. […]

The Atlantic has built a search engine for the subtitles used in training. These subtitles are in addition to, but part of the same data set as, the YouTube subtitles.

The files provided by websites like OpenSubtitles are, to my knowledge, not exactly legal. Courts in Australia and the Netherlands have treated them as derivative works that infringe the underlying copyright. I am not arguing this is correct; fan-created subtitles are useful and offer more translation options. But it is noteworthy that these models were trained not only on original works used without explicit permission, but also on derivative works created illegally.

Put it this way: would it be right for models that generate movies to be trained on a corpus of pirated films, or for music models to be trained on someone’s LimeWire collection? It arguably does not matter whether copyright holders were paid for the single copy used in training, since in either case it is a derivative created without permission. But it feels a tiny bit worse to know generative models were trained using illicit subtitles instead of quasi-legitimate ones.