Reddit to Block the Internet Archive Due to Unauthorized Scraping ⇥ theverge.com

Jay Peters, the Verge:

Reddit says that it has caught AI companies scraping its data from the Internet Archive’s Wayback Machine, so it’s going to start blocking the Internet Archive from indexing the vast majority of Reddit. The Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles; instead, it will only be able to index the Reddit.com homepage, which effectively means Internet Archive will only be able to archive insights into which news headlines and posts were most popular on a given day.

Surely, this has something to do with Reddit’s decision to license the data created by its users, as Peters writes, but it also puts the Internet Archive in an uncomfortable middle seat with a massive trove of third-party data. Unfortunately for many publishers, the Archive seems to be perfectly happy with scrapers and is unbothered if its collection is used to train artificial intelligence. While the Wayback Machine preserves a copy of a website’s robots.txt file, that evidently does not stop scrapers from collecting the archived pages, so any publisher serious about restricting A.I. training on their material must also block the Internet Archive or risk the same thing happening to them. If many publishers do, that would be a terrible loss for all of us.