Publishers Are Restricting Internet Archive Access, Attempting to Avoid A.I. Scraping ⇥ niemanlab.org
Andrew Deck and Hanaa’ Tameez, NiemanLab:
“A lot of these AI businesses are looking for readily available, structured databases of content,” he [the Guardian’s Robert Hahn] said. “The Internet Archive’s API would have been an obvious place to plug their own machines into and suck out the IP.” (He admits the Wayback Machine itself is “less risky,” since the data is not as well-structured.)
As news publishers try to safeguard their contents from AI companies, the Internet Archive is also getting caught in the crosshairs. The Financial Times, for example, blocks any bot that tries to scrape its paywalled content, including bots from OpenAI, Anthropic, Perplexity, and the Internet Archive. The majority of FT stories are paywalled, according to director of global public policy and platform strategy Matt Rogerson. As a result, usually only unpaywalled FT stories appear in the Wayback Machine because those are meant to be available to the wider public anyway.
Hahn may find the Wayback Machine “less risky” than the official API, but the prospect of scraping via the Wayback Machine is exactly the reason Reddit cited when it blocked the Internet Archive last year. I feared this outcome was likely. Publishers’ understandable desire to control how their work is used is going to make the Internet Archive less useful, because neither A.I. scrapers nor the Internet Archive applies a domain’s robots.txt rules once the content is being served from an archival site.
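To make the gap concrete, here’s a minimal sketch of the check that, as far as I know, nobody performs today: map a Wayback Machine snapshot URL back to the original URL and ask the original domain’s robots.txt whether a given crawler (GPTBot, say) would have been allowed to fetch it. The helper names (`original_url`, `allowed_at_origin`) are mine, not any real API, and the parsing assumes the common `web.archive.org/web/<timestamp>/<url>` snapshot format.

```python
# Sketch: would the *original* domain's robots.txt permit this crawler
# to fetch the page that an archival snapshot is serving?
import re
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Assumes the standard Wayback snapshot URL shape, with an optional
# modifier suffix on the timestamp (e.g. "id_").
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/\d+[a-z_]*/(?P<orig>https?://.+)$"
)

def original_url(wayback_url: str) -> str | None:
    """Extract the original URL embedded in a Wayback snapshot URL."""
    m = WAYBACK_RE.match(wayback_url)
    return m.group("orig") if m else None

def allowed_at_origin(wayback_url: str, user_agent: str) -> bool | None:
    """Check the original domain's live robots.txt for the archived path.

    Returns None if the URL isn't a recognizable Wayback snapshot.
    """
    orig = original_url(wayback_url)
    if orig is None:
        return None
    parts = urlsplit(orig)
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetch and parse robots.txt at the original domain
    return rp.can_fetch(user_agent, orig)

if __name__ == "__main__":
    snapshot = "https://web.archive.org/web/20240101000000/https://www.ft.com/content/example"
    # False if ft.com's robots.txt disallows GPTBot, as the FT's blocking
    # policy described above suggests it would.
    print(allowed_at_origin(snapshot, "GPTBot"))
```

Note that even this sketch can only consult the *current* robots.txt, which points at the deeper problem: rules a publisher adds today say nothing about snapshots already sitting in the archive, so blocking the Internet Archive outright becomes the only lever publishers feel they have.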