The Internet Archive and Robots.txt (inkdroid.org)

Mark Graham of the Internet Archive:

Robots.txt files were invented 20+ years ago to help advise “robots,” mostly search engine web crawlers, which sections of a web site should be crawled and indexed for search.

Many sites use their robots.txt files to improve their SEO (search engine optimization) by excluding duplicate content like print versions of recipes, excluding search result pages, excluding large files from crawling to save on hosting costs, or “hiding” sensitive areas of the site like administrative pages. (Of course, over the years malicious actors have also used robots.txt files to identify those same sensitive areas!) Some crawlers, like Google, pay attention to robots.txt directives, while others do not.

Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large versions of files.
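For reference, the sort of robots.txt rules Graham describes might look something like this. The paths here are hypothetical, purely for illustration:

```
# Hypothetical example: the paths below are made up for illustration
User-agent: *
Disallow: /print/      # duplicate "print version" pages
Disallow: /search      # search result pages
Disallow: /admin/      # "hidden" administrative area
```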

Ed Summers:

Up until now the Internet Archive have used the robots.txt in two ways:

  • their ia_archiver web crawler consults a publisher’s robots.txt to determine what parts of a website to archive and how often

  • the Wayback Machine (the view of the archive) consults the robots.txt to determine what to allow people to view from the archived content it has collected.

If the Internet Archive’s blog post is read at face value it seems like they are going to stop doing these things altogether, not just for government websites, but for the entire web. While conversation in Twitter makes it seem like this is a great idea whose time has come, I think this would be a step backwards for the web and for its most preeminent archive, and I hope they will reconsider or take this as an opportunity for a wider discussion.
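To make Summers' first point concrete, here is a rough sketch in Python of the crawl-side check he describes: before fetching a page, a polite crawler asks the publisher's robots.txt whether its user agent is allowed. This is a generic illustration, not the Internet Archive's actual crawler code, and the site and URL are made up:

```python
# Generic sketch of a crawl-time robots.txt check, not the Internet Archive's
# actual crawler code. The example site and URL below are hypothetical.
import urllib.robotparser

AGENT = "ia_archiver"  # the user-agent token the Archive's crawler is known by

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the publisher's robots.txt

url = "https://example.com/some/page.html"
if rp.can_fetch(AGENT, url):
    print("allowed by robots.txt, archive:", url)
else:
    print("disallowed by robots.txt, skip:", url)
```

The Wayback Machine check Summers mentions is the same idea applied at playback time: before displaying an archived page, it consults the site's current robots.txt to decide whether to show it.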

I get where Graham is coming from here. The Internet Archive is supposed to be a snapshot of the web as it was at any given time, and if a robots.txt file prevents them from capturing a page or a section of a website that would normally be visible to a user, that impairs their mission.

But, much as I love the Internet Archive, I think Summers' criticism is entirely valid: ignoring robots.txt files would violate website publishers’ wishes. It’s as simple as that. Even though I wish FFFFOUND didn’t block the Internet Archive from capturing the site, I think the Archive should respect that request. Robots.txt is a simple, straightforward format that lets publishers designate which areas of their site are off-limits to scrapers and crawlers, and those wishes should be honored.
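And for a publisher that wants to opt out of archiving entirely, the whole exchange takes two lines. This is the general pattern, not FFFFOUND's actual file:

```
# General opt-out pattern, not FFFFOUND's actual robots.txt
User-agent: ia_archiver
Disallow: /
```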