Last year, Robb Knight figured out how Perplexity, an artificial intelligence search engine, was evading instructions not to crawl particular sites. Knight learned Perplexity would use an unlisted user agent to scrape and summarize pages on websites where its declared crawler was blocked. In my testing, I found the summaries were outdated by hours to days, indicating to me the pages were not being actively visited as though guided by a user. Aravind Srinivas, CEO of Perplexity, told Mark Sullivan, of Fast Company, it was the fault of a third-party crawler and denied wrongdoing.
This dispute was, I think, a clear marker in a debate concerning what control website owners have — or ought to have — over access to and interpretation of their websites, an issue that was recently re-raised in an article by Mike Masnick of Techdirt. Masnick explores Cloudflare’s scraper-gating services and Reddit’s blocking of the Internet Archive, and concludes the web is being cleaved in two:
There are plenty of reasons to be concerned about LLM/AI tools these days, in terms of how they can be overhyped, how they can be misused, and certainly over who has power and control over the systems. But it’s deeply concerning to me how many people who supported an open internet and the fundamental principles that underlie that have now given up on those principles because they see that some AI companies might benefit from an open internet.
The problem isn’t just ideological — it’s practical. We’re watching the construction of a fundamentally different internet, one where access is controlled by gatekeepers and paywalls rather than governed by open protocols and user choice. And we’re doing it in the name of stopping AI companies, even though the real result will be to concentrate even more power in the hands of those same large tech companies while making the internet less useful for everyone else.
This is a passionately argued article about a thorny issue. I, too, am saddened by an increasingly walled-off web, whether through payment gates or the softer barriers of logins and email subscriptions. Yet I think Masnick misses the mark here in ways he is usually careful to avoid.
In the second quoted paragraph above, for example, Masnick laments an internet “governed [less] by open protocols and user choice” than “controlled by gatekeepers”. These are presented as opposing qualities, but they are in fact complementary. Open protocols frequently contain specifications for authentication, allowing users and administrators to limit access. Robots.txt is an open standard that is specifically intended to communicate access rules. Thus, while an open web is averse to centralization and proprietary technologies, it does not necessarily mean a porous web. The open web does not necessarily come without financial cost to human users. I see no reason the same principle should not be applied to robots, too.
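To make that concrete: a robots.txt file is nothing more than a plain-text list of access rules addressed to particular crawlers. The sketch below is mine, not any specific site’s; the user-agent tokens are ones these companies have published for their crawlers, though any real deployment should be checked against current documentation, and the private path is a made-up example.

```
# Opt specific A.I. training crawlers out of the entire site.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

# Every other crawler may visit everything except a hypothetical private area.
User-agent: *
Disallow: /private/
```

Nothing about this is centralized or proprietary; it is simply a published set of rules a visitor is expected to respect.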
Masnick:
This illustrates the core problem: we’re not just blocking bulk AI training anymore. We’re blocking legitimate individual use of AI tools to access and analyze web content. That’s not protecting creator rights — that’s breaking the fundamental promise of the web that if you publish something publicly, people should be able to access and use it.
Masnick is entirely correct: people should be able to access and use it. They should be able to use any web browser they like, with whatever browser extensions and user scripts they desire. That does not necessarily extend to machines. The specific use case Masnick is concerned with is that he uses Lex, an A.I.-assisted writing tool, as a kind of editorial verification step. When he references some news sites, however, Lex is blocked from reading them and therefore cannot provide notes on whether Masnick’s interpretation of a particular article is accurate. “I’m not trying to train an A.I. on those articles”, Masnick writes. “I’m just asking it to read over the article, read over what I’ve written, and give me a sense” of whether they jibe.
That may well be the case, but the blame for mistrust lies squarely with artificial intelligence companies. The original sin of representatives of this industry was to believe they did not require permission to ingest a subset of the corpus of human knowledge and expression, nor did they need to offer compensation. They did not seem to draw hard ethical lines around what they would consume for training, either — if it was publicly available, it could become part of their model. Anthropic and Meta both relied on materials available at LibGen, many of which are hosted without permission. A training data set included fan-made subtitles, which can be treated as illicit derivative works. I cannot blame any publisher for treating these automated visitors as untrustworthy or even hostile because A.I. companies have sabotaged attempts at building trust. Some seem to treat the restrictions of a robots.txt file as mere suggestions to be worked around. How can a publisher be confident the user-initiated retrieval of their articles, as Masnick is doing, is not used for training in any way?
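The galling part is how little effort compliance takes. As a rough sketch, and assuming a crawler that actually wants to behave, Python’s standard library has shipped a robots.txt parser for years; the bot name and URLs below are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler name and site, purely for illustration.
USER_AGENT = "ExampleBot"

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()  # fetch and parse the site's published access rules

url = "https://example.com/articles/some-article"
if robots.can_fetch(USER_AGENT, url):
    print("Allowed: fetch the page, identifying honestly as", USER_AGENT)
else:
    print("Disallowed: a well-behaved crawler stops here")
```

Ignoring that answer, or switching to an unlisted user agent when the listed one is blocked, is not a technical limitation; it is a choice.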
Masnick is right, however, to be worried about how this is bifurcating the web. Websites like 404 Media have explicitly cited A.I. scraping as the reason for imposing a login wall. A cynical person might view this as a convenient excuse to collect ever-important email addresses and, while I cannot disprove that, it is still a barrier to entry. Then there are the unintended consequences of trying to impose limits on scraping. After Reddit announced it would block the Internet Archive, probably to comply with some kind of exclusivity expectations in its agreements with Google and OpenAI, it implied the Archive does not pass along the robots.txt rules of the sites in its collection. If a website administrator truly does not want the material on their site to be used for A.I. training, they would need to prevent the Internet Archive from scraping as well — and that would be a horrible consequence.
Of course, Reddit does not block A.I. scraping on principle. It appears to be a contractual matter, in which third parties pay the company some massive amount of money for access. Anthropic’s recently proposed settlement supposed a billion and a half dollars would sufficiently compensate the authors of the books it pirated. M.G. Siegler called this “pulling up a drawbridge” by setting a high cost floor that will lock out insufficiently funded competitors. Masnick worries about the same thing, predicting the ultimate winners of this will be “the same large tech companies that can afford licensing deals and that have the resources to navigate an increasingly complex web of access restrictions”.
To be sure, intellectual property law is a mess, and encouraging copyright maximalism will have negative consequences. The U.S. already has some of the longest copyright terms in the world, terms that have unfortunately spilled into Canada thanks to trade agreements. But A.I. organizations have not created a bottom-up, rebellious exploration of the limits of intellectual property law. They are big businesses with deep pockets exploiting decades of news, blogging, photography, video, and art. Nobody, as near as makes no difference, expected something they published online would one day feed the machines that now produce personalized Facebook slop.
Masnick acknowledges faults like these in his conclusion, but I do not think his proposed solutions are very strong:
None of this means we should ignore legitimate concerns about AI training or creator compensation. But we should address those concerns through mechanisms that preserve internet openness rather than destroy it. That might mean new business models, better attribution systems, or novel approaches to creator compensation. What it shouldn’t mean is abandoning the fundamental architecture of the web.
The “new business models” and “better attribution systems” are not elucidated here, but the compensation pitch seems like a disaster in the making to me. That pitch also comes from Masnick, and here is the nut of his explanation:
But… that doesn’t mean there isn’t a better solution. If the tech companies need good, well-written content to fill their training systems, and the world needs good, high-quality journalism, why don’t the big AI companies agree to start funding journalists and solve both problems in one move?
What Masnick proposes is that A.I. companies could pay journalists to produce new articles for their training data. Respectfully, this would be so insubstantial as to be worthless. To train their models, A.I. companies are ingesting millions of websites, tens of millions of YouTube videos, hundreds of thousands of books, and probably far more — the training data is opaque. It is almost like a perverse version of fair use. Instead of a small amount of an existing work becoming the basis of a larger body of work — like the quotes I am using and attributing in this article — this is a massive library of fully captured information. Any single piece is of little consequence to the whole, but the whole does not work as well without all those tiny pieces.
The output of a single journalist is inconsequential, an argument Masnick also makes: “[a]ny individual piece of content (or even 80k pieces of content) is actually not worth that much” in the scope of training a large language model. That line appears near the beginning of the same piece he concludes by arguing we need “novel approaches to creator compensation”. Why would A.I. companies pay journalists to produce a microscopic portion of the words training their systems when they have historically used billions — perhaps trillions — of freebies? There are other reasons I can think of why this would not work, but this one is the most obvious.
One thing that might help, not suggested by Masnick, is improving the controls available to publishers. Today marked the launch of the Really Simple Licensing standard, which offers publishers a way to define machine-readable licenses. These can be applied site-wide, sure, but also at a per-page level. It is up to A.I. companies to adhere to the terms, though with an exception: there are ways to permit access to encrypted material. This raises concerns about a growing proliferation of digital rights management, bringing me back to Masnick’s reasonable concern about a web increasingly walled off and accessible only to authorized visitors.
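I have not implemented RSL myself, so treat the fragment below as a loose, hypothetical sketch of the general idea (machine-readable terms attached to specific URLs) rather than the specification’s actual vocabulary; every element and attribute name here is an illustrative placeholder.

```
<!-- Hypothetical sketch only; not the actual RSL schema. -->
<license for="https://example.com/articles/2025/some-article/">
  <permits>search-indexing, human-reading</permits>
  <prohibits>ai-training</prohibits>
  <terms href="https://example.com/licensing/" contact="licensing@example.com"/>
</license>
```

A standard like this can express terms; it cannot enforce them, which is why the encrypted-material exception matters.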
I am not saying I have better ideas; I have nothing to add in that regard, and I appreciate that Masnick at least brought something to the table. I, too, am concerned about dividing the web. However, I think publishers are coming at this from a reasonable place. This is not, as Masnick puts it, a “knee-jerk, anti-A.I. stance” in which publishers impose restrictions because “[i]f it hurts A.I. companies, it must be good”. A.I. companies largely did this to themselves by raising billions of dollars in funding to strip-mine the public web without permission and, ultimately, with scant acknowledgement. I believe information should be freer than it is, that intellectual property hoarding is wrong, and that we are better when we build on top of each other. That is a fine stance for information reuse by fellow human beings. But the massive scale of artificial intelligence training demands a different standard.
In writing this article, I am acutely aware it will become part of a training data set. I could block those crawlers — I have blocked a few — but that is only partly the point. I simply do not know how much of the control I reclaim now will be relevant in the future, and I am sure the same is true of any real media organization. I write here for you, not for the benefit of building the machines producing a firehose of spam, scams, and slop. The artificial intelligence companies have already violated the expectations of even a public web. Regardless of the benefits they have created — and I do believe there are benefits to these technologies — they have behaved unethically. Defensive action is the only control a publisher can assume right now.