Artificial Intelligence Should Bear Responsibility for Its Costs
Now it’s LLMs. If you think these crawlers respect robots.txt, then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo. They do so using random User-Agents that overlap with end-users, coming from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting to blend in with end-user traffic and to evade every attempt to characterize their behavior or block them.
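For reference, a robots.txt that tried to fence off those expensive endpoints might look like the sketch below. The paths are illustrative, not the actual URL layout of any particular forge; the point is that the crawlers described above ignore directives like these entirely.

```
# Illustrative robots.txt for a git forge (paths are hypothetical).
# A well-behaved crawler honors these rules; the bots described
# above do not.
User-agent: *
Disallow: /blame/      # git blame is expensive to compute per file
Disallow: /log/        # paginated history walks the entire repo
Disallow: /commit/     # one page per commit, per repo
Crawl-delay: 10        # ask for at most one request every ten seconds
```

And because the traffic arrives from tens of thousands of residential IPs making one request each, per-IP rate limiting fails just as robots.txt does.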
As curious and fascinating as I find many applications of generative artificial intelligence, I find it difficult to square that interest with the flagrantly unethical way these models have been trained. Server admins have to endure, and pay for, massive amounts of traffic from well-funded corporations, without compensation, all of which treat robots.txt as something to be worked around. Add to that the kind of copyright infringement that would cost an ordinary user thousands of dollars per file, and it is clear the whole system is morally bankrupt.
Do not get me wrong: existing intellectual property law is in desperate need of reform. Big, powerful corporations have screwed us all over by extending copyright terms. In Canada, no new works will enter the public domain for the next eighteen years, because we signed onto the Canada–United States–Mexico Agreement and extended copyright terms to match. But what artificial intelligence training proposes is a worst-of-both-worlds situation, in which some big businesses retain a tight grip on artists’ works, while others assume anything remotely public is theirs to seize.