Automattic Is Doing Some Weird Stuff With Users’ Public Data ⇥ 404media.co
Jason Koebler and Samantha Cole, 404 Media:
Almost every platform has some sort of post “firehose,” API, or way of accessing huge amounts of user posts. Famously, Twitter and Reddit used to give these away for free. Now they do not, and charging access for these posts has become big business for those companies. This is just to say that the existence of Automattic’s firehose is not anomalous in an internet ecosystem that trades on data. But this firehose also means that the average user doesn’t and can’t know what companies are getting direct access to their posts, and what they’re being used for.
I am not particularly surprised to learn that public posts on WordPress.com blogs are part of a massive feed, but I am shocked it is not more widely known that self-hosted WordPress sites with Jetpack installed are automatically opted into it as well. For something as popular as Jetpack — over five million active installations, according to its WordPress.org plugin page — I was surprised by how infrequently this has been mentioned: aside from privacy policies and official documentation, I found a 2013 article on The Next Web, a Reddit comment from a few years ago, and a handful of content marketers suggesting it helps with search optimization.
After avoiding questions from 404 Media, Automattic now says it is “winding down” firehose access.
Samantha Cole, 404 Media:
Tumblr and WordPress.com are preparing to sell user data to Midjourney and OpenAI, according to a source with internal knowledge about the deals and internal documentation referring to the deals.
The exact types of data from each platform going to each company are not spelled out in documentation we’ve reviewed, but internal communications reviewed by 404 Media make clear that deals between Automattic, the platforms’ parent company, and OpenAI and Midjourney are imminent.
- We currently block, by default, major AI platform crawlers — including ones from the biggest tech companies — and update our lists as new ones launch.

[…]

- We are also working directly with select AI companies as long as their plans align with what our community cares about: attribution, opt-outs, and control.

- We will only share public content that’s hosted on WordPress.com and Tumblr, and only from sites that haven’t opted out.

- We are not including content from sites hosted elsewhere even if they use Automattic plugins like Jetpack or WooCommerce.
I am not sure which crawlers are currently being blocked or how that is being accomplished, but it does not appear to be in WordPress blogs’ robots.txt files.
The New York Times comprehensively blocks known machine learning crawlers, which you can verify by viewing its robots.txt file; the crawlers we are interested in are listed near the bottom, just above all the sitemaps. That is also true for Tumblr. But when I checked a bunch of WordPress.com sites at random — by searching “site:wordpress.com inurl:2024” — I found much shorter automatically generated robots.txt files, similar to WordPress’ own. I am not sure why I could not find a single WordPress.com blog with the same opt-out signal.
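For reference, the opt-out signal works per user-agent token, and it is easy to check mechanically. A minimal sketch in Python using the standard library’s robots.txt parser, with an abbreviated, illustrative version of the kind of directives the Times publishes (the crawler tokens are real, but this is not the Times’ actual file):

```python
from urllib.robotparser import RobotFileParser

# Abbreviated, illustrative robots.txt in the style of sites that
# opt out of AI crawling; not copied from any real site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

def blocked_crawlers(robots_txt: str, crawlers: list[str]) -> list[str]:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # A crawler counts as blocked if it may not fetch the site root.
    return [ua for ua in crawlers if not parser.can_fetch(ua, "/")]

print(blocked_crawlers(ROBOTS_TXT, ["GPTBot", "CCBot", "Googlebot"]))
# → ['GPTBot', 'CCBot']
```

To test a live site, the same parser can be pointed at a real URL with `RobotFileParser("https://example.com/robots.txt")` followed by `.read()`. The shorter auto-generated files I found on WordPress.com blogs contain no such per-crawler `Disallow` entries.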
What is implied in Automattic’s disclosure is that it is preparing to switch Tumblr and WordPress.com blogs from the current opt-in model to an opt-out one. Both platforms have been popular among artists, and I am not sure those users would expect their contributions to become fodder for machines.
Then again, that is true for everybody who has ever posted anything on the web: it is all training data now, unless you can explicitly say otherwise.