Google Leaked Itself ⇥ sparktoro.com

Rand Fishkin, writing on the SparkToro blog:

On Sunday, May 5th, I received an email from a person claiming to have access to a massive leak of API documentation from inside Google’s Search division. The email further claimed that these leaked documents were confirmed as authentic by ex-Google employees, and that those ex-employees and others had shared additional, private information about Google’s search operations.

It seems Google erroneously published this vast amount of information to a GitHub repository in March, and then removed it earlier this month. As Fishkin writes, it is evidence Google has been dishonest in its public statements about how Google Search works.

Fishkin specifically calls attention to media outlets that cover search engines and value the word of Google’s spokespeople. This has been a clever play by Google for years: because its specific ranking criteria have not been publicly known, it can confirm or deny rumours without having to square them with what the evidence shows.

Google’s ranking system seems to be biased in favour of larger businesses and more established websites, according to Fishkin’s analysis. This is not surprising. I am wondering how this fits with the declining quality of Google search results as small, highly optimized pages full of machine-generated junk seem to rise to the top.

Mike King, iPullRank:

You’d be tempted to broadly call these “ranking factors,” but that would be imprecise. Many, even most, of them are ranking factors, but many are not. What I’ll do here is contextualize some of the most interesting ranking systems and features (at least, those I was able to find in the first few hours of reviewing this massive leak) based on my extensive research and things that Google has told/lied to us about over the years.

“Lied” is harsh, but it’s the only accurate word to use here. While I don’t necessarily fault Google’s public representatives for protecting their proprietary information, I do take issue with their efforts to actively discredit people in the marketing, tech, and journalism worlds who have presented reproducible discoveries. My advice to future Googlers speaking on these topics: Sometimes it’s better to simply say “we can’t talk about that.” Your credibility matters, and when leaks like this and testimony like the DOJ trial come out, it becomes impossible to trust your future statements.

One of the things potentially tracked by Google for search purposes is Chrome browsing data, something Google has denied. The variable in question — chromeInTotal — and the minimal description offered — “site-level Chrome views” — seem open to interpretation. Perhaps this is only recorded in some circumstances, or it depends on user preferences, or is not actually part of search rankings, or is entirely unused. But it certainly suggests aggregate website visits in Chrome, the world’s most popular web browser, are used to inform rankings without users’ knowledge.

Update: Google says the leaked documents are real, but warns “against making inaccurate assumptions”. In fairness, I would like to make more accurate assumptions.