Friday, November 22, 2024

Google’s precious Search recipe exposed in huge leak (Update: Google’s statement)

Robert Triggs / Android Authority

TL;DR

  • Many of the ranking factors involved in Google Search’s super-secret algorithm have purportedly leaked.
  • This leak sheds light on how Google Search seemingly operates and which attributes it uses to rank content on the Search Engine Results Page.
  • However, the findings from the leaked document do not align with Google’s statements on these topics over the years.

Update: May 30, 2024 (12:22 AM ET): In a statement shared with The Verge, Google has confirmed, in a roundabout way, that the leaked documents are real, though it cautions that they may be outdated or incomplete.

We would caution against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information. We’ve shared extensive information about how Search works and the types of factors that our systems weigh, while also working to protect the integrity of our results from manipulation.


Original article: May 29, 2024 (04:12 AM ET): Before Google went all in on AI Overviews and Search Generative Experience, it relied heavily on answering users’ queries through the conventional Google Search results we all know. A lot goes on behind the scenes to answer those queries, though. Google has always kept the secret sauce of its search ranking close to its heart, offering websites best-practice guidance instead. Now, a leak claims to have unraveled the truth behind Google’s highly coveted Search algorithm, and in many ways, it suggests that the company’s guidance doesn’t match what it seemingly checks for.

The news: Google’s Search algorithm has purportedly leaked

SparkToro claims to have accessed more than 2,500 pages of API documentation that originate from Google’s internal “Content API Warehouse.”

The report mentions that the documentation was inadvertently leaked on GitHub in March 2024 but then removed. However, you can spot copies of v0.4.0 and v0.5.0 of google_api_content_warehouse on Hexdocs (we at Android Authority are unable to verify the authenticity of these leaked documents, so reader discretion is advised).

The documentation appears to be part of Google Search’s secret sauce, aka the algorithm. It doesn’t directly show the weight that Search’s ranking system assigns to different characteristics of a website or its content, but it does show the details that Google collects from websites and web pages. SparkToro, which published the initial report, then worked with iPullRank to analyze the purported APIs.

This leak may be the most significant insight yet into how Google Search actually works. Surprisingly, it also contradicts much of what Google has publicly stated. To appreciate the discrepancy, we’ll have to look at Google Search’s behind-the-scenes workings.

The background: What happens behind the scenes when you Google Search?

Circle to Search on a Pixel 7 Pro

Hadlee Simons / Android Authority

A Google Search query may seem like an innocent and inconsequential action to a consumer like you, but it oils the wheels of a multi-million-dollar industry. So, to grasp the gravity of the leak, it is crucial to understand what happens when you run a Google Search.

The basics: Search engines, web crawling, web indexing, and ranking search results

When users have questions that they want answered on the internet, they turn to a website called a “search engine.” They input a query for the search engine to look up, and the search engine presents them with results that hopefully answer their question. Simple, right?

Behind the scenes, the search engine does a lot of work, but it can be broken down into three main tasks:

  • Crawling: A search engine needs to know the entirety of the internet’s data to find out who is answering what and what is answered where. For this, a search engine “crawls” the entire internet, i.e., it visits every single website and webpage.
  • Indexing: The pages the crawler has visited are analyzed for their data and content, and this information is stored in an easy-to-retrieve manner.
  • Ranking: Since hundreds or even thousands of websites are trying to answer the same query, there needs to be a system that decides whose answer is presented to the user first. This is commonly referred to as a ranking system. Its most visible outcome is the position at which a website appears on a search engine results page (SERP).

The ranking system decides who is placed in the first spot, who is placed on the first page, which combinations of search terms a specific article ranks for, and so on.
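The three tasks above can be sketched in a few lines of Python. This is a toy illustration only, not how Google Search works: the three-page "web", the URLs, and the word-count score are all assumptions made up for the example, and a real ranking system weighs far more signals.

```python
from collections import defaultdict

# Hypothetical three-page "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example/home": ("best phone reviews and phone deals",
                       ["a.example/pixel", "b.example/blog"]),
    "a.example/pixel": ("pixel phone review camera battery", []),
    "b.example/blog": ("travel blog with camera tips", ["a.example/home"]),
}

def crawl(start):
    """Crawling: follow links from a start page until no new pages turn up."""
    seen, queue = [], [start]
    while queue:
        url = queue.pop(0)
        if url in seen or url not in TOY_WEB:
            continue
        seen.append(url)
        queue.extend(TOY_WEB[url][1])
    return seen

def build_index(urls):
    """Indexing: map each word to the pages that contain it, with a count."""
    index = defaultdict(dict)
    for url in urls:
        for word in TOY_WEB[url][0].split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def rank(index, query):
    """Ranking: order pages by how often the query words appear (toy score)."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] += count
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda u: (-scores[u], u))

pages = crawl("a.example/home")
index = build_index(pages)
results = rank(index, "phone review")
print(results)
```

Even in this toy version, the order of `results` depends entirely on the scoring rule inside `rank`, which is why the real rule is so closely guarded.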

Why does ranking matter on Google Search or any search engine?

Google Search (or simply Google) is the largest search engine in the world, controlling the vast majority of the search traffic that routes through the internet. Just go ahead and count the number of search queries you run in an ordinary day or week and multiply that by billions of people worldwide. Suddenly, you can see why search engines are often called the traffic signals of the internet, as they have the power to route massive internet traffic down your internet road if you do things right.

There’s enormous potential to make money for your business if you sit in the first position of a popular SERP. Most users click only on the first result, and traffic drops off by orders of magnitude further down the list.

Google Search results for Best Phone

Aamir Siddiqui / Android Authority

Do you recall the last time you scrolled on Google Search and clicked on the second, third, fourth, or fifth result? You only do that when the first result doesn’t satisfy you, and more often than not, you’d be changing and refining your search query before you even go through all the results on the first page of Google Search.

Do you recall the last time you went to page two of a Google Search result? You actually won’t, as Google has removed pagination and opted for a continuous scroll for Search. But the truth is that most users don’t go beyond the first handful of answers. They are either satisfied or have changed their query.

Google’s secret sauce: The Google Search algorithm

Search Trip ideas with generative AI

So, there’s a lot of pressure to do things right. But how do you do things right?

It would be nice to have a look at Google’s ranking system, aka the Google Search algorithm. That way, websites can do exactly what Google is looking for. They can then consistently rank at the top of search queries, get billions of views, and make millions of dollars.

But therein lies the problem: if everyone knew what Google is looking for, websites would have a very strong incentive to game the results to the detriment of the end-user experience, because millions of dollars of ad and affiliate revenue would be at stake.

Until recently, most of us would have agreed that Google Search was our primary means of finding new information online. Whatever Google has been doing with its secret sauce has been working.

Google’s public recipe: E-E-A-T guidelines for people-first content

Instead of directly publishing its secret sauce, Google publishes a public recipe in the form of content guidelines for websites like ours to follow when we publish our content.

There is a lot of depth to them built up over the years, but Google has always strongly advised creating “people-first content,” aka content for end users, instead of content for a search engine. Google wants you to leave the heavy lifting of ranking to the Search algorithm and just focus on creating content that demonstrates aspects of experience, expertise, authoritativeness, and trustworthiness, or E-E-A-T.

Google preaches creating content for people and not for search engines. The industry operates otherwise for obvious reasons.

The idea is that if you follow E-E-A-T, Google Search will have an easier time identifying your content as good content and ranking it accordingly. It’s not the actual, direct secret sauce, but it’s your best shot at it.

The problem: What Google says does not match what Google seemingly does

Over the years, website owners have complained that their traffic has eroded despite following all the best practices for creating people-first content, as outlined in Google’s E-E-A-T content guidelines. People officially involved with Google Search have, in turn, made on-the-record comments about what they do and what website owners should or shouldn’t do.

The problem is that the purported leak of Google’s secret sauce Search algorithm does not align with those guidelines or with what Google itself has said over the years.

iPullRank says the following:

“Lied” is harsh, but it’s the only accurate word to use here. While I don’t necessarily fault Google’s public representatives for protecting their proprietary information, I do take issue with their efforts to actively discredit people in the marketing, tech, and journalism worlds who have presented reproducible discoveries.

As the initial analysis from iPullRank and SparkToro highlights, this purported algorithm leak contradicts Google’s own words:

  • Domain authority: Google has maintained that it does not use the concept of sitewide “overall domain authority” for ranking SERPs, but the leaked docs suggest that Google computes a characteristic called “siteAuthority.”
  • Using Chrome data for ranking: Google has said that it does not use Google Chrome data as part of organic search. The leaked docs include a few Chrome-related measurement attributes.
  • Clicks: Google Search officials have denied using clicks directly in SERP rankings, but there is plenty of evidence, even beyond the leak, that it does use them as a measure of success. The docs reveal more of the same: Google does have a “click and impression signal” system, which further includes factors like “date of last good click,” and measures results that had the “longest click during the session,” and more.
  • New website sandbox: Google has maintained that there is no sandbox in which websites are segregated based on age or lack of trust signals. The leaked docs include an attribute called “hostAge” that is used specifically to “sandbox fresh spam in serving time.”
  • Authors: Google has maintained that author bylines should be available for reader benefit, not for Google, as they do not impact SERP rankings. The leaked documents indicate that Google at least collected author data on pages, though they stopped short of confirming if it was a ranking metric.

There’s good reason for Google to keep its sauce secret. The problem comes from Google’s willingness to misdirect instead of simply refusing to comment.

Other significant findings from the leaked docs include:

  • Freshness matters: Google looks at dates in bylines, URLs, etc.
  • Links matter: Google looks at link anchors, relevance, and diversity.
  • Branding matters: Branding beyond Google’s ecosystem matters.
  • Change history matters: Google keeps a copy of every version of every page it has ever indexed. However, only the last 20 changes are used.
  • Demotion: Content can be demoted for factors such as links not matching the target site, porn, and more.
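The change-history finding describes a familiar pattern: archive everything, but consult only a bounded window of recent entries. The sketch below is a minimal illustration of that pattern, not Google’s implementation. The `PageHistory` class and its fields are hypothetical names, and only the window size of 20 comes from the reported finding.

```python
from collections import deque

class PageHistory:
    """Hypothetical record: archive every version, expose only the last N."""

    def __init__(self, window=20):
        self.archive = []                   # every version ever recorded
        self.recent = deque(maxlen=window)  # only the newest `window` versions

    def record(self, content):
        self.archive.append(content)
        self.recent.append(content)         # oldest entry drops out once full

history = PageHistory()
for n in range(1, 26):
    history.record(f"version {n}")

# 25 versions archived, but only versions 6 through 25 remain in the window.
print(len(history.archive), len(history.recent), history.recent[0])
```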

The leaked documents are enormous, and we’ll likely see the SEO and content industry pore over all of them in the coming weeks. Numerous theses will be written on exactly how Google Search works and how websites should evolve to succeed in SERP rankings. It’s great to learn more about the inner workings of Google Search, but I am fully aware that incomplete knowledge here is a double-edged sword.

However, if these leaked documents reinforce one thing, it is that Google keeps its Search secret sauce close to its heart, and one should remain skeptical of what company officials say about it on the record. Google has yet to deny the veracity of these leaked documents.

