The Google algorithm and its history, is ranking just mathematics?

The founders of Google realized, from the very first implementation of their search engine, that ranking pages by the number and quality of the links each page on the web received would produce better results than the existing techniques. Translating this idea into mathematics, however, eventually brought into the algorithm parameters that do not derive from algebraic and probabilistic calculations alone.

In the early days of the Internet, the main concern of anyone who wanted to catalog the growing number of sites on the network was "matching", that is, the correspondence between the topics covered and the category in which each site was placed. Then came the need for the search engine, a tool that spared users the task of leafing through every page of a site in search of the specific topic of interest. Before Google, the search engine landscape was populated by a notable number of alternatives that let you find information starting from simple keywords. To order the pages containing that information, the most common system built a ranking of sources based on the number of times a search term appeared on each page.

Sergey Brin and Larry Page, the founders of Google, decided from the start that their search engine should offer something extra and therefore concentrated precisely on the ranking function. The goal was to produce results that were not only relevant but also authoritative: a ranking of reliable sources of correct information with a strong presence on the web. Their mathematical studies drew their attention to the Markov chain, a tool of probability theory that describes the state of a system at time t and predicts its transition towards another state based only on the state at the immediately preceding time. In this way the links from web pages to other web pages can be modeled as state transitions, giving each link a weight based on the number and authority of the pages it comes from. The analogy is evident in the diagrams representing the connections between states in Markov processes: the numbers quantify the probability of the process moving from one state to another and the arrows indicate the direction of that change.
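As a toy illustration of the tool itself (the states and probabilities below are invented for the example, not taken from Google), a Markov chain can be written as a transition matrix and a probability distribution that is updated one step at a time:

```python
import numpy as np

# Hypothetical three-state Markov chain: row i gives the probability of
# moving from state i to each other state (every row sums to 1).
P = np.array([
    [0.10, 0.60, 0.30],   # from state A
    [0.40, 0.40, 0.20],   # from state B
    [0.50, 0.25, 0.25],   # from state C
])

# All the probability mass starts on state A.
x = np.array([1.0, 0.0, 0.0])

# One transition: the distribution at time t+1 depends only on time t.
print(x @ P)   # -> [0.1 0.6 0.3]

# Iterating the step makes the distribution settle towards a fixed point,
# the stationary distribution of the chain.
for _ in range(50):
    x = x @ P
print(x.round(3))
```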

Thanks to the work of Brin and Page, which was certainly not the only research moving in this direction, the paradigm of ranking algorithms changed radically from 1998 onwards and settled, almost definitively, on "Link Analysis Ranking", in which hypertext structures are used to classify web pages. In a certain sense, a link from page Y to page X can be seen as an endorsement of the quality of page X. The job of the ranking function is to extract this information and produce a ranking that reflects the relative authority of the pages.

At the beginning of the 2000s Google was not yet the most used search engine in the world; algorithms based on obsolete ranking systems survived, as did "web directories", portals in which resources are organized by thematic area and presented as indexes or as trees that branch into more specific nodes. Matrices, probability distributions, vectors and stochastic processes are at the center of the description of the PageRank patent, granted in 2001, by Brin and Page. Once the values of the starting variables have been set, the algorithm is able to generate a ranking of results for each key phrase. Algebraic and probabilistic calculations govern the positioning of pages on the world wide web. Yet something is wrong.
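As a minimal sketch of the idea behind PageRank, and not of the patented implementation, a tiny link graph can be turned into a column-stochastic matrix and ranked by power iteration; the graph, tolerance and iteration limit below are chosen only for the example (0.85 is the damping value suggested in the original PageRank paper):

```python
import numpy as np

# Toy web of four pages: links[i] lists the pages that page i links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = len(links)

# Column-stochastic matrix M: M[j, i] is the probability of following
# a link from page i to page j.
M = np.zeros((n, n))
for i, outgoing in links.items():
    for j in outgoing:
        M[j, i] = 1.0 / len(outgoing)

d = 0.85                      # damping factor
rank = np.full(n, 1.0 / n)    # start from a uniform distribution

# Power iteration: repeat the update until the ranking stops changing.
for _ in range(100):
    new_rank = (1 - d) / n + d * M @ rank
    converged = np.abs(new_rank - rank).sum() < 1e-9
    rank = new_rank
    if converged:
        break

# Pages ordered from most to least authoritative.
print(sorted(range(n), key=lambda p: rank[p], reverse=True), rank.round(3))
```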

Google's staff realize that they have essentially handed out the instructions for reaching the top of its rankings by deceiving the algorithm. Content creators follow those instructions to the letter: they compete to exchange authoritative links and fill their pages with keywords related to certain topics, but then, on the site, they talk about something else. Some use "trending topics" (the most searched subjects) as bait to sell products and services. Criminals use them to spread computer viruses. In short: spam, previously conveyed almost exclusively via email, has definitively landed on the web. But the incorrectness is not only the more or less obvious kind of spam (which still claims a significant number of victims). The worlds of advertising, corporate communication, copywriting and even journalism and entertainment jostle for "a front row seat", and they do so by focusing, perhaps too much, on the concept that will later be identified by the acronym SEO (Search Engine Optimization).

A new algorithm is needed, or perhaps human intervention. Google opts for a mix of the two and in 2005 buys from Yahoo! the patent for TrustRank, a link analysis algorithm capable of distinguishing spam pages from those with "useful" content. TrustRank is paired with PageRank and is based, in part, on a human factor: Google's "quality raters". Their intervention is indirect, not immediate; it serves to correct results that a group of people (who are not Google employees) judge unsatisfactory. The leap is made. Mathematics steps aside a little and the human brain comes into play. Google's quality raters provide ratings according to precise guidelines, but they represent real users with real information needs and rely on human judgment rather than on the outcome of mathematical or probabilistic calculations. All of this is fed into the algorithm through parameters that can be summarized, once again, in an acronym: EAT (Expertise, Authoritativeness, Trustworthiness), that is, competence, authority and reliability.
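A hedged sketch of the seed-propagation idea described in the TrustRank literature (the link graph, the seed set and the damping value are invented for the example; this is not Google's implementation): trust starts on a few human-vetted pages and flows along their outgoing links.

```python
import numpy as np

# Toy link graph: page i links to the pages in links[i].
links = {0: [1, 2], 1: [2], 2: [0], 3: [2], 4: [3]}
n = len(links)

# Column-stochastic transition matrix, built as in the PageRank sketch above.
M = np.zeros((n, n))
for i, outgoing in links.items():
    for j in outgoing:
        M[j, i] = 1.0 / len(outgoing)

# Human evaluators mark a small seed set of pages as trustworthy.
seeds = [0, 1]
t = np.zeros(n)
t[seeds] = 1.0 / len(seeds)   # trust is initially concentrated on the seeds

alpha = 0.85
trust = t.copy()
# Biased power iteration: instead of jumping to a random page, the
# "random surfer" jumps back to one of the trusted seed pages.
for _ in range(100):
    trust = (1 - alpha) * t + alpha * M @ trust

# Pages far (in link distance) from the seeds end up with little trust.
print(trust.round(3))
```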

At this point it should be clear that human intervention, although indirect, plays a very important role in the process of building the SERP (Search Engine Results Page). So what procedure does Google's algorithm follow every time we type a keyword into the search box? First, a clarification on the meaning of the word "algorithm". Besides being confused far too often with the mathematical term "logarithm", which has nothing to do with it, it is wrongly considered something artificial, necessarily complex, or purely technological. It is instead a word that, stripped of some of its connotations, could simply be a synonym of "procedure". In a certain sense, even a recipe for making a cake could be seen as an example of an algorithm; an instruction manual even more so. What, then, makes an algorithm a "special procedure"? The fact that it consists of a finite number of instructions (and therefore terminates), that these instructions can be interpreted in only one way, and that they always lead to the same result when given the same input. Furthermore, it must be general, that is, applicable to every problem of the class it addresses. In IT and journalistic usage, however technical, the term algorithm is by now extended to any sequence of instructions that can be fed to an automaton.
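A classic example of such a "special procedure", added here purely as an illustration, is Euclid's algorithm for the greatest common divisor: a finite number of unambiguous steps that always yield the same result for the same input and work for any pair of positive integers.

```python
def gcd(a: int, b: int) -> int:
    """Euclid's algorithm: finite, unambiguous and deterministic."""
    while b != 0:          # guaranteed to terminate:
        a, b = b, a % b    # b strictly decreases at every step
    return a

print(gcd(252, 105))  # -> 21, the same answer on every run and every machine
```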

The key steps of a Google search up to the results page

Once it has crawled and indexed the pages of our website, Google's algorithm proceeds as follows whenever a phrase or keyword is typed (a toy sketch of the three steps follows the list):

  • Search for exact matches of the phrase/keyword (matching)
  • Search for semantically related matches of the phrase/keyword (meaning)
  • Production of an ordered list of web pages by the ranking algorithms (positioning)
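A toy sketch of those three steps (the corpus and the synonym table are invented; a real search engine uses far richer signals and precomputed indexes):

```python
# Tiny corpus standing in for the indexed copy of the web.
pages = {
    "page_a": "chocolate cake recipe with dark chocolate",
    "page_b": "how to bake a sponge cake at home",
    "page_c": "history of the bicycle",
}

# Hypothetical synonym table used for the "meaning" step.
synonyms = {"dessert": {"dessert", "cake", "sponge"}}

def search(query: str) -> list[str]:
    terms = query.lower().split()
    expand = lambda t: synonyms.get(t, {t})

    # 1. Matching: pages that contain every query term literally.
    results = {p for p, text in pages.items() if all(t in text for t in terms)}

    # 2. Meaning: widen the net with semantically related terms.
    results |= {p for p, text in pages.items()
                if all(any(s in text for s in expand(t)) for t in terms)}

    # 3. Positioning: order the candidates by a score
    #    (here simply how often the expanded terms occur).
    score = lambda p: sum(pages[p].count(s) for t in terms for s in expand(t))
    return sorted(results, key=score, reverse=True)

print(search("dessert"))      # -> ['page_b', 'page_a'], found via the synonym table
print(search("cake recipe"))  # -> ['page_a'], found by exact matching
```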

All these steps are very quick because the structure of a search engine relies on so-called "data centers", gigantic warehouses containing high-performance computers specialized in single tasks (servers). The actual search does not take place by querying every computer in the world on which shared resources (web pages) reside, but only a small number of machines concentrated in these data centers, which hold a copy of the contents of all the indexed sites. Worldwide, for example, Google data centers, including those of its partners, number "only" 34. Markov chains are not explored directly but through a matrix representation of the graph (which serves only as a model of the web). In essence, the mathematical translation of the algorithm's processes guarantees the speed we are used to when obtaining a response from the search engine.

Google may decide to scan our site and keep a copy of it on its data center servers either spontaneously or following a report from us (via a tool called Google Search Console). Not all pages are indexed, but only those that, according to its parameters (all verifiable), do not present problems. After indexing comes positioning, which happens whenever users search through Google and is the result of the algorithm's intervention. Each page's position relates to a particular phrase or keyword, and it is obviously not an absolute position: it also varies with the geographical location from which the search is started and with the personal information that the user has allowed the browser to store and share.

It is always the algorithms that produce the ordered list of the search engine results pages (SERP); there is no office of real people in charge of selecting content, rewarding one source and discarding another. Human intervention, as already explained, is limited to the feedback of the evaluators (search quality raters), and this feedback is always translated into parameters compatible with automatic machine learning.
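Purely as an illustration of how human judgments could be translated into machine-readable parameters (an assumption made for this sketch, not a description of Google's actual pipeline), rater labels can act as training targets for a model that learns how much weight to give each page signal:

```python
import numpy as np

# Hypothetical feature vectors for some (query, page) pairs:
# [expertise signal, authoritativeness signal, trustworthiness signal].
features = np.array([
    [0.9, 0.8, 0.9],
    [0.2, 0.1, 0.3],
    [0.7, 0.6, 0.5],
    [0.1, 0.4, 0.2],
])
# Quality-rater labels on a 0..1 scale (invented for the example).
ratings = np.array([1.0, 0.1, 0.8, 0.2])

# Plain least squares: the human judgments become numeric weights
# that the ranking function can then apply automatically at scale.
weights, *_ = np.linalg.lstsq(features, ratings, rcond=None)

# Score a new, unrated page with the learned weights.
new_page = np.array([0.6, 0.7, 0.8])
print(weights.round(2), round(float(new_page @ weights), 2))
```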
