
“PageRank of Google Search: The Algorithm that Organizes the Web”


PageRank in Google Search is tied to increased revenue in the internet marketing business of Utak Henyo, a social enterprise that my marketing team recently started. I became involved in social enterprise, a business model that directs profit toward charitable or humanitarian purposes, to ensure the sustainability of the youth development programs of the International Center for Youth Development (ICYD), a direct partner of the Department of Education (DepEd) and a non-profit corporation that I founded and have led since 2008.
The early World Wide Web (WWW) of the 1990s was in chaos, and this led Page and Brin to create the PageRank algorithm to organize and make sense of the vast ocean of the web, and to write Google’s mission statement: “to organize the world’s information and make it universally accessible and useful”. The first algorithm used in Google Search was intended to rank all web pages globally by the relative importance they show in search engine results. PageRank is named after Google co-founder Lawrence Edward Page, who received a computer engineering degree from the University of Michigan in 1995. He went on to doctoral studies at Stanford University (1998), where he met Sergey Brin, his co-founder at Google. Page and Brin were interested in putting order to the gigantic, heterogeneous mass of data on the web and in maximizing the use of its link structure and text. They studied related work by Pitkow [Pit97], whose thesis covered “World Wide Web Ecologies”; Weiss [WVS+96], who discussed clustering methods that give importance to link structure; Kleinberg [Kle98], who developed an insightful model of the web; and Spertus [Spe97], who discussed various applications of link structure.

PageRank is one of the algorithms used in Google’s search engine. Larry Page described the perfect search engine as “understanding exactly what you mean and giving you back exactly what you want”.

THE BASICS OF GOOGLE SEARCH


To realize this vision, Google Search today follows three (3) steps: “crawling”, “indexing”, and “serving and ranking.”


Step 1: Crawling


Crawling is the process by which Googlebot, a software program that crawls like a spider, navigates and documents new web pages to be added to the Google Index, a giant database stored on huge clusters of computers. The program decides which pages or sites to visit and archive, and how frequently. Googlebot is also called a spider, probably because the “web” in World Wide Web (WWW), created by Tim Berners-Lee at CERN in Switzerland in 1990, can be compared to a humongous spider web. The spider records new web page URLs and sitemap data submitted by webmasters. It follows possible new links and ignores duplicate information. It does not crawl pages blocked by robots.txt, although such pages can still be recorded if they are linked from another page or site, and it cannot crawl pages that are inaccessible to anonymous users.
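To make this concrete, here is a minimal Python sketch of that crawling loop. It runs on a toy in-memory web instead of the real network, and the names (TOY_WEB, crawl) are my own; it only illustrates the frontier-and-visited-set idea, not how Googlebot is actually built, and it skips robots.txt handling and crawl scheduling.

```python
from collections import deque

# A toy, in-memory "web": each URL maps to the URLs it links to.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}

def crawl(seed_urls, web, max_pages=100):
    frontier = deque(seed_urls)   # URLs waiting to be visited
    visited = set()               # pages already crawled, so duplicates are ignored
    discovered = []               # order in which new pages were found

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        discovered.append(url)            # hand the page off for indexing
        for link in web.get(url, []):     # follow outgoing links to new pages
            if link not in visited:
                frontier.append(link)
    return discovered

print(crawl(["a.example"], TOY_WEB))   # ['a.example', 'b.example', 'c.example']
```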


Step 2: Indexing


Indexing is the process of understanding the content of a page and site after it has been discovered by the spiders. Google analyzes the content through lexical analysis, categorizes videos and images, and stores or adds it to the Google Search Index, the colossal database that covers hundreds of billions of web pages and exceeds 100,000,000 gigabytes in size. Indexing is comparable to the index at the back of a book in a university library, comprising lists of words and where they appear. The system can process most kinds of content, with the exception of some rich media files. It does not limit itself to word analysis but also uses other useful information, such as the locations and interests of web surfers. Nowadays Google Search does not just match keywords from search entries; it also helps users access millions of books from major libraries and public data from institutions like the World Bank.
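As a rough illustration of the “index at the back of the book” analogy, the sketch below builds a tiny inverted index in Python: each word maps to the set of pages that contain it. The sample pages and the whitespace-splitting “lexical analysis” are simplifications of my own, not Google’s actual pipeline.

```python
from collections import defaultdict

# Toy documents standing in for crawled pages.
PAGES = {
    "page1": "pagerank organizes the web",
    "page2": "the web is a graph of links",
}

def build_index(pages):
    index = defaultdict(set)                 # word -> set of pages containing it
    for url, text in pages.items():
        for word in text.lower().split():    # crude lexical analysis: split on whitespace
            index[word].add(url)
    return index

index = build_index(PAGES)
print(sorted(index["web"]))   # ['page1', 'page2']
```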


Step 3: Serving and Ranking


When a web surfer enters a keyword in Google Search, the system queries the index and returns results ranked by relevance. Matt Cutts, a search quality engineer at Google, stated that their system uses about 200 factors or criteria to ensure that the information sent back is germane to the user’s needs. Some of these factors are the “words of your query”, “relevance and usability of pages”, “expertise of sources”, and “your location and settings”. Google also monitors the user experience, including the speed of web pages and their user-friendliness.


a. Meaning of your query


What information are you looking for? What is the intention behind the query? Is it a specific or a broad search?
Google developed programs based on the latest research in Natural Language Understanding (NLU) that address these questions, interpret spelling mistakes, and classify various kinds of questions. They created a synonym system that matches similar meanings; it took about five years to finish and improved results in over 30% of searches across various languages.
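As a toy picture of what matching similar meanings can look like, the sketch below expands a query with hand-written synonyms. The real system is learned from data over years, so the SYNONYMS table and expand_query function here are purely illustrative assumptions.

```python
# Hand-written synonym table; the real system is learned, so this is only a toy.
SYNONYMS = {
    "film": {"movie", "picture"},
    "buy": {"purchase"},
}

def expand_query(query):
    terms = set(query.lower().split())
    for term in list(terms):
        terms |= SYNONYMS.get(term, set())   # add known synonyms of each query term
    return terms

print(sorted(expand_query("buy film")))   # ['buy', 'film', 'movie', 'picture', 'purchase']
```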


To address the need for fresh information, Google also created freshness algorithms that surface the latest information and trends, for example, PBA scores.


b. Relevance of webpages


Which information is relevant to your query?
Google created search algorithms that detect quantifiable signals to estimate relevance. The most basic indication of relevance is “when a page contains the same keywords as your search query.” A limitation of such programs is handling abstract or subjective content, for example, complex political or religious views.
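The most basic version of that keyword signal can be sketched as a simple term count, as below. This is my own toy scoring function, not Google’s relevance model.

```python
def keyword_score(query, page_text):
    # Count how often each query term occurs on the page.
    words = page_text.lower().split()
    return sum(words.count(term) for term in query.lower().split())

print(keyword_score("pagerank web",
                    "pagerank organizes the web and the web is large"))   # 3
```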


c. Quality of content


How trustworthy is the information?
The search algorithms try to determine whether a source demonstrates “expertise, authoritativeness, and trustworthiness”. Google also uses spam algorithms to detect low-quality pages and ensure that such links do not appear prominently in search results.


d. Usability of webpages


Does the page adjust to various devices such as mobile phones, desktops, and tablets? Is it viewable on a slow internet connection?
Google also created algorithms to evaluate whether a page is user friendly. Since January 2018, these programs have also considered the speed of the page.


e. Context and settings


What is most relevant and useful at the moment?
Google considers your location and past search history to decide which results will be most useful.
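Putting the factor categories above together, the search engine ultimately has to combine many signals into one score per page. The sketch below does this with a made-up weighted sum; both the signal names and the weights are assumptions for illustration, since Google’s roughly 200 factors and their weights are not public.

```python
# Made-up weights over made-up signals, purely for illustration.
WEIGHTS = {"relevance": 0.5, "quality": 0.3, "usability": 0.1, "freshness": 0.1}

def combined_score(signals):
    """signals: dict mapping signal name -> value in [0, 1]."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

page = {"relevance": 0.9, "quality": 0.7, "usability": 1.0, "freshness": 0.4}
print(round(combined_score(page), 2))   # 0.8
```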


PAGERANK: THE FIRST ALGORITHM


I have cited several algorithms so far: search, synonym, spam, and PageRank. PageRank was the first of these. As I mentioned earlier, the dark ocean of the web led to the creation of PageRank. The original goal of PageRank “was a way to sort backlinks so if there were a large number of backlinks for a document, the ‘best’ backlinks could be displayed first”. Every time a professor asks students to write a paper, he asks everyone to cite their references. Every link on the web is like an academic citation or reference. For certain, Google’s own page has a massive number of backlinks pointing to it. In Figure 1, web pages A and B are backward links (backlinks) of web page C, or equivalently, forward links pointing to C.
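The Figure 1 relationship can be written down as a small adjacency list, as in the Python sketch below; the page names and the backlinks helper are mine, purely for illustration.

```python
# Figure 1 as an adjacency list: each page maps to the pages it points to.
links = {
    "A": ["C"],   # A points to C, so A is a backward link (backlink) of C
    "B": ["C"],   # B points to C as well
    "C": [],
}

def backlinks(graph, target):
    # Pages whose forward links include the target page.
    return [page for page, outs in graph.items() if target in outs]

print(backlinks(links, "C"))   # ['A', 'B']
```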

PageRank is described as follows: “A page has a high rank if the sum of the ranks of its backlinks is high. This covers both the case when a page has many backlinks and when a page has a few highly ranked backlinks.”
 
PageRank is defined as follows: “Let u be a web page. Then let Fu be the set of pages u points to and Bu be the set of pages that point to u. Let Nu = |Fu| be the number of links from u and let c be a factor used for normalization (so that the total rank of all web pages is constant). We begin by defining a simple ranking, R, which is a slightly simplified version of PageRank:

R(u) = c Σ (v ∈ Bu) R(v) / Nv”
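Here is a short Python sketch of that simplified ranking R(u) = c Σ R(v)/Nv, computed by repeated iteration. The example graph, the iteration count, and the normalization step are illustrative choices of mine, not the paper’s actual implementation.

```python
# Example graph with no dangling pages: A links to B and C, B links to C, C links back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def simple_pagerank(graph, iterations=50):
    pages = list(graph)
    rank = {p: 1.0 / len(pages) for p in pages}          # start from uniform ranks
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, outs in graph.items():
            for u in outs:
                new_rank[u] += rank[v] / len(outs)        # v shares its rank equally over its Nv forward links
        total = sum(new_rank.values())                    # normalization factor c keeps ranks summing to 1
        rank = {p: r / total for p, r in new_rank.items()}
    return rank

print({p: round(r, 2) for p, r in simple_pagerank(graph).items()})
# {'A': 0.4, 'B': 0.2, 'C': 0.4} -- C is backed by two pages, so it ranks highest alongside A, which C points to
```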

The simplified version of PageRank encountered issues such as rank sinks and dangling links. The Google team developed the rank source model to overcome the rank sink (see Figure 4). They described a rank sink as follows:
“Consider two web pages that point to each other but to no other page. And suppose there is some web page which points to one of them. Then, during iteration, this loop will accumulate rank but never distribute any rank (since there are no outedges)”.

Dangling links are defined as “simply links that point to any page with no outgoing links…. Because dangling links do not affect the ranking of any other page directly, we simply remove them from the system until all the PageRanks are calculated. After all the PageRanks are calculated, they can be added back in, without affecting things significantly.”
To implement PageRank, the team built a complete crawling and indexing system (remember the index at the back of a book) with a repository of 24 million pages in 1998.
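Below is a rough sketch, in the same toy style as before, of the two fixes just quoted: dangling pages are pruned before iterating, and a small “rank source” amount is handed to every page each round so a closed loop cannot swallow all the rank. The 0.85 split between passed-on rank and source rank is the commonly cited damping value, not a figure from this article, and the example graph is my own.

```python
def pagerank_with_rank_source(graph, iterations=50, c=0.85):
    # 1. Remove dangling pages (no outgoing links) and any links pointing at them.
    dangling = {p for p, outs in graph.items() if not outs}
    pruned = {p: [q for q in outs if q not in dangling]
              for p, outs in graph.items() if p not in dangling}

    pages = list(pruned)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # 2. Rank source: every page receives a small fixed share each iteration,
        #    so the D/E loop below cannot accumulate rank without limit.
        new_rank = {p: (1 - c) / n for p in pages}
        for v, outs in pruned.items():
            for u in outs:
                new_rank[u] += c * rank[v] / len(outs)
        rank = new_rank
    return rank

# D and E form a rank sink (they only point at each other); F is a dangling page.
web = {"A": ["D"], "D": ["E"], "E": ["D"], "F": []}
print({p: round(r, 2) for p, r in pagerank_with_rank_source(web).items()})
# approximately {'A': 0.05, 'D': 0.49, 'E': 0.46}
```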


The original applications of PageRank are not just to give a global ranking of all pages, but also to surface high-quality results and essentials, estimate web traffic, and assist users in deciding which links in a long list are “more likely to be interesting”.


References
1. Page, Lawrence; Brin, Sergey; Motwani, Rajeev; and Winograd, Terry. “The PageRank Citation Ranking: Bringing Order to the Web.” Stanford InfoLab Publication Server. Stanford InfoLab, November 11, 1999. http://ilpubs.stanford.edu:8090/422/.
2. https://www.youtube.com/watch?v=BNHR6IQJGZs&t=22s
3. Cabagay, Tomas B. Introduction to Computing.
4. https://support.google.com/webmasters/answer/70897?hl=en

(The author is dedicating this article to Dr. Allan Sioson, Ateneo Professor and Outstanding Young Scientist Awardee)