Just last night I finished reading this paper [KLEIN99], that presumably acted as the starting point for the research into the popular Pagerank algorithm used by Google. It presents a clever algorithm for searching queries over the World Wide Web (WWW) by addressing two major problems (scarcity and abundance) faced by the state-of-the-art text-based search engines at the time. Scarcity appears when the queries are so specific that only a very few pages in the WWW contain the query word, while abundance is when this page count is huge, typically when dealing with broad-topic queries. The examples which the paper gives for such queries are “Does Netscape support the JDK 1.1 code-signing API?” for specific queries and “Find information about the Java programming language.” for broad-topic queries. Thus, to provide an efficient search, one needs to filter out the more relevant pages, which the author calls authoritative. However, an immediate consequence of this approach is the problem while quantifying the term authoritative. More concretely, in the words of Kleinberg, given a particular page, how do we tell whether it is authoritative?
A simple heuristic that strikes now is to order the pages, that contain the given input query, on the basis of their in-degrees. This way, we are giving more weight to pages that are referenced by a large number of pages in the WWW, indicating a sense of importance, or in this case, possible authority over the other pages. However, an easy hack around this situation is for the owner of a page to create a bunch of dummy purposes whose sole purpose is to have links pointing to the page of interest. This increase in in-degree will enhance the authority of the page to an arbitrary extent and make it appear as a search result for whatever query this owner wants.
The solution to the problem is to try creating a focussed subgraph of the WWW upon which the search will be performed. This graph should be small (to not overload the servers) and must contain many of the strong authorities (pages that are most relevant to the query). Kleinberg makes an attempt at this construction through the use of principal eigenvectors of and , where is the adjacency matrix for a reduced subgraph that is constructed as follows. A primary assumption made here is that has a non-zero spectral gap, which implies that the graph is expanding [KOT12].
Assume that the WWW is represented as a directed graph where the pages form the vertices and there is an edge if page has a link to page . Then, given a broad-topic query , determine a subgraph on which the search query will be efficient to run. To obtain :
- Select a parameter , say 200, and obtain highest ranked pages for using a text-based search engine (e.g. Alta Vista at the time). Call this set (augment this with the links between pages in to obtain a graph). Note that is small (by keeping small) and potentially contains many strong authorities (relying on our faith in the search engine used). However, fails to be rich in relevant pages. This is because of the problems in text based searching where queries like “What is a good search engine?” will probably never return any of the existing search engine websites in the output because most of these websites do not contain the words “search engine” in them.
- Expand by adding pages (and links) that enter and leave and call this new graph . We have now added potential relevant pages to our graph under the assumption that these “neighbor” pages contain information that is crucial to the query.
Once this graph has been obtained, some more tweaking needs to be done to avoid returning pages that contain many navigational links. For example, a page that contains many links to various parts of itself (frequently used in pages containing long articles with sections), will be assumed to be of high authority because we are biasing our pages based on their indegree. Hence, obtain the final graph by removing all those edges from which connect pages in . Finally, order pages in in decreasing order of in-degree as an estimate of their authority.
The final trick employed by Kleinberg to weight authority of a page relative to its neighbors in is to mark those set of pages as strong authorities which have a high in-degree in and have a significant overlap in the pages linking to them. The idea is based on the existence of hubs, which are major sources of links to these high-in-degree pages. Essentially, this ensures that a page is authoritative if it pointed by a strong hub, and a hub is strong if it points to an authority. To work around this circularity, Kleinberg operates an iterative algorithm to update the degree of authority and degree of hub-ness, through parameters called authority weight and hub weight, respectively:
- To each page in , assign an authority weight and a hub weight .
- Normalize these weights to maintain the invariant .
- (-operation) Update the authority weights as . Here, is the set of edges in . Essentially, this is saying that the authority weight of the page is updated as the sum of hub weights of all the pages pointing to it.
- (-operation) Similarly, update the hub weights as .
- Normalize both and for each page .
- Repeat steps 3-5 until convergence.
- Return the top pages (for required ) that have the highest authority weights.
The paper presents a proof that the weight vectors will converge (assuming a non-zero spectral gap in ) and provides experimental evidence to show this typically happens within 20-30 steps in practice. Clearly, this clever reduction in the search space for the query to a graph that is very likely to contain the strongest authorities and many relevant pages has siginificantly reduced the search time as well as improved the quality of search results (as is evidenced by the results in the paper). This cleverness is hence, undoubtedly, used as a starting point for the giant of search engines today.
[KLEIN99] Kleinberg, Jon M. “Authoritative sources in a hyperlinked environment.”Journal of the ACM (JACM) 46.5 (1999): 604-632.
[KOT12] Kotowski, Marcin, and Micha l Kotowski. “Lectures on expanders.” (2012).