The ‘Clever’ Algorithm – HITS

Just last night I finished reading this paper [KLEIN99], which presumably acted as a starting point for the research into the popular PageRank algorithm used by Google. It presents a clever algorithm for answering search queries over the World Wide Web (WWW) by addressing two major problems (scarcity and abundance) faced by the state-of-the-art text-based search engines at the time. Scarcity appears when a query is so specific that only a very few pages in the WWW contain the query words, while abundance is when this page count is huge, typically when dealing with broad-topic queries. The examples the paper gives for such queries are “Does Netscape support the JDK 1.1 code-signing API?” for specific queries and “Find information about the Java programming language.” for broad-topic queries. Thus, to provide an efficient search, one needs to single out the most relevant pages, which the author calls authoritative. However, an immediate consequence of this approach is the problem of quantifying the term authoritative. More concretely, in the words of Kleinberg, given a particular page, how do we tell whether it is authoritative?

A simple heuristic that comes to mind is to order the pages containing the given query by their in-degrees. This way, we give more weight to pages that are referenced by a large number of pages in the WWW, indicating a sense of importance, or in this case, possible authority over the other pages. However, an easy hack around this is for the owner of a page to create a bunch of dummy pages whose sole purpose is to have links pointing to the page of interest. This increase in in-degree will inflate the apparent authority of the page to an arbitrary extent and make it appear as a search result for whatever query the owner wants.
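To make the in-degree heuristic (and how easily it is gamed) concrete, here is a minimal sketch in Python; the page names and the links dictionary are made up purely for illustration:

    # links[p] lists the pages that page p points to (illustrative data only).
    links = {
        "java.sun.com": [],
        "blog-a":       ["java.sun.com"],
        "blog-b":       ["java.sun.com", "blog-a"],
        "my-page":      [],
        "dummy-1":      ["my-page"],
        "dummy-2":      ["my-page"],
        "dummy-3":      ["my-page"],
    }

    # Count in-degrees: the number of pages linking to each page.
    in_degree = {p: 0 for p in links}
    for p, targets in links.items():
        for q in targets:
            in_degree[q] += 1

    # Rank by in-degree. Three dummy pages are enough to push "my-page"
    # (in-degree 3) above the genuinely referenced "java.sun.com" (in-degree 2).
    ranking = sorted(links, key=lambda p: in_degree[p], reverse=True)
    print(ranking)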

The solution to the problem is to create a focused subgraph of the WWW upon which the search will be performed. This graph should be small (so as not to overload the servers) and must contain many of the strong authorities (pages that are most relevant to the query). Kleinberg identifies the authorities through the principal eigenvectors of AA^T and A^TA, where A is the adjacency matrix of a reduced subgraph G_\sigma that is constructed as follows. A primary assumption made here is that A has a non-zero spectral gap, which implies that the graph is expanding [KOT12].

Assume that the WWW is represented as a directed graph G=(V,E) where the pages form the vertices and there is an edge (i,j) \in E if page i has a link to page j. Then, given a broad-topic query \sigma, the goal is to determine a subgraph S_\sigma on which the search query can be run efficiently. To obtain S_\sigma:

  1. Select a parameter t, say 200, and obtain the t highest-ranked pages for \sigma using a text-based search engine (e.g. AltaVista at the time). Call this set R_\sigma (augment this with the links between pages in R_\sigma to obtain a graph). Note that R_\sigma is small (by keeping t small) and potentially contains many strong authorities (relying on our faith in the search engine used). However, R_\sigma fails to be rich in relevant pages. This is because of the problems in text-based searching, where queries like “What is a good search engine?” will probably never return any of the existing search engine websites in the output because most of these websites do not contain the words “search engine” in them.
  2. Expand R_\sigma by adding the pages (and the corresponding links) that link to or are linked from pages in R_\sigma, and call this new graph S_\sigma. We have now added potentially relevant pages to our graph under the assumption that these “neighbor” pages contain information that is crucial to the query.

Once this graph S_\sigma has been obtained, some more tweaking needs to be done to avoid returning pages whose in-degree is inflated by purely navigational links. For example, a page that contains many links to various parts of itself (frequently used in pages containing long articles with sections) would be assumed to be of high authority because we are biasing our estimate on in-degree. Hence, obtain the final graph G_\sigma by removing from S_\sigma the intrinsic edges, i.e. links between pages on the same domain, since these are usually navigational and say little about authority. Finally, order the pages in G_\sigma in decreasing order of in-degree as a first estimate of their authority.
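A rough sketch of the whole construction in Python; the helpers top_pages (the text-based engine), out_links, in_links and domain_of are hypothetical stand-ins for external services, not part of the paper, and the paper’s cap on how many in-linking pages are added per page is omitted here:

    def build_G_sigma(query, t, top_pages, out_links, in_links, domain_of):
        # Step 1: root set R_sigma -- the t highest-ranked pages from a
        # text-based search engine (hypothetical helper top_pages).
        R = set(top_pages(query, t))

        # Step 2: base set S_sigma -- R plus pages that link to or are
        # linked from pages in R (hypothetical helpers in_links/out_links).
        S = set(R)
        for p in R:
            S.update(out_links(p))
            S.update(in_links(p))

        # Keep only edges between pages of S, dropping "intrinsic" links
        # (links between pages on the same domain), which are usually
        # navigational and say little about authority.
        edges = set()
        for p in S:
            for q in out_links(p):
                if q in S and domain_of(p) != domain_of(q):
                    edges.add((p, q))   # (p, q) means p links to q
        return S, edges

    def rank_by_in_degree(S, edges):
        # First estimate of authority: order pages of G_sigma by in-degree.
        in_deg = {p: 0 for p in S}
        for (_, q) in edges:
            in_deg[q] += 1
        return sorted(S, key=lambda p: in_deg[p], reverse=True)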

The final trick employed by Kleinberg to weight the authority of a page relative to its neighbors in G_\sigma is to mark as strong authorities those pages which have a high in-degree in G_\sigma and a significant overlap in the sets of pages linking to them. The idea is based on the existence of hubs, which are major sources of links to these high-in-degree pages. Essentially, a page is authoritative if it is pointed to by strong hubs, and a hub is strong if it points to strong authorities. To work around this circularity, Kleinberg runs an iterative algorithm that updates the degree of authority and the degree of hub-ness through quantities called the authority weight and the hub weight, respectively (a code sketch follows the steps below):

  1. To each page p in G_\sigma, assign an authority weight x^{\langle p \rangle} = 1 and a hub weight y^{\langle p \rangle} = 1.
  2. Normalize these weights to maintain the invariant \sum_p \left( x^{\langle p \rangle} \right)^2 =\sum_p \left( y^{\langle p \rangle} \right)^2 = 1.
  3. (\mathcal{I}-operation) Update the authority weights as x^{\langle p \rangle} \gets \sum_{q:(q,p) \in E(G_\sigma)} y^{\langle q \rangle}. Here, E(G_\sigma) is the set of edges in G_\sigma. Essentially, this is saying that the authority weight of the page p is updated as the sum of hub weights of all the pages pointing to it.
  4. (\mathcal{O}-operation) Similarly, update the hub weights as y^{\langle p \rangle} \gets \sum_{q:(p,q) \in E(G_\sigma)} x^{\langle q \rangle}, i.e. the hub weight of page p becomes the sum of the authority weights of all the pages it points to.
  5. Normalize both x^{\langle p \rangle} and y^{\langle p \rangle} for each page p.
  6. Repeat steps 3-5 until convergence.
  7. Return the top k pages (for required k) that have the highest authority weights.
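A possible Python implementation of the steps above, with edges given as pairs (i, j) meaning page i links to page j (as in the definition of G earlier); the 30-iteration default is just a placeholder for “until convergence”:

    import math

    def hits(pages, edges, iterations=30):
        # Step 1: every page starts with authority weight 1 and hub weight 1.
        x = {p: 1.0 for p in pages}   # authority weights
        y = {p: 1.0 for p in pages}   # hub weights

        def normalise(w):
            # Steps 2 and 5: scale so that the squares of the weights sum to 1.
            norm = math.sqrt(sum(v * v for v in w.values()))
            if norm > 0:
                for p in w:
                    w[p] /= norm

        normalise(x)
        normalise(y)

        for _ in range(iterations):   # step 6: iterate until (approximate) convergence
            # Step 3 (I-operation): authority of p = sum of hub weights
            # of the pages pointing to p.
            x = {p: 0.0 for p in pages}
            for (q, p) in edges:
                x[p] += y[q]
            # Step 4 (O-operation): hub weight of p = sum of authority
            # weights of the pages p points to.
            y_new = {p: 0.0 for p in pages}
            for (p, q) in edges:
                y_new[p] += x[q]
            y = y_new
            # Step 5: renormalise both weight vectors.
            normalise(x)
            normalise(y)
        return x, y

    def top_authorities(x, k):
        # Step 7: the k pages with the highest authority weights.
        return sorted(x, key=x.get, reverse=True)[:k]

Used together with the construction sketched earlier, one would call hits on the pages and edges of G_\sigma and then take the top authorities of the resulting authority-weight vector.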

The paper presents a proof that the weight vectors converge (assuming a non-zero spectral gap in G_\sigma) and provides experimental evidence that this typically happens within 20-30 iterations in practice. Clearly, this clever reduction of the search space to a graph that is very likely to contain the strongest authorities and many relevant pages significantly reduces the search time as well as improves the quality of the search results (as evidenced by the results in the paper). This cleverness was hence, undoubtedly, a starting point for the giant of search engines today.
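A brief sketch of why this is essentially the power method, under the spectral-gap assumption mentioned above: collecting the weights into vectors x and y indexed by the pages of G_\sigma and letting A be the adjacency matrix of G_\sigma, the two operations read

    x \gets A^{\mathsf{T}} y, \qquad y \gets A x,

so, ignoring the normalization (which only rescales the vectors), k full iterations give

    y^{(k)} \propto (A A^{\mathsf{T}})^{k} \, y^{(0)}, \qquad x^{(k)} \propto (A^{\mathsf{T}} A)^{k-1} A^{\mathsf{T}} \, y^{(0)},

which, when the principal eigenvalue is strictly separated from the rest, converge in direction to the principal eigenvectors of A A^{\mathsf{T}} (hub weights) and A^{\mathsf{T}} A (authority weights).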

References:

[KLEIN99] Kleinberg, Jon M. “Authoritative sources in a hyperlinked environment.” Journal of the ACM (JACM) 46.5 (1999): 604-632.

[KOT12] Kotowski, Marcin, and Michał Kotowski. “Lectures on expanders.” (2012).
