Monika Henzinger is the research director of Google Europe, and is giving the keynote at Hypertext 2005, which started this morning in Salzburg.
She’s started by saying how many users and searches there are (a lot) and how old search engines just did text search, whereas Google also analyses links. Future search engines will also use concepts (I think that means: you’re searching on X, which is related to Y, so we’ll give you these results)
Goal of Google’s search engine: retrieve documents with information content that is relevant to the user’s information need. It won’t necessarily find the answer; that’s up to the user.
Ranking the retrieved documents is the hardest task.
Usually you base ranking on how often a query term occurs within a single document (its term frequency) and how often it occurs across all the documents (its document frequency).
These traditional techniques work well if all the documents you’re searching follow the same format, for instance, they’re all newspaper articles, or all scientific articles. On the web, this doesn’t work, because there are many different kinds of websites, because there is a lot of topic mixture, and because there are lots of people gaming the system.
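To make the traditional approach concrete, here is a rough Python sketch of the kind of term-frequency/document-frequency scoring she is alluding to. The toy corpus and the exact formula are my own illustration, not anything from the talk:

```python
import math
from collections import Counter

# Toy corpus: in the traditional setting all documents have a similar format.
docs = [
    "salzburg conference on hypertext and hypermedia",
    "google search engine ranks web documents",
    "newspaper article about search engines",
]

def score(query, doc):
    """Term frequency in this document, discounted by how common
    the term is across all documents (rare terms count for more)."""
    tf = Counter(doc.split())
    total = 0.0
    for term in query.split():
        df = sum(1 for d in docs if term in d.split())  # document frequency
        if df:
            total += tf[term] * math.log(len(docs) / df)
    return total

print(max(docs, key=lambda d: score("search engine", d)))
```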
Hyperlink analysis is the solution; it was invented in two places at nearly the same time:
- PageRank (–>Google)
- the HITS algorithm, developed by a postdoc at IBM, not used by any commercial search engine.
(She explains PageRank)
–> Works well for distinguishing high-quality from low-quality web pages
–> If all your pages are high-quality, PageRank doesn’t help or hinder
Example: Google “Bush” and you get the White House’s official page. Of course, there are other things besides PageRank that help with that, “but we don’t talk about those other things.”
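She didn’t give the formula, but the idea is easy to sketch: rank flows along links, and a “random surfer” sometimes jumps to an arbitrary page. A toy Python version (the graph, damping factor and iteration count are all made up for illustration; the real thing runs over billions of pages):

```python
# Toy PageRank: repeatedly redistribute rank along outlinks, with a damping
# factor modelling the random surfer who sometimes jumps to a random page.
links = {
    "whitehouse.gov": ["usa.gov"],
    "usa.gov": ["whitehouse.gov", "nasa.gov"],
    "nasa.gov": ["usa.gov"],
    "spam-page.example": ["whitehouse.gov"],  # links out, but nobody links to it
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(f"{page:20s} {score:.3f}")
```

The pages nobody links to end up with only the baseline “random jump” rank, which is the high-quality vs. low-quality distinction she described.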
To start off PageRank, to find the FIRST ranks (maybe they still do this regularly to determine PageRank?): “Random walk” – they did a random “walk” through the web by automatically following random outlinks from each page. Problem: they got stuck in garden.com, which has hardly any links OUT of the site. They had a rule that at regular intervals they jumped to a random page they’d already visited, but even then 80% of the visited pages were at garden.com. So instead they started jumping to random HOSTS they’d visited, which fixed this. There is a bias depending on where they start “walking” – they started at Yahoo, so sites “close” to Yahoo are overrepresented.
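Roughly what such a walk looks like, on a tiny made-up web; the key tweak is jumping to a page on a random host seen so far rather than a random page, so that link-poor sites like garden.com can’t trap the walk (all sites and numbers here are invented for illustration):

```python
import random
from collections import Counter

# Tiny made-up web graph (page -> outlinks). The garden.com pages only link
# to each other, which is what trapped the original page-level walk.
web = {
    "yahoo.com/":   ["garden.com/a", "cnn.com/", "mit.edu/"],
    "cnn.com/":     ["yahoo.com/", "garden.com/a", "mit.edu/"],
    "mit.edu/":     ["cnn.com/", "yahoo.com/"],
    "garden.com/a": ["garden.com/b"],
    "garden.com/b": ["garden.com/a"],
}

def host(page):
    return page.split("/")[0]

def walk(start, steps=50_000, jump_prob=0.15):
    """Random walk that occasionally jumps to a page on a random HOST seen
    so far, rather than a random page, so link-poor sites can't dominate."""
    page = start
    pages_by_host = {host(start): [start]}  # visited hosts -> their pages
    counts = Counter()
    for _ in range(steps):
        counts[host(page)] += 1
        pages_by_host.setdefault(host(page), []).append(page)
        if random.random() < jump_prob or not web[page]:
            random_host = random.choice(list(pages_by_host))
            page = random.choice(pages_by_host[random_host])
        else:
            page = random.choice(web[page])
    return counts

print(walk("yahoo.com/").most_common())
```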
There are a number of other ways of getting samples of the web. None of the methods give a truly uniform sample, but they can approximate one. For instance, they can run statistics on the sample to check what kinds of pages tend to get high ranks in their search. (English language? Length? Number of links?)
Walks in 1999, 2000, 2001.
Trying to get away from the
Linkexchange gets a high ranking.
New walk in 2005. The currently most visited hosts:
- extreme-dm.com/tracking (a tracking site for merchants, really many separate pages)
- google.com
- shockwave download
- sitemeter
- adobe (download Acrobat Reader)
- microsoft (download IE)
- cyberpatrol (tells you if a site is secure or not)
- etc.
Has changed a lot since 2001.
They can use this data to get a roughly uniform sample. How many pages are there in a certain domain? In 2005 they started the walk in Switzerland. Many sites forward you to the German site (e.g. google.com –> google.de, same for amazon), so .de is overrepresented, and European sites in general are overrepresented.
.com has grown to 60%
.edu has dropped to 1%
HITS: the other link analysis system.
Neighbourhood graph
Take the query results, then add every page linking to them and every page they link to. On this graph you compute an authority score (good content, shown by inlinks: good if many inlinks, better if the inlinks come from high-authority pages) and a hub score (how good are the links? a good hub has many outlinks, and is even better if it links to many high-authority pages). This is recursive: it’s repeated until the hub and authority scores converge.
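A sketch of that iteration on a tiny invented neighbourhood graph (the page names and link structure are made up; real HITS builds the graph from the actual query results):

```python
import math

graph = {                       # page -> pages it links to
    "directory-page": ["paper-a", "paper-b", "paper-c"],
    "blog-post":      ["paper-a", "paper-b"],
    "paper-a":        ["paper-b"],
    "paper-b":        [],
    "paper-c":        [],
}

def hits(graph, iterations=50):
    hub = {p: 1.0 for p in graph}
    auth = {p: 1.0 for p in graph}
    for _ in range(iterations):
        # authority: sum of the hub scores of the pages linking to you
        for p in graph:
            auth[p] = sum(hub[q] for q in graph if p in graph[q])
        # hub: sum of the authority scores of the pages you link to
        for p in graph:
            hub[p] = sum(auth[q] for q in graph[p])
        # normalise so the scores converge instead of blowing up
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

hub, auth = hits(graph)
print("best hub:      ", max(hub, key=hub.get))    # the directory page
print("best authority:", max(auth, key=auth.get))  # the most linked-to paper
```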
PageRank is better because it’s query-independent and hard to spam. HITS requires you to compute the neighbourhood graph on the fly for each separate search, and it’s easy to spam. However, an advantage is that it finds hubs (good directory pages), which can be really useful.
Improvements:
- weight the outlinks
- break documents into blocks according to content rather than assign the whole page the same PageRank
- unification of HITS and PageRank to some extent, but nothing really dramatic
Other applications of hyperlink analysis:
- Crawling
- figuring out whether a newspaper is local, national, or international (they did this by looking at links to the newspaper from universities)
Current work: News search based on closed captions from television (typed in for deaf viewers)
The goal would be to have Google find info for you in the background that matches what you’re watching on TV.
Problems: there’s a time lag in the typing, and there’s lots of meta-information in broadcast news (“now we want to go back to Mike” – you don’t want to search for that). Also, it has to be real time.
Anyway, lots of interesting problems and (partial) solutions.
New research: query phrase search.