Monika Henzinger is the research director of Google Europe, and is giving the keynote at Hypertext 2005, which started this morning in Salzburg.

She started by saying how many users and searches there are (a lot), and how old search engines just did text search, whereas Google also analyses links. Future search engines will also use concepts (I think that means: you’re searching on X, which is related to Y, so we’ll give you these results).

Goal of Google’s search engine: retrieve documents whose information content is relevant to the user’s information need. It won’t necessarily find the answer; that’s up to the user.
Ranking the retrieved documents is the hardest task.
Usually you base ranking on how often a query term appears within a single document (term frequency) and how often it appears across all the documents (document frequency).
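The traditional scheme she's describing is essentially tf-idf weighting. A minimal sketch in Python (the function name and toy documents are mine, not from the talk):

```python
import math

def tf_idf_rank(query_terms, docs):
    """Rank docs by summed tf-idf of the query terms (toy sketch)."""
    n = len(docs)
    scores = []
    for doc in docs:
        words = doc.lower().split()
        score = 0.0
        for term in query_terms:
            # term frequency: how often the term appears in THIS document
            tf = words.count(term) / len(words)
            # document frequency: how many documents contain the term at all
            df = sum(term in d.lower().split() for d in docs)
            # rarer terms weigh more (inverse document frequency)
            idf = math.log(n / df) if df else 0.0
            score += tf * idf
        scores.append(score)
    # return document indices, best match first
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

Recomputing `df` per document is redundant but keeps the sketch short; a real system would precompute an inverted index.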

These traditional techniques work well if all the documents you’re searching follow the same format, for instance if they’re all newspaper articles or all scientific articles. On the web this doesn’t work, because there are many different kinds of websites, because there is a lot of topic mixture, and because lots of people are gaming the system.

Hyperlink analysis is the solution. It was invented in two places at nearly the same time:

  • PageRank (–>Google)
  • the HITS algorithm, developed by a postdoc at IBM, not used by any commercial search engine.

(She explains PageRank)
–> Works well for distinguishing high-quality from low-quality web pages
–> If all your pages are high-quality, PageRank doesn’t help or hinder
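She didn't show any formulas, but the standard power-iteration version of PageRank can be sketched like this (the damping factor of 0.85 and the dict-of-lists graph format are my assumptions, not from the talk):

```python
def pagerank(links, damping=0.85, iters=50):
    """links: {page: [pages it links to]}. Returns a dict of PageRank scores."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # every page gets a small "random jump" share
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                # a page passes its rank equally to the pages it links to
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # dangling page with no outlinks: spread its rank everywhere
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank
```

Intuitively: a page is important if important pages link to it, which is exactly why a much-linked official page wins for a query like “Bush”.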

Example: Google “Bush” and you get the White House’s official page. Of course, there are other things besides PageRank that help with that, “but we don’t talk about those other things.”

To start off PageRank and find the FIRST ranks (maybe they still do this regularly to determine PageRank?), they did a random “walk” through the web by automatically following random outlinks from each page. Problem: they got stuck in garden.com, which has hardly any links OUT of the site. They had a rule that at regular intervals they jumped to a random page they’d already visited, but 80% of those pages were at garden.com. Instead they started jumping to random HOSTS they’d visited, which fixed this. There is a bias depending on where they start “walking”: they started at Yahoo, so sites “close” to Yahoo are overrepresented.
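The walk she describes might be sketched like this (a toy version: the jump probability, the host-naming convention, and the example graph are all my own illustration, not Google's actual crawler):

```python
import random

def random_walk(links, start, steps=10000, jump_prob=0.15, seed=0):
    """Toy web walk: follow a random outlink, but occasionally (or when
    stuck on a page with no outlinks) jump to a random HOST seen so far."""
    rng = random.Random(seed)
    visits = {}
    hosts = set()
    page = start
    for _ in range(steps):
        visits[page] = visits.get(page, 0) + 1
        hosts.add(page.split("/")[0])  # host = part of the URL before the first "/"
        outs = links.get(page, [])
        if not outs or rng.random() < jump_prob:
            # jump to a visited host's front page (assumed to be the host name)
            page = rng.choice(sorted(hosts))
        else:
            page = rng.choice(outs)
    return visits
```

Jumping to a random visited *page* would be dominated by whichever site has the most pages (the garden.com problem); jumping to a random visited *host* counts each site once, which is the fix she mentions.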

There are a number of other ways of getting samples of the web. With various methods you can’t get a truly uniform sample, but you can approximate one. For instance, they can run statistics to check what kinds of pages tend to get high ranks in their search. (English language? Length? Number of links?)

Walks in 1999, 2000, 2001.
They’re trying to get away from link exchanges, which get high rankings.
New walk in 2005: currently most visited hosts:

  • extreme-dm.com/tracking (a tracking site for merchants, really many separate pages)
  • google.com
  • shockwave download
  • sitemeter
  • adobe (download Acrobat Reader)
  • microsoft (download IE)
  • cyberpatrol (tells you if a site is secure or not)
  • etc.

Has changed a lot since 2001.

You can use this data to get a roughly uniform sample. How many pages are there in a certain domain? In 2005 they started the walk in Switzerland. Many sites forward you to the German site (e.g. google.com –> google.de, same for Amazon), therefore .de is overrepresented, and therefore European sites are overrepresented.
.com has grown to 60%
.edu has dropped to 1%

HITS: the other link analysis system.
Neighbourhood graph: take the query results, then add every page linking to them and every page they link to. From this you compute an authority score (good content, shown by inlinks: good if many inlinks, better if high-authority inlinks) and a hub score (how good are the links? a good hub has many outlinks, an even better hub links to many high-authority pages). Recursive: repeated until hub and authority scores converge.
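The mutual recursion she describes converges if you normalize each round. A minimal sketch (the normalization choice and toy graph format are my assumptions):

```python
def hits(graph, iters=50):
    """graph: {page: [pages it links to]} — the neighbourhood graph.
    Returns (hub, authority) score dicts."""
    pages = set(graph) | {q for outs in graph.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # authority score: sum of hub scores of the pages linking IN
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # hub score: sum of authority scores of the pages linked OUT to
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth
```

Note the contrast with PageRank: this runs over a small per-query graph at search time, which is exactly the cost she points out next.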

PageRank is better because it’s query-independent and hard to spam. HITS requires you to compute the neighbourhood graph on the fly for each separate search, and it’s easy to spam. However, an advantage is that it finds hubs (good directory pages), which can be really useful.

Improvements:

  • weight the outlinks
  • break documents into blocks according to content rather than assign the whole page the same PageRank
  • unification of HITS and PageRank to some extent, but nothing really dramatic
Other applications of hyperlink analysis:

  • Crawling
  • figuring out whether a newspaper is local, national, or international – they did this by looking at links to the newspaper from universities
  • Current work: news search based on closed captions from television (typed in for deaf viewers).
    The goal would be to have Google find info for you in the background that matches what you’re watching on TV.
    Problems: time lag in typing, and lots of meta-information in broadcast news (“now we want to go back to Mike”, which you don’t want to search on). Also it had to be real time.
    Anyway, lots of interesting problems and (partial) solutions.

    New research: query phrase search.

