Information Retrieval in Hypertext

Home | Hypertext | IR Issues | Auto Link Generation | My System | Bibliography

Navigation or browsing is effective for small hypertext systems. For large hypertext databases, information retrieval (IR) through queries becomes crucial, although some well-structured hypertext systems, such as Victorian Web, can be navigated smoothly even without the help of information retrieval. However, Information retrieval systems serve the purpose of finding data items that are relevant to the users query request. The World Wide Web, as the largest hypertext system, is a tool that has become very popular as a means to easily access information from other sites. It is almost impossible to explore such a huge collection of various hypertext documents. Thus information retrieval plays an extremely critical role in hypertext systems. Of course, conversely, I argue here hyperlinks can greatly reinforce the usage of information retrieval systems.

Conklin had suggested that search and query mechanisms can present information at a manageable level of complexity and detail [Conklin, 1987]. Halasz's view was that "navigational access itself is not sufficient. Effective access to information stored in a hypermedia network requires query-based access to complement and query needs to be elevated to a primary access mechanism on par with navigation." [Halasz, 1988].

Information retrieval is a large research area, mostly concerned with finding information in textual material [Bärtschi85], [Salton89]. The simplest form of information retrieval is the full text search, which finds occurrences of words or phrases specified by the user, combined by boolean operators and weighting of words based on their statistical properties. When a hyperdocument is simply regarded as a text database (ignoring the links) this type of information retrieval is the same as for other textual databases, like dictionaries, encyclopedia, on-line library catalogs, etc.

Finding information is a three-step process:

  1. finding documents: in a centralized hypertext this is no problem; but on the Internet (World Wide Web) it may be difficult to get access to all potentially interesting documents (because there are millions of them, distributed over the whole world).
  2. formulating queries: the user needs to express exactly what kind of information (s)he is looking for.
  3. determining relevance: the system must determine whether a document contains the information the user is looking for.

Traditional information retrieval research and development has concentrated on the second and third step. The distributed nature of the Internet, as well as the size of large hypertexts on CD-ROM, requires shifting the focus towards the first step.

When a text database is large, but centralized, special indexing mechanisms can be employed to speed up the search. For relatively static documents, a popular indexing mechanism is the use of inverted files. Recently a new indexing mechanism, called Glimpse is becoming popular, because it requires much less space overhead than inverted files and other indexing techniques.

Bruza [Bruza-90] proposed a two-level hypertext architecture for hyperdocuments, containing a hyperindex used for information retrieval. First the index term describing the required information would be searched, followed by a "beam down" operation to the hyperdocument itself, to evaluate the selected nodes from the hyperdocument. Bruza proposed measures to determine the effectiveness of index expressions in the hyperindex.

The result of a search may be either a pointer to the first match found, or a scored list of matches. Information retrieval is inherently uncertain: a very general query (like asking for one keyword) may yield too many answers, while a very specific query may result in no answers at all.

Structural querying is what distinguished information retrieval in hypertext from that in ordinary text databases. Beery and Kornatzky [BK90] have suggested a logical query language that allows a combination of structural and content-based queries. The logic is a combination of propositional calculus (without predicates or variables) and quantifiers such as many, most, at least m, exactly n, etc. Attribute/value pairs are used to denote content-properties. Another attempt to develop structural querying facilities is the GraphLog language by Consens and Mendelzon [CM89]. GraphLog is a visual language, based on pattern matching in the graph-structure of the hyperdocument.

Information retrieval in distributed hypertexts is inherently more complicated. Global queries are no longer possible. All search activities have to be done by means of automated browsing. The so-called fish-search is an example of a search tool using this technique. In case links carry information, like attribute/value pairs, that can be useful in determining whether or not to follow links, this information can be used to significantly reduce the search space, as explained in [FS90].