Two approaches to supporting information retrieval
in distributed hypertexts have been used:
- By building (and periodically updating) an index
database for the whole hyperdocument, the first part of a query (finding
candidate documents to be searched) can be supported. The database
can deliver addresses (URLs) of nodes that satisfy certain conditions,
such as containing a given word in their title or header.
- Searching can be done by navigation, meaning that nodes are retrieved
by following links, and are scanned for the required information. From
the links embedded in these nodes, new nodes to be retrieved are chosen
and the links leading to them are followed. Since this search mechanism
consumes both time and network resources, a clever selection algorithm
and a good starting point are important.
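The second approach can be sketched as a simple breadth-first traversal. The sketch below assumes a `fetch` function that returns a node's text and its embedded links for a given URL; the `max_nodes` bound reflects the time and network cost noted above.

```python
from collections import deque

def navigational_search(fetch, start_url, query, max_nodes=100):
    """Navigational search: retrieve nodes by following links, scan
    each for the query term, and choose new nodes to retrieve from the
    links embedded in the nodes seen so far (breadth-first here).
    `fetch` is assumed to map a URL to (text, links)."""
    seen = {start_url}
    frontier = deque([start_url])
    hits = []
    while frontier and len(seen) <= max_nodes:
        url = frontier.popleft()
        text, links = fetch(url)          # one network retrieval per node
        if query in text:
            hits.append(url)
        for link in links:
            if link not in seen:          # never fetch the same node twice
                seen.add(link)
                frontier.append(link)
    return hits
```

Breadth-first order is just one possible selection strategy; as the text notes, a cleverer ordering of the frontier matters when the node budget is small.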
Either way, for a distributed hyperdocument as large and as loosely connected
as the World Wide Web, the answers to queries will
most likely be incomplete. An index database will probably not contain the
information of all nodes, because the navigation algorithm used to build it
cannot be certain to locate all the nodes: parts of the Web may be
disconnected, and some nodes may be hidden behind "clickable images" or
forms. A navigational search
will likewise be incomplete, both because it does not have the time to scan
the whole hyperdocument and because some documents
may not be reachable by navigation at all. A reasonable compromise is to
start a navigational search from the answer given by a very large index database.
For the World Wide Web, index databases such as AltaVista
exist, while a navigational search algorithm, called the fish-search,
is available from the Eindhoven University of Technology.
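The compromise can be sketched as follows. The sketch is illustrative of a fish-search-like strategy rather than the published algorithm: the hypothetical `index_lookup` stands in for a query against a large index database and supplies the seed URLs, and a simple depth budget concentrates the navigation near relevant nodes.

```python
def seeded_search(index_lookup, fetch, query, depth=2, max_nodes=50):
    """Start a navigational search from index-database answers.
    Children of a relevant node get a fresh depth budget; children of
    an irrelevant node get a reduced one, so effort stays near
    relevant regions (a fish-search-like heuristic, simplified)."""
    seeds = index_lookup(query)               # candidate starting points
    seen = set(seeds)
    frontier = [(url, depth) for url in seeds]
    hits = []
    while frontier and len(seen) <= max_nodes:
        url, budget = frontier.pop(0)
        text, links = fetch(url)
        relevant = query in text
        if relevant:
            hits.append(url)
        child_depth = depth if relevant else budget - 1
        if child_depth > 0:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, child_depth))
    return hits
```

A good starting point from the index means the depth budget is spent in promising neighbourhoods instead of on a blind walk from an arbitrary node.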
As distributed hypertexts are usually read much more
frequently than they are written, their performance benefits greatly from
replication. Just as a cache memory is used between a CPU and main memory,
and between main memory and disk, a cache between a local hypertext browser
and the actual (remote parts of the) hyperdocument can be used to improve
performance and reduce the network traffic caused by searching for
information in a distributed hypertext.
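A minimal sketch of such a cache is a read-through store keyed by URL: a node fetched once is served locally on later requests until its entry expires. The class name and the `max_age` expiry policy are illustrative assumptions; real Web caches also honour server-supplied expiry information.

```python
import time

class HypertextCache:
    """Read-through cache between a local browser and remote nodes
    (a sketch; entries simply expire after `max_age` seconds)."""

    def __init__(self, fetch, max_age=300.0):
        self._fetch = fetch      # function mapping a URL to document text
        self._store = {}         # URL -> (time fetched, text)
        self.max_age = max_age

    def get(self, url):
        entry = self._store.get(url)
        if entry is not None and time.time() - entry[0] < self.max_age:
            return entry[1]      # served locally: no network traffic
        text = self._fetch(url)  # cache miss: retrieve the remote node
        self._store[url] = (time.time(), text)
        return text
```

Because a navigational search revisits neighbourhoods of the hyperdocument, even this simple cache removes many repeated retrievals of the same nodes.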