Web Content Mining


Web content mining describes the automatic search of information resources available online [Madria 1999], and involves mining the content of Web data. Unlike Web usage mining or Web structure mining, Web content mining emphasizes the content of Web pages rather than the links between them.

Within the Web mining domain, Web content mining is essentially an analog of data mining techniques for relational databases, since similar types of knowledge can be found in the unstructured data residing in Web documents. A Web document usually contains several types of data, such as text, images, audio, video, metadata and hyperlinks. Some of this data is semi-structured, such as HTML documents, or more structured, such as the data in tables or database-generated HTML pages, but most of it is unstructured text. The unstructured nature of Web data forces Web content mining towards more complicated approaches.
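As a rough illustration of the data types listed above, the following sketch separates one HTML page into text, hyperlinks, media references and metadata using Python's standard html.parser module. The class name ContentSplitter, the helper split_content and the sample page are hypothetical names introduced for this example only; it is a minimal sketch, not a complete content extractor.

```python
from html.parser import HTMLParser


class ContentSplitter(HTMLParser):
    """Hypothetical helper: sorts the parts of one HTML page by data type."""

    def __init__(self):
        super().__init__()
        self.text = []       # unstructured text fragments
        self.links = []      # hyperlink targets (href values)
        self.media = []      # (tag, src) pairs for img/audio/video
        self.metadata = {}   # <meta name=... content=...> pairs
        self._skip = 0       # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in ("img", "audio", "video") and attrs.get("src"):
            self.media.append((tag, attrs["src"]))
        elif tag == "meta" and attrs.get("name"):
            self.metadata[attrs["name"]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep visible text only; script and style bodies are ignored.
        if not self._skip and data.strip():
            self.text.append(data.strip())


def split_content(html):
    """Parse an HTML string and return the populated ContentSplitter."""
    parser = ContentSplitter()
    parser.feed(html)
    parser.close()
    return parser


# Made-up sample page, used only to exercise the sketch.
page = (
    '<html><head><title>Demo</title>'
    '<meta name="keywords" content="web, mining"></head>'
    '<body><p>Hello <a href="/about.html">about</a></p>'
    '<img src="logo.png"><script>var x = 1;</script></body></html>'
)
result = split_content(page)
print(result.text, result.links, result.media, result.metadata)
```

Only the text list would normally be passed on to text mining; the links belong to Web structure mining, and the media references to multimedia mining.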

Web content mining can be viewed from two different perspectives: the information retrieval view and the database view. R. Kosala et al. [Kosala] summarized the research done on unstructured and semi-structured data from the information retrieval view. The summary shows that most of this research uses a bag of words, which is based on statistics about single words in isolation, to represent unstructured text, and takes the single words found in the training corpus as features. For semi-structured data, all the works utilize the HTML structure inside the documents, and some also utilize the hyperlink structure between documents for document representation. In the database view, the goal is better information management and querying on the Web, so the mining always tries to infer the structure of a Web site or to transform a Web site into a database.
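The bag-of-words representation mentioned above can be sketched in a few lines of Python. The function names, the simple regular-expression tokenizer and the two-document corpus are assumptions made for this illustration; the surveyed systems use their own tokenization and feature weighting.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into single-word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(corpus):
    """Every distinct word in the training corpus becomes one feature."""
    words = sorted({word for doc in corpus for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def bag_of_words(text, vocabulary):
    """Count each vocabulary word in isolation; word order is discarded."""
    counts = Counter(tokenize(text))
    # Dicts keep insertion order, so positions match the vocabulary indices.
    return [counts.get(word, 0) for word in vocabulary]


# Made-up training corpus, used only to exercise the sketch.
corpus = [
    "Web content mining mines web page content",
    "Web usage mining mines access logs",
]
vocabulary = build_vocabulary(corpus)
vector = bag_of_words(corpus[0], vocabulary)
print(list(vocabulary))  # ['access', 'content', 'logs', 'mines', 'mining', 'page', 'usage', 'web']
print(vector)            # [0, 2, 0, 1, 1, 1, 0, 2]
```

Because each word is counted independently, two documents with the same words in a different order get the same vector, which is the limitation implied by "single words in isolation".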

S. Chakrabarti [Chakrabarti] provides an in-depth survey of research on applying techniques from machine learning, statistical pattern recognition, and data mining to the analysis of hypertext. It is a good resource for keeping up with recent advances in content mining research.

Multimedia data mining is a part of content mining that aims to mine high-level information and knowledge from large online multimedia sources. Multimedia data mining on the Web has recently attracted the attention of many researchers. Working towards a unifying framework for representation, problem solving, and learning from multimedia is a real challenge; this research area is still in its infancy, and much work remains to be done. For details about multimedia mining, refer to [Kosala, Zaiane] for related resources.


Created by Lan Man

Last Modified: Nov 11, 2002