
What is Web Content Mining?

Last Updated : 30 Nov, 2022

Pre-requisites: Web Mining

Web Content Mining is one of the three types of techniques in Web Mining; the other two are Web Structure Mining and Web Usage Mining. This article discusses Web Content Mining only. The mining, extraction, and integration of useful data, information, and knowledge from Web page content is known as Web Content Mining.

It describes the discovery of useful information from web content. In simple words, it is the application of web mining that extracts relevant or useful information from the Web. Web content mining is related to, but different from, other mining techniques such as data mining and text mining. Because web data is heterogeneous and lacks structure, automated discovery of new knowledge patterns can be challenging.

Web data is generally semi-structured or unstructured, while data mining is primarily concerned with structured data. Web content mining scans and mines text, images, and groups of web pages according to the content of the input query, and the results are displayed as a list by the search engine.

For example, if a user searches for a particular song, the search engine displays results and suggestions relevant to it.

Web content mining deals with different kinds of data such as text, audio, video, image, etc.

Unstructured Web Data Mining

Unstructured data includes data such as audio, video, etc. Web content mining converts this unstructured data into structured data, i.e., into useful information. The conversion process is as follows:

[Figure: Web Content Mining]

Unstructured Documents Feature Extraction:

1. Bag of words to represent unstructured documents

  • Takes a single word as a feature.
  • It ignores the sequence or order in which words occur.

2. Features could be:

  • Boolean: whether or not the word occurs in the document.
  • Frequency-based: the number of times the word occurs in the document.

3. Variations of feature selection include:

  • Case normalization and removal of punctuation, infrequent words, stop words, etc.

4. Features can be reduced using different feature selection techniques:

  • Information gain: measures the difference between two probability distributions.
  • Stemming: reduces words to their morphological roots.
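
The feature extraction steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the stop-word list `STOP_WORDS`, the helper names `tokenize` and `bag_of_words`, and the two sample documents are all assumptions made for the example. Stemming and information gain are left out to keep the sketch short.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real systems use larger lists.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text, drop punctuation, and remove stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def bag_of_words(docs):
    """Return (vocabulary, frequency vectors, boolean vectors) for docs.

    Word order is ignored: each document becomes one vector with one
    entry per vocabulary word.
    """
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    freq = [[c[w] for w in vocab] for c in counts]
    boolean = [[int(c[w] > 0) for w in vocab] for c in counts]
    return vocab, freq, boolean

# Made-up sample documents.
docs = ["The song is a hit song.", "A hit movie and a song."]
vocab, freq, boolean = bag_of_words(docs)
print(vocab)    # vocabulary shared by all documents
print(freq)     # frequency-based features
print(boolean)  # Boolean (presence/absence) features
```

Each document is reduced to a fixed-length vector over the shared vocabulary, which is the structured form that later steps such as clustering or classification can work with.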

Mining Techniques Using Agents and Databases:

1. Agent-Based Approaches:

  • Intelligent search: the search is driven by a particular goal of the user and returns results based on that goal.
  • Information filtering/categorization: filters the data, i.e., removes unwanted or redundant information using AI-based methods. Examples include HyPursuit and BO (Bookmark Organizer).
  • Sophisticated AI systems that act in place of the user, in an automated or partly manual manner. One such approach is deep learning, in which the system is trained by feeding it certain kinds of data.

2. Database Approaches:

Database approaches transform unstructured data into a more structured, high-level collection of resources, such as a relational database, and use standard database querying mechanisms and data mining techniques to access and analyze this information.

  • Multilevel Databases:
    • Lowest level: semi-structured information is kept.
    • Higher levels: generalizations from the lower levels, organized into relations and objects.
  • Web Query Systems:
    • Web query systems use query languages such as SQL, along with natural language processing, to extract data.
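
As a rough sketch of the database approach, the example below uses Python's built-in `sqlite3` module. The table names `raw_page` and `page_tag`, the helper `titles_with_tag`, and the sample records are all hypothetical. The lowest level keeps each semi-structured record unchanged, and a higher level generalizes it into a relation that plain SQL can query.

```python
import json
import sqlite3

# Hypothetical semi-structured records, as might be extracted from web pages.
raw_pages = [
    {"url": "http://example.com/a", "title": "Song A", "tags": ["music", "pop"]},
    {"url": "http://example.com/b", "title": "Film B", "tags": ["movie"]},
    {"url": "http://example.com/c", "title": "Song C"},  # no "tags" field
]

conn = sqlite3.connect(":memory:")
# Lowest level: keep the semi-structured record as-is (stored as JSON text).
conn.execute("CREATE TABLE raw_page (url TEXT PRIMARY KEY, doc TEXT)")
# Higher level: a generalized relation derived from the raw records.
conn.execute("CREATE TABLE page_tag (url TEXT, title TEXT, tag TEXT)")

for page in raw_pages:
    conn.execute("INSERT INTO raw_page VALUES (?, ?)",
                 (page["url"], json.dumps(page)))
    # Records without tags are generalized to a single "untagged" row.
    for tag in page.get("tags", ["untagged"]):
        conn.execute("INSERT INTO page_tag VALUES (?, ?, ?)",
                     (page["url"], page["title"], tag))

def titles_with_tag(tag):
    """Query the higher-level relation with standard SQL."""
    rows = conn.execute(
        "SELECT title FROM page_tag WHERE tag = ? ORDER BY title", (tag,))
    return [row[0] for row in rows]

print(titles_with_tag("music"))
```

Keeping the raw records alongside the derived relation means the higher level can be rebuilt if the generalization rules change.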

[Figure: Web Content Mining Categorization]

Web Content Mining Techniques:

  1. Pre-processing 
  2. Clustering
  3. Classifying
  4. Identifying the associations
  5. Topic identification, tracking, and drift analysis

Applications of Web Content Mining:

  1.  Classifying web documents into categories.
  2.  Identifying the topics of web documents.
  3.  Finding similar web pages across different web servers.
  4.  Applications related to relevance.
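
One common way to approach the "similar web pages" application is to compare word-frequency vectors with cosine similarity. The sketch below assumes that technique; the article does not prescribe it. The helper names `term_vector`, `cosine_similarity`, and `most_similar`, and the sample page texts, are made up for illustration.

```python
import math
from collections import Counter

def term_vector(text):
    """Word-frequency vector for a page, using naive whitespace tokenization."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(query, pages):
    """Return the key of the page whose text is closest to the query."""
    q = term_vector(query)
    return max(pages, key=lambda name: cosine_similarity(q, term_vector(pages[name])))

# Hypothetical page texts keyed by a made-up page name.
pages = {
    "p1": "pop song lyrics and chords",
    "p2": "movie review and ratings",
    "p3": "classical music concert schedule",
}
print(most_similar("pop song lyrics", pages))
```

The same similarity score can also be used to rank pages by relevance to a query rather than returning only the best match.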
