
How to use HTML Agility Pack?

Last Updated : 24 Jan, 2024

Web scraping is a common task in programming, where developers need to extract data from websites for various purposes. Manually parsing HTML can be a tedious and error-prone process. The HTML Agility Pack provides a convenient solution for C# developers to parse and manipulate HTML documents easily.

HTML Agility Pack

The HTML Agility Pack is a .NET library that enables developers to parse and manipulate HTML documents flexibly and efficiently. It allows you to navigate the HTML structure, extract data, and modify content effortlessly. In this article, we will explore the installation process, syntax, and approach for using the HTML Agility Pack in C#.

Steps to install the HTML Agility Pack

Before using the HTML Agility Pack, you need to install the NuGet package. The steps for installation are given below:

  • Open your Visual Studio project, and right-click on the project in Solution Explorer.
  • Select “Manage NuGet Packages.” Search for “HtmlAgilityPack” and install it.
  • Once installed, you can start using the HTML Agility Pack in your C# code.

Example: Here is a basic example demonstrating how to load an HTML document and extract information using the HTML Agility Pack.

C#




using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Load the HTML document from a local file
        var htmlDocument = new HtmlDocument();
        htmlDocument.Load("example.html");

        // Select nodes using XPath
        var nodes =
          htmlDocument.DocumentNode.SelectNodes("//div[@class='example']");

        // SelectNodes returns null when nothing matches
        if (nodes == null)
        {
            Console.WriteLine("No matching nodes found.");
            return;
        }

        // Extract and display data
        foreach (var node in nodes)
        {
            Console.WriteLine(node.InnerHtml);
        }
    }
}


HTML Agility Pack Library Features

HTML Parser

An HTML parser is a tool or library that allows developers to parse HTML documents, breaking down the markup into a structured format that can be easily navigated and manipulated programmatically. Parsing is crucial when working with web scraping or any application that requires extracting information from HTML. The HTML Agility Pack, in the context of C#, is a powerful HTML parser that facilitates the extraction and manipulation of data from HTML documents.

Method/Property | Description
HtmlDocument() | Creates an object that represents an HTML document and provides methods for loading and manipulating HTML content.
HtmlDocument.Load() | Loads HTML content from a file path, stream, or TextReader into the document. To fetch a page by URL, use HtmlWeb.Load() instead.
HtmlDocument.LoadHtml() | Parses an HTML string and loads it into the document.
HtmlDocument.DocumentNode | Represents the root node of the HTML document.
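
When the markup is already in memory, it can be parsed directly from a string. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is installed; the class name ParseFromString, the helper GetTitle, and the sample markup are hypothetical and exist only for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class ParseFromString
{
    // Parses an HTML string and returns the text of its <title>,
    // or null when the document has no <title> element.
    public static string GetTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.SelectSingleNode("//title")?.InnerText;
    }

    public static void Main()
    {
        // Hypothetical markup, used only for illustration.
        const string html =
            "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>";

        Console.WriteLine(GetTitle(html) ?? "(no title)");
    }
}
```

Parsing from a string keeps the example independent of the file system and the network, which also makes this kind of code easy to unit test.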

Example: In this example, the HTML Agility Pack loads HTML content from the specified URL into a document object and extracts the title using an XPath selector. For more detailed information and options, see the HTML Agility Pack documentation.

// Download the page and parse it into an HtmlDocument.
// HtmlDocument.Load() reads local files and streams; HtmlWeb handles URLs.
var web = new HtmlWeb();
var htmlDocument = web.Load("https://html-agility-pack.net/parser");

// Accessing the title of the HTML document (null if the page has no <title>)
var title = htmlDocument.DocumentNode.SelectSingleNode("//title")?.InnerText;

// Output the title
Console.WriteLine("Page Title: " + title);

HTML Selectors

HTML selectors are patterns or expressions used to target specific elements within an HTML document, whether for extraction, modification, or traversal. HTML Agility Pack selects nodes with XPath expressions, and its node collections can also be queried with LINQ. CSS selectors are not built into the core library, but add-on packages provide them. XPath expressions allow precise targeting of elements based on their position and attributes, while CSS selectors offer a more familiar syntax for those accustomed to styling web pages.

Selector/Method | Description
SelectSingleNode() | Selects the first HTML node that matches the XPath expression, or returns null if none matches.
SelectNodes() | Selects all HTML nodes that match the XPath expression, or returns null if none matches.
Descendants() | Gets all descendant nodes of the current node.
ParentNode | Gets the parent node of the current node.
ChildNodes | Gets the child nodes of the current node.
Attributes | Gets the attributes of the current node.
InnerHtml | Gets or sets the HTML between the opening and closing tags of the current node.
OuterHtml | Gets the HTML of the current node, including its opening and closing tags.
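
As an alternative to XPath, the same kind of selection can be written with Descendants() and LINQ. This is a sketch under stated assumptions rather than a recommended pattern: the class LinkLister, the helper GetLinks, and the sample markup are hypothetical names introduced here.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class LinkLister
{
    // Returns the href of every <a> whose class attribute is exactly "my-link".
    public static string[] GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("a")                                          // all <a> elements in the tree
            .Where(a => a.GetAttributeValue("class", "") == "my-link") // "" is the fallback when class is absent
            .Select(a => a.GetAttributeValue("href", ""))              // "" is the fallback when href is absent
            .ToArray();
    }

    public static void Main()
    {
        // Hypothetical markup, used only for illustration.
        const string html =
            "<a class='my-link' href='/one'>One</a>" +
            "<a href='/two'>Two</a>" +
            "<a class='my-link' href='/three'>Three</a>";

        foreach (var href in GetLinks(html))
        {
            Console.WriteLine("Link: " + href);
        }
    }
}
```

Unlike SelectNodes(), Descendants() yields an empty sequence when nothing matches, so no null check is needed before iterating.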

Example using HTML Agility Pack (XPath):

In this example, HTML Agility Pack is employed with XPath to select anchor (<a>) elements with a specific class attribute.

// Selecting all hyperlinks with a specific class attribute
var links = htmlDocument.DocumentNode.SelectNodes("//a[@class='my-link']");

// Iterating through selected links and outputting their href attributes
if (links != null)
{
    foreach (var link in links)
    {
        // GetAttributeValue returns the fallback ("") when href is missing
        Console.WriteLine("Link: " + link.GetAttributeValue("href", ""));
    }
}

HTML Manipulation

HTML manipulation involves making changes to the content and structure of an HTML document. This can include modifying attributes, updating text content, adding or removing elements, and more. HTML Agility Pack provides a set of methods and properties that make it easy to manipulate HTML content programmatically. This feature is particularly useful when scraping data from websites, as it allows developers to tailor the extracted information to their specific needs.

Method/Property | Description
Remove() | Removes the current node from its parent.
RemoveAll() | Removes all child nodes and attributes of the current node.
ReplaceChild() | Called on a parent node; replaces one of its child nodes with another node.
AppendChild() | Appends a new child node to the current node.
PrependChild() | Prepends a new child node to the current node.
SetAttributeValue() | Sets the value of an attribute for the current node.
HtmlNode.CreateNode() | Creates a node from an HTML string; the result can be passed to AppendChild() or PrependChild() to add markup.
InnerHtml | Gets or sets the HTML inside the current node; assigning to it replaces the node's existing content.
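
Several of these members can be combined in one pass over a node. The sketch below assumes a document containing a div with the id "box"; the class NodeEditor, the helper Update, and the markup are hypothetical and serve only to illustrate the calls.

```csharp
using System;
using HtmlAgilityPack;

public static class NodeEditor
{
    // Adds a class to the div with id "box", removes its <span> child (if any),
    // appends a new paragraph, and returns the resulting markup.
    public static string Update(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var box = doc.DocumentNode.SelectSingleNode("//div[@id='box']");
        if (box == null)
        {
            return doc.DocumentNode.OuterHtml; // nothing to change
        }

        box.SetAttributeValue("class", "updated");                     // add or overwrite an attribute
        box.SelectSingleNode("./span")?.Remove();                      // detach a child node
        box.AppendChild(HtmlNode.CreateNode("<p>New paragraph</p>"));  // build a node from markup and append it

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        // Hypothetical markup, used only for illustration.
        Console.WriteLine(Update("<div id='box'><p>Old paragraph</p><span>temporary</span></div>"));
    }
}
```

Changes are made to the in-memory tree only; to persist them, write the document back out, for example with HtmlDocument.Save().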

Example using HTML Agility Pack:

In this example, HTML Agility Pack is used to select all paragraph (<p>) elements and update their inner HTML content. For comprehensive details on manipulation options, see the HTML Agility Pack documentation.

// Modifying the text content of all paragraph elements
var paragraphs = htmlDocument.DocumentNode.SelectNodes("//p");
if (paragraphs != null)
{
    foreach (var paragraph in paragraphs)
    {
        paragraph.InnerHtml = "This is a modified paragraph.";
    }
}

HTML Traversing

HTML traversing is the process of navigating through the hierarchical structure of an HTML document. It involves moving between parent and child nodes, siblings, and ancestors, providing a comprehensive way to explore the relationships between elements. Traversing is essential for tasks such as iterating through a list of elements, locating related content, or understanding the document’s overall structure. HTML Agility Pack offers a range of traversal methods, making it efficient to navigate HTML documents in C#.

Method/Property | Description
Ancestors() | Gets all ancestor nodes of the current node.
NextSibling | Gets the next sibling node of the current node.
PreviousSibling | Gets the previous sibling node of the current node.
following-sibling:: (XPath axis) | Used in an XPath expression passed to SelectNodes() to get the siblings that come after the current node.
preceding-sibling:: (XPath axis) | Used in an XPath expression passed to SelectNodes() to get the siblings that come before the current node.
XPath | Gets the XPath expression that locates the current node within the document.
CssSelect() | Selects nodes using a CSS selector. This is an extension method supplied by an add-on package (for example, ScrapySharp), not by the core library.
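
Sibling traversal has one practical wrinkle: NextSibling also returns text nodes, such as the whitespace between tags. A minimal sketch of walking forward through element siblings only follows; the class SiblingWalk, the helper FollowingElementTexts, and the markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SiblingWalk
{
    // Returns the inner text of every element sibling that follows
    // the first node matched by the given XPath expression.
    public static List<string> FollowingElementTexts(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var texts = new List<string>();
        var start = doc.DocumentNode.SelectSingleNode(xpath);

        for (var node = start?.NextSibling; node != null; node = node.NextSibling)
        {
            // Skip text and comment nodes; keep elements only.
            if (node.NodeType == HtmlNodeType.Element)
            {
                texts.Add(node.InnerText);
            }
        }
        return texts;
    }

    public static void Main()
    {
        // Hypothetical markup, used only for illustration.
        const string html =
            "<section><h2 id='first'>A</h2> <h2>B</h2> <h2>C</h2></section>";

        foreach (var text in FollowingElementTexts(html, "//h2[@id='first']"))
        {
            Console.WriteLine(text); // prints B, then C
        }
    }
}
```

The same result could be obtained with the XPath following-sibling axis; the loop form is shown because it makes the node-type filtering explicit.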

Example:

In this example, HTML Agility Pack is used to locate a specific div element by its ID and navigate to its parent element, retrieving and outputting the parent’s HTML content. The HTML Agility Pack documentation offers comprehensive information on traversal methods and techniques.

// Navigating through the structure to find the parent of a specific element
var specificElement = htmlDocument.DocumentNode.SelectSingleNode("//div[@id='specific-element']");
var parentElement = specificElement?.ParentNode;

// Outputting the HTML content of the parent element
Console.WriteLine("Parent Element HTML: " + parentElement?.InnerHtml);

Example: Suppose we want to extract the titles of articles from a sample HTML document (sample.html) with the structure shown below. The C# script that follows it loads the file and displays the title of each article.

HTML




<!DOCTYPE html>
<html>
  
<body>
    <div class="article">
        <h2>Title 1</h2>
        <p>Content of Article 1</p>
    </div>
    <div class="article">
        <h2>Title 2</h2>
        <p>Content of Article 2</p>
    </div>
</body>
  
</html>


C#




#r "nuget: HtmlAgilityPack, 1.11.57"
  
using System;
using HtmlAgilityPack;
  
// Load HTML document from a file
var htmlDocument = new HtmlDocument();
htmlDocument.Load("sample.html");
  
// Select article titles using XPath
var titles = 
      htmlDocument.DocumentNode.SelectNodes("//div[@class='article']/h2");
  
// Display article titles
if (titles != null)
{
    foreach (var title in titles)
    {
        Console.WriteLine(title.InnerHtml);
    }
}


Explanation:

  • Reference Library: The script references the HTML Agility Pack library, a tool for parsing HTML content.
  • Include Necessary Namespaces: The necessary namespaces, System and HtmlAgilityPack, are included to use the features of the referenced library.
  • Create HTML Document Object: An instance of HtmlDocument is created to represent and manipulate the HTML content.
  • Load HTML Content: The script loads HTML content from a file named “sample.html” into the HtmlDocument object.
  • XPath Selection: Using XPath, the script selects specific HTML elements (in this case, <h2> titles) within the loaded HTML document.
  • Display Titles: If titles are found, the script iterates through them and prints their inner HTML content to the console.

Now, run the script with the help of the following command:

dotnet script HtmlAgilityPackExample.cs

Output:

Title 1
Title 2

Advantages & Disadvantages of using HTML Agility Pack

Advantages

  • Flexibility: HTML Agility Pack provides a flexible way to navigate and manipulate HTML documents, making it suitable for various scraping tasks.
  • Support for Invalid HTML: It can handle poorly formatted or invalid HTML, making it resilient in real-world scenarios where websites may not follow standard HTML practices.
  • XPath Support: The ability to use XPath expressions simplifies the process of selecting and extracting specific elements from HTML documents.
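
The tolerance for invalid HTML mentioned above can be observed directly. The following sketch feeds the parser deliberately broken markup and lists whatever problems it recorded in ParseErrors; the class LenientParsing, the helper Parse, and the markup are hypothetical, and the exact errors reported may vary between library versions.

```csharp
using System;
using HtmlAgilityPack;

public static class LenientParsing
{
    // Parses markup without validating it first and returns the document.
    public static HtmlDocument Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    public static void Main()
    {
        // Deliberately malformed: unclosed <p>, <b>, and <div> tags.
        var doc = Parse("<div><p>First<p>Second <b>unclosed");

        // The text is still reachable even though the markup is invalid.
        Console.WriteLine(doc.DocumentNode.InnerText);

        // Problems the parser noticed are recorded rather than thrown.
        foreach (var error in doc.ParseErrors)
        {
            Console.WriteLine($"{error.Code} at line {error.Line}: {error.Reason}");
        }
    }
}
```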

Disadvantages

  • Learning Curve: For beginners, there might be a learning curve associated with understanding XPath expressions and the HTML Agility Pack’s API.
  • Performance: The HTML Agility Pack builds the entire document tree in memory, so extremely large HTML documents or complex scraping tasks may impact performance.

