The Html Agility Pack (HAP) is a .NET library for parsing HTML documents, commonly used for web scraping. It provides a flexible API for reading and manipulating HTML, both at the level of the whole document and at the level of individual elements. Two of its core classes are HtmlDocument and HtmlNode, and understanding the difference between them is essential for using HAP effectively.
HtmlDocument
The HtmlDocument class represents an entire HTML document. It serves as the entry point for parsing HTML content and provides access to the document's overall structure. An instance of HtmlDocument holds the complete DOM (Document Object Model) tree, and its DocumentNode property exposes the root node from which you navigate and query the rest of the document.
When you load HTML content into an HtmlDocument, you are creating a representation of the entire web page, which can then be traversed and manipulated. The HtmlDocument class provides methods to load HTML from a string (LoadHtml) or from a file, stream, or text reader (Load). To fetch a page over HTTP, the companion HtmlWeb class downloads the content and returns it as an HtmlDocument.
Here's an example of how to load HTML into an HtmlDocument:
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml("<html><body><p>Hello, World!</p></body></html>");
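The other loading paths follow the same pattern. The sketch below is illustrative only: it uses an in-memory StringReader and MemoryStream as stand-ins for a real file or response stream, and the class and method names (LoadExamples, FromReader, FromStream) are made up for this example. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

public static class LoadExamples
{
    // Load from a TextReader (a StringReader stands in for a file reader here).
    public static HtmlDocument FromReader(string html)
    {
        var doc = new HtmlDocument();
        using (var reader = new StringReader(html))
        {
            doc.Load(reader);
        }
        return doc;
    }

    // Load from a Stream (a MemoryStream stands in for a file or network stream).
    public static HtmlDocument FromStream(string html)
    {
        var doc = new HtmlDocument();
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(html)))
        {
            doc.Load(stream, Encoding.UTF8);
        }
        return doc;
    }

    public static void Main()
    {
        const string html = "<html><body><p>Hello, World!</p></body></html>";

        Console.WriteLine(FromReader(html).DocumentNode.SelectSingleNode("//p").InnerText);
        Console.WriteLine(FromStream(html).DocumentNode.SelectSingleNode("//p").InnerText);

        // Fetching over HTTP goes through HtmlWeb instead (needs network access,
        // so it is left commented out here):
        // HtmlDocument remote = new HtmlWeb().Load("https://example.com/");
    }
}
```

Whichever method is used, the result is an ordinary HtmlDocument, so the code that queries it afterwards does not need to know where the HTML came from.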
HtmlNode
The HtmlNode class represents a single node or element within the HTML document, such as an <a> tag, a <div> block, or a text node. An HtmlNode could be an element node, a comment node, a text node, etc. Each HtmlNode can have child nodes, creating a hierarchy that mirrors the structure of the HTML document.
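To make the different node kinds concrete, the following sketch walks every descendant of a small fragment and prints each node's NodeType and Name. The sample markup and the names NodeTypes and Describe are invented for illustration; it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class NodeTypes
{
    // Returns the NodeType of every descendant, in document order.
    public static List<HtmlNodeType> Describe(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var types = new List<HtmlNodeType>();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            // NodeType distinguishes elements, comments, and text nodes.
            Console.WriteLine($"{node.NodeType}: {node.Name}");
            types.Add(node.NodeType);
        }
        return types;
    }

    public static void Main()
    {
        Describe("<div><!-- note --><a href=\"/home\">Home</a></div>");
    }
}
```

Because text and comments are nodes too, code that expects only elements usually filters on NodeType == HtmlNodeType.Element or selects by tag name.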
HtmlNode objects are used to manipulate individual elements within the document. For example, you can change an element's attributes, inner text, or even remove the element altogether. You can also use HtmlNode instances to navigate the DOM tree by accessing parent, sibling, or child nodes.
Here's an example of how to access and manipulate an HtmlNode:
// Assume htmlDoc is an already loaded HtmlDocument as shown above
HtmlAgilityPack.HtmlNode pNode = htmlDoc.DocumentNode.SelectSingleNode("//p");

// SelectSingleNode returns null when nothing matches the XPath expression
if (pNode != null)
{
    pNode.InnerHtml = "Goodbye, World!";
}
In this example, we use the SelectSingleNode method on the DocumentNode property of HtmlDocument to find the first paragraph (<p>) element in the document. We then change its inner HTML content.
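The other operations mentioned earlier (changing attributes, removing an element, and moving to parent or sibling nodes) can be sketched in the same way. The markup, the URL, and the names NodeEditing and Edit are placeholders chosen for this example; it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack;

public static class NodeEditing
{
    // Rewrites the first link's href, removes the first <span>,
    // and returns the resulting markup.
    public static string Edit(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode link = doc.DocumentNode.SelectSingleNode("//a");
        if (link != null)
        {
            // Navigate from the node to its parent and next sibling.
            Console.WriteLine($"Parent: {link.ParentNode.Name}");
            Console.WriteLine($"Next sibling: {link.NextSibling?.Name ?? "(none)"}");

            // Change an attribute value (the attribute is created if missing).
            link.SetAttributeValue("href", "https://example.com/");
        }

        // Remove an element from the tree entirely.
        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span");
        span?.Remove();

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        Console.WriteLine(Edit("<div><a href=\"/old\">Link</a><span>remove me</span></div>"));
    }
}
```

Edits are made to the in-memory tree only. To persist them, read DocumentNode.OuterHtml as above or call one of the HtmlDocument.Save overloads.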
Summary
HtmlDocument represents the entire HTML document and is the starting point for parsing and manipulating the HTML content.
HtmlNode represents a single node within the document, which could be an element, text, or comment. It is used for element-level manipulation.
Together, the HtmlDocument and HtmlNode classes provide a powerful way to navigate and edit HTML content, whether for web scraping or any other situation where you need to manipulate HTML programmatically in .NET.