To parse HTML in Go, a good starting point is the html package from the golang.org/x/net module (import path golang.org/x/net/html). This package provides functions for parsing HTML documents into a tree of nodes and for manipulating that parse tree.
Here is an example of how to parse HTML using the golang.org/x/net/html package:
package main

import (
	"fmt"
	"log"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	// Example HTML data
	rawHTML := `
<!DOCTYPE html>
<html>
<head>
<title>Test Page</title>
</head>
<body>
<h1>Hello, World!</h1>
<p>This is a sample paragraph.</p>
<!-- This is a comment -->
<a href="http://example.com">Visit Example.com</a>
</body>
</html>
`
	// Parse the HTML
	doc, err := html.Parse(strings.NewReader(rawHTML))
	if err != nil {
		log.Fatal(err)
	}

	// Function to recursively traverse the HTML node tree
	var traverse func(*html.Node)
	traverse = func(n *html.Node) {
		if n.Type == html.ElementNode {
			fmt.Println(n.Data) // Print the name of the HTML element
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			traverse(c)
		}
	}

	// Traverse the HTML document
	traverse(doc)
}
Here's how the html package works in the above code:
- The HTML content is parsed by html.Parse, which takes an io.Reader as input. In this case, we're using strings.NewReader to convert a string of raw HTML into a reader.
- The traverse function is a simple recursive function that prints out the name of each HTML element.
- The recursion starts with traverse(doc), where doc is the root of the document tree.
If you're working with HTML from the internet, you might want to fetch the HTML using the net/http package and then parse it:
resp, err := http.Get("http://example.com")
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()

if resp.StatusCode != http.StatusOK {
	log.Fatalf("Error: status code %d", resp.StatusCode)
}

doc, err := html.Parse(resp.Body)
if err != nil {
	log.Fatal(err)
}
// Traverse and manipulate `doc` as needed
When working with the parsed HTML, you can use the various fields and types provided by the html package to inspect and manipulate the document. For example, html.Node has fields like Type, Data, Attr, FirstChild, and NextSibling which can be used to navigate and process the HTML tree.
As you traverse the tree, check each node's Type field so that the different kinds of node (ElementNode, TextNode, CommentNode, and so on) are handled appropriately. For example, Data holds the tag name for an element but the text itself for a text or comment node.
To install the golang.org/x/net/html package, you can use the following command:
go get -u golang.org/x/net/html
This will fetch the package and its dependencies, allowing you to import it into your Go project.