Getting Started With HTML Agility Pack
Friends,
In this post we will see how to get started with HTML Agility Pack, with code samples showing how web scraping can be done with this package in C#. For readers unfamiliar with HTML Agility Pack: it is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT. In simple words, it is a .NET code library that allows you to parse "out of the web" files (be it HTML, PHP or ASPX pages).
Put simply, you can use this library to scrape web pages available on the Internet.
How to Get HTML Agility Pack in your application
You can add HTML Agility Pack to your application using NuGet. To install it in your project, run the following command in the Package Manager Console:
Install-Package HtmlAgilityPack
Read this: How to add Nuget packages in your project
After adding the reference via NuGet, import the namespace in your code file with:
using HtmlAgilityPack;
Load a Page From Internet
To load a page directly from the Web, you can use the following code:
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load("http://www.c-sharpcorner.com");
After executing these two lines of code, the entire page at http://www.c-sharpcorner.com is parsed and held in document, an object of the HtmlDocument class.
Load a Page from a Saved Document
Often we need to load an HTML document from a file saved on the hard disk. To do that, we write the following code:
HtmlDocument document2 = new HtmlDocument();
document2.Load(@"C:\Temp\sample.txt");
At this point, we have the entire HTML parsed and loaded in the document2 object.
Here is the sample HTML saved in the sample.txt file. It holds two links inside one div, one link outside all divs, and two links inside a second div:
Link 1 inside div1
Link 2 inside div1
Link 3 outside all divs
Link 1 inside div2
Link 2 inside div2
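Since only the link text of the sample is shown, here is a minimal sketch of markup that would match it. The id div2, the href values and the SampleDemo class name are assumptions made for illustration; only the id div1 is confirmed by the XPath queries used in this post. The sketch parses the markup from a string with LoadHtml, so you can try it without creating a file on disk.

```csharp
using System;
using HtmlAgilityPack;

public static class SampleDemo
{
    // Assumed reconstruction of sample.txt: the id "div2" and all href
    // values are placeholders chosen for illustration.
    public const string SampleHtml = @"
<html>
<body>
  <div id='div1'>
    <a href='#link1'>Link 1 inside div1</a>
    <a href='#link2'>Link 2 inside div1</a>
  </div>
  <a href='#link3'>Link 3 outside all divs</a>
  <div id='div2'>
    <a href='#link4'>Link 1 inside div2</a>
    <a href='#link5'>Link 2 inside div2</a>
  </div>
</body>
</html>";

    // Parses the markup from a string instead of reading a file.
    public static HtmlDocument LoadSample()
    {
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(SampleHtml);
        return document;
    }

    public static void Main()
    {
        HtmlDocument document = LoadSample();
        // SelectNodes returns null when nothing matches, so guard the count.
        HtmlNodeCollection links = document.DocumentNode.SelectNodes("//a");
        Console.WriteLine(links == null ? 0 : links.Count);
    }
}
```

If you prefer to follow the post exactly, save the same markup as C:\Temp\sample.txt and keep using document2.Load.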
Get all Hyperlinks in a page
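Once we have the HTML document loaded, let us see how we can get all hyperlinks from the page. ToArray() is a LINQ extension method, so the file also needs using System.Linq; at the top.
HtmlDocument document2 = new HtmlDocument();
document2.Load(@"C:\Temp\sample.txt");
// "//a" matches every anchor element anywhere in the document
HtmlNode[] nodes = document2.DocumentNode.SelectNodes("//a").ToArray();
foreach (HtmlNode item in nodes)
{
    // InnerHtml is the markup between the opening and closing <a> tags
    Console.WriteLine(item.InnerHtml);
}
For the sample file, this prints the text of each of the five links:
Link 1 inside div1
Link 2 inside div1
Link 3 outside all divs
Link 1 inside div2
Link 2 inside div2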
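One thing to watch for: SelectNodes returns null rather than an empty collection when the XPath matches nothing, so calling ToArray() on the result throws on a page without links. The sketch below guards against that and also reads the href attribute of each link instead of its text. The LinkLister class, the GetHrefs method and the inline markup are hypothetical names and data used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkLister
{
    // Returns the href value of every anchor that has one.
    public static List<string> GetHrefs(string html)
    {
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        List<string> hrefs = new List<string>();
        HtmlNodeCollection anchors = document.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return hrefs; // no anchor matched the query
        }

        foreach (HtmlNode anchor in anchors)
        {
            // The second argument is the default returned if the attribute is missing.
            hrefs.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return hrefs;
    }

    public static void Main()
    {
        // Hypothetical inline markup, used only for this illustration.
        string html = "<p><a href='/first'>First</a> and <a href='/second'>Second</a></p>";
        foreach (string href in GetHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```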
Select a specific div in a page
To get a specific div in a page, we will use the following code:
HtmlDocument document2 = new HtmlDocument();
document2.Load(@"C:\Temp\sample.txt");
// First() comes from System.Linq and throws if no div matches
HtmlNode node = document2.DocumentNode.SelectNodes("//div[@id='div1']").First();
This code selects the div with id 'div1' from the page and returns it as an HtmlNode. You can now iterate over the ChildNodes property of the HtmlNode class to reach the child elements of that DOM element.
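As a rough sketch of that iteration, the code below walks ChildNodes and keeps only element nodes, because the collection also contains the whitespace text nodes between tags. It uses SelectSingleNode, which returns the first match or null, in place of SelectNodes(...).First(). The ChildLister class, the GetChildElementNames method and the inline markup are hypothetical and exist only for this example; building the XPath by string concatenation is fine for a demo but should not be done with untrusted input.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ChildLister
{
    // Returns the tag names of the element children of the div with the given id.
    public static List<string> GetChildElementNames(string html, string divId)
    {
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        List<string> names = new List<string>();
        // SelectSingleNode returns the first match, or null when there is none.
        HtmlNode div = document.DocumentNode.SelectSingleNode("//div[@id='" + divId + "']");
        if (div == null)
        {
            return names;
        }

        foreach (HtmlNode child in div.ChildNodes)
        {
            // Skip text and comment nodes; keep real elements only.
            if (child.NodeType == HtmlNodeType.Element)
            {
                names.Add(child.Name);
            }
        }
        return names;
    }

    public static void Main()
    {
        // Hypothetical inline markup, used only for this illustration.
        string html = "<div id='div1'><a href='#'>One</a> <span>Two</span></div>";
        foreach (string name in GetChildElementNames(html, "div1"))
        {
            Console.WriteLine(name);
        }
    }
}
```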
Select all Hyperlinks within a specific div
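To select all hyperlinks within a specific div, we can use either of the following two approaches:
HtmlDocument document2 = new HtmlDocument();
document2.Load(@"C:\Temp\sample.txt");
// Approach 1: select the div first, then search for anchors relative to it (".//a")
HtmlNode node = document2.DocumentNode.SelectNodes("//div[@id='div1']").First();
HtmlNode[] aNodes = node.SelectNodes(".//a").ToArray();
// Approach 2: a single XPath query that matches anchors anywhere under the div
HtmlNode[] aNodes2 = document2.DocumentNode.SelectNodes("//div[@id='div1']//a").ToArray();
For the sample file, both approaches select the same two hyperlinks, the ones inside div1:
Link 1 inside div1
Link 2 inside div1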
Filter hyperlinks for certain conditions
If you want to filter nodes based on conditions, you can also use LINQ to query the nodes and return only the ones you need. For example, the code below returns all hyperlinks whose link text contains 'div2'.
HtmlDocument document2 = new HtmlDocument();
document2.Load(@"C:\Temp\sample.txt");
// Keep only the anchors whose inner markup contains "div2"
HtmlNode[] nodes = document2.DocumentNode.SelectNodes("//a")
    .Where(x => x.InnerHtml.Contains("div2"))
    .ToArray();
foreach (HtmlNode item in nodes)
{
    Console.WriteLine(item.InnerHtml);
}
For the sample file, the above code gives the following output:
Link 1 inside div2
Link 2 inside div2
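A variation on the same idea, offered as a sketch rather than the only way to do it: Descendants("a") yields the anchors as a lazy sequence that is simply empty when there are none, so the LINQ chain needs no null check. It also compares against InnerText, which excludes any nested tags. The LinkFilter class, the FindLinkTexts method and the inline markup are hypothetical names and data for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinkFilter
{
    // Returns the text of every anchor whose text contains the keyword.
    public static List<string> FindLinkTexts(string html, string keyword)
    {
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        return document.DocumentNode
            .Descendants("a")                           // empty sequence if no anchors exist
            .Where(a => a.InnerText.Contains(keyword))  // filter on the visible text
            .Select(a => a.InnerText.Trim())
            .ToList();
    }

    public static void Main()
    {
        // Hypothetical inline markup, used only for this illustration.
        string html = "<a href='#'>Link in div1</a><a href='#'>Link in div2</a>";
        foreach (string text in FindLinkTexts(html, "div2"))
        {
            Console.WriteLine(text);
        }
    }
}
```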
Hope this post gives you a head start with HTML Agility Pack. If you have any questions or would like me to provide some support using this, please connect with me here.