Web scraping is fun and very useful: the internet holds a wealth of information, and building applications that use it is rewarding. There are also great tools for the job. If you are using C# as I am, a great one is Html Agility Pack (HAP). Let’s see how it works.
First, let’s create our project and install the Html Agility Pack NuGet package in it (I’m using .NET Core for my project, but I’m sure it works in other versions):
> dotnet new console
> dotnet add package HtmlAgilityPack
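Before doing any real scraping, a quick sanity check that the package is wired up correctly. This minimal sketch (the URL is just example.com as a stand-in) loads a page and prints its `<title>` element using HAP’s XPath support:

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // HtmlWeb downloads the page and parses it into an HtmlDocument.
        var web = new HtmlWeb();
        var doc = web.Load("https://example.com");
        // HAP supports XPath queries; grab the <title> node.
        var title = doc.DocumentNode.SelectSingleNode("//title");
        Console.WriteLine(title?.InnerText);
    }
}
```

If this prints the page title, everything is installed correctly.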
I’m not going to get into the legal aspects of scraping, so be careful about what you do. Also, a “risky” thing about web scraping is that you must know the structure of the page to be able to extract its content. You can discover this by inspecting the site with your browser’s developer tools, but a scraper built this way is prone to break whenever the site changes. But enough warnings! Let’s start coding.
For this example, I’m going to use a copy of Wikipedia’s Web scraping page and extract the names of the scraping techniques listed there (I read their content reuse guidelines and this looks fine, but if someone from Wikipedia is reading this and doesn’t like it, please contact me!)
The first method I use is finding the HTML header that begins this section. Thankfully it has an id, which is unique among all elements on the page. After this, I traverse the HTML node tree searching for elements with the class mw-headline, which contain the titles of the scraping techniques. The node traversal is a bit involved because we must go up and down the tree carefully, but overall it works:
public static void Scrape()
{
    var scraper = new HtmlWeb();
    var page = scraper.Load("https://vainolo.z14.web.core.windows.net/WebScraping.html");
    // The "Techniques" header marks the start of the section we want.
    var techniquesTitle = page.GetElementbyId("Techniques");
    var currNode = techniquesTitle.ParentNode.NextSibling;
    // Walk the tree until the next <h2>, which starts the following section.
    while(currNode.Name != "h2")
    {
        if(currNode.GetClasses().Contains("mw-headline"))
        {
            var headline = currNode.InnerText;
            Console.WriteLine(headline);
        }
        if(currNode.HasChildNodes)
        {
            // Go down into the subtree first.
            currNode = currNode.FirstChild;
        }
        else if(currNode == currNode.ParentNode.LastChild)
        {
            // Done with this subtree: climb up until an ancestor has a next sibling.
            while(currNode.ParentNode.NextSibling == null)
            {
                currNode = currNode.ParentNode;
            }
            currNode = currNode.ParentNode.NextSibling;
        }
        else
        {
            currNode = currNode.NextSibling;
        }
    }
}
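Because this traversal depends entirely on the page structure, it helps to be able to exercise it without hitting the network. HAP can parse a string directly with HtmlDocument.LoadHtml; the markup below is a made-up miniature of the structure described above (not the real Wikipedia markup), just to illustrate the idea:

```csharp
using System;
using HtmlAgilityPack;

class OfflineDemo
{
    static void Main()
    {
        // Parse a string instead of downloading; handy for testing traversal logic.
        var doc = new HtmlDocument();
        doc.LoadHtml(@"
            <h2><span id=""Techniques"">Techniques</span></h2>
            <h3><span class=""mw-headline"">Human copy-and-paste</span></h3>
            <h3><span class=""mw-headline"">Text pattern matching</span></h3>
            <h2>Next section</h2>");
        // GetElementbyId works the same on a locally parsed document.
        var start = doc.GetElementbyId("Techniques");
        Console.WriteLine(start.InnerText);
    }
}
```

The same Scrape logic can then be run against the local document instead of the live page.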
A second, shorter, and IMHO more elegant way to do this is with LINQ (which is an awesome .NET feature): find the starting header, take all the nodes until the next header, and print only the nodes that are headlines:
public static void Scrape()
{
    var scraper = new HtmlWeb();
    var page = scraper.Load("https://vainolo.z14.web.core.windows.net/WebScraping.html");
    // Skip everything up to (and including) the "Techniques" header,
    // then take nodes until the next <h2> starts a new section.
    var nodes = page.DocumentNode.Descendants()
        .SkipWhile(e => e.Id != "Techniques")
        .Skip(1)
        .TakeWhile(e => e.Name != "h2");
    foreach (var currNode in nodes)
    {
        if(currNode.GetClasses().Contains("mw-headline"))
        {
            var headline = currNode.InnerText;
            Console.WriteLine(headline);
        }
    }
}
Short, simple, and powerful.
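Since HAP also supports XPath, a third option is worth mentioning. This is only a sketch, and note two assumptions: the headlines are `<span>` elements (as they are on Wikipedia), and this query selects every mw-headline on the page, not just the ones inside the Techniques section, so it is broader than the LINQ version:

```csharp
using System;
using HtmlAgilityPack;

class XPathDemo
{
    static void Main()
    {
        var page = new HtmlWeb().Load("https://vainolo.z14.web.core.windows.net/WebScraping.html");
        // Select all headline spans anywhere in the document.
        var headlines = page.DocumentNode.SelectNodes("//span[@class='mw-headline']");
        if (headlines != null) // SelectNodes returns null when nothing matches
        {
            foreach (var node in headlines)
            {
                Console.WriteLine(node.InnerText);
            }
        }
    }
}
```

The null check matters: unlike LINQ, SelectNodes returns null rather than an empty collection when no nodes match.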
A full project with all the code can be found in my GitHub playground. Until next time, happy coding!