Continuing from my earlier post, Jsoup: nice way to do HTML parsing in Java, in this post I will compare JSoup with another similar framework, HtmlUnit. Both are good HTML parsing frameworks, and both can be used for web application unit testing and for web scraping. Here I will explain why I find HtmlUnit better suited to automated web application unit testing and JSoup better suited to web scraping.
Web application unit testing automation typically means automating web tests inside the JUnit framework. Web scraping is a way to extract unstructured information from the web into a structured format. I recently tried two decent web scraping tools, Webharvy and Mozenda.
For any HTML parsing tool to be useful, it should support either XPath-based or CSS selector-based element access. There are a lot of blogs comparing the two, such as Why CSS Locators are the way to go vs XPath, and CSS Selectors And XPath Expressions.
HtmlUnit is a powerful framework that can simulate pretty much anything a browser can do, such as click events and submit events, which makes it ideal for automated web application unit testing.
XPath-based parsing is simple and is the most popular approach, and HtmlUnit relies heavily on it. In one of my applications, I wanted to extract information from the web in a structured way, and HtmlUnit worked out very well for me. The problems start when you try to extract structured data from modern web applications that use jQuery and other Ajax features and use div tags extensively. In my experience, HtmlUnit and other XPath-based HTML parsers do not cope well with these pages. There is also a JSoup version that supports XPath, based on Jaxen. I tried this as well, and guess what? It was not able to access the data from modern web applications like ebay.com either.
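To make the XPath style of element access concrete, here is a minimal sketch. It uses only the JDK's built-in javax.xml.xpath API rather than HtmlUnit itself, so it only works on well-formed markup. The class name, the sample page and the XPath expression are all made up for illustration and are not taken from any real site.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Hypothetical, well-formed markup standing in for a scraped page.
    static final String PAGE =
        "<html><body>"
        + "<div class='item'><span class='title'>Camera</span></div>"
        + "<div class='item'><span class='title'>Tripod</span></div>"
        + "<div class='ad'><span class='title'>Sponsored</span></div>"
        + "</body></html>";

    // Returns the text of every title span that sits directly inside an "item" div.
    static List<String> extractTitles(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Assumed expression: match on exact class attribute values.
            NodeList nodes = (NodeList) xpath.evaluate(
                "//div[@class='item']/span[@class='title']",
                doc, XPathConstants.NODESET);
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                titles.add(nodes.item(i).getTextContent());
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractTitles(PAGE));
    }
}
```

Note that the expression matches the class attribute as an exact string, so a div with class "item featured" would be missed. That is one reason XPath expressions tend to get brittle on div-heavy pages.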
Extracting XPath and CSS Selector data
In most browsers, if you point to an element, right click and choose “Inspect element”, you can copy the element's XPath. I noticed that Firefox/Firebug can also copy the CSS selector path.
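Once you have a CSS selector from the browser, handing it to JSoup might look roughly like the sketch below. It assumes the jsoup library is on the classpath. The class name, the sample HTML and the selector are hypothetical and only there to show the shape of the call.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CssSelectorSketch {

    // Hypothetical markup standing in for a scraped page.
    static final String PAGE =
        "<html><body>"
        + "<div class='item featured'><span class='title'>Camera</span></div>"
        + "<div class='item'><span class='title'>Tripod</span></div>"
        + "<div class='ad'><span class='title'>Sponsored</span></div>"
        + "</body></html>";

    // Returns the text of every title span that sits directly inside an "item" div.
    static List<String> extractTitles(String html) {
        Document doc = Jsoup.parse(html);
        // Assumed selector: class matching works even when an element has several classes.
        return doc.select("div.item > span.title").eachText();
    }

    public static void main(String[] args) {
        System.out.println(extractTitles(PAGE));
    }
}
```

Unlike an exact attribute match, the class selector here still picks up the div whose class is "item featured", which is handy on pages that pile several classes onto each div.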
I hope this blog helped.