Continuing from my earlier post, HtmlUnit vs JSoup, this post shows how to write a simple web scraping sample using HtmlUnit. The example parses HTML and turns unstructured web data into a structured format.

In this simple example, we will connect to Wikipedia and get a list of movies along with their Wikipedia source links. The page looks like this:

HtmlUnit: Screen awards movie list

As always, let us start by adding HtmlUnit as a Maven dependency in our pom.xml:
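For reference, a dependency entry along these lines pulls in HtmlUnit. The groupId and artifactId are the coordinates under which the 2.x releases (the ones using the com.gargoylesoftware.htmlunit package seen in the test) are published. The htmlunit.version property is a placeholder and not a value from this post, so set it to the release you actually use.

```xml
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <!-- placeholder property: define it in <properties> with your chosen release -->
    <version>${htmlunit.version}</version>
    <scope>test</scope>
</dependency>
```

The test scope is an assumption that fits a project where HtmlUnit is only used from JUnit tests; drop it if you call HtmlUnit from main code.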


Next, we write a simple JUnit test case:

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;

import org.junit.Test;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class BestMovieListTest {

    @Test
    public void testBestMovieList() throws FailingHttpStatusCodeException, MalformedURLException, IOException {

        final WebClient webClient = new WebClient();
        // The URL of the Wikipedia page goes here.
        final HtmlPage startPage = webClient.getPage("");

        // XPath templates: ":?:" marks the table row index to fill in.
        String source = "/html/body/div[3]/div[3]/div[4]/table[2]/tbody/tr[:?:]/td[2]/i/a/@href";
        String[] sourceArr = source.split(":");

        String title = "/html/body/div[3]/div[3]/div[4]/table[2]/tbody/tr[:?:]/td[2]/i/a/@title";
        String[] titleArr = title.split(":");

        // First movie: table row 2.
        String titleData = titleArr[0] + 2 + titleArr[2];
        String sourceData = sourceArr[0] + 2 + sourceArr[2];
        List<DomNode> titleNodes = (List<DomNode>) startPage.getByXPath(titleData);
        assertTrue(titleNodes.size() > 0);
        List<DomNode> sourceNodes = (List<DomNode>) startPage.getByXPath(sourceData);
        assertTrue(sourceNodes.size() > 0);
        assertEquals("Hum Aapke Hain Kaun", titleNodes.get(0).getNodeValue());
        assertEquals("/wiki/Hum_Aapke_Hain_Kaun", sourceNodes.get(0).getNodeValue());

        // Second movie: table row 3.
        titleData = titleArr[0] + 3 + titleArr[2];
        sourceData = sourceArr[0] + 3 + sourceArr[2];
        titleNodes = (List<DomNode>) startPage.getByXPath(titleData);
        assertTrue(titleNodes.size() > 0);
        sourceNodes = (List<DomNode>) startPage.getByXPath(sourceData);
        assertTrue(sourceNodes.size() > 0);
        assertEquals("Dilwale Dulhaniya Le Jayenge", titleNodes.get(0).getNodeValue());
        assertEquals("/wiki/Dilwale_Dulhaniya_Le_Jayenge", sourceNodes.get(0).getNodeValue());
    }
}

Notice that the test accesses the page shown in the screenshot above. It fetches the first and second movies on the page (table rows 2 and 3), asserts their titles and links with JUnit, and the test succeeds. Notice also that the elements are addressed by XPath, for example /html/body/div[3]/div[3]/div[4]/table[2]/tbody/tr[2]/td[2]/i/a/@title. The :?: in each template string is a placeholder for the row number: splitting on ":" leaves the text before and after it, and the row index is concatenated in between. I extracted the XPath with Firebug, as described in my post HtmlUnit vs JSoup.
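To see the row-by-row XPath idea without a network call, here is a minimal sketch that uses the JDK's built-in javax.xml.xpath instead of HtmlUnit. It is not the code from the test. The class name RowXPathSketch, the helpers forRow and collect, and the inline SAMPLE_TABLE markup are all made up for illustration, and the sample assumes the first table row is a header row.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class RowXPathSketch {

    // Hypothetical stand-in for the Wikipedia table. Row 1 is assumed to be
    // a header row, which is why the data rows start at tr[2].
    static final String SAMPLE_TABLE =
          "<table><tbody>"
        + "<tr><th>No.</th><th>Film</th></tr>"
        + "<tr><td>1</td><td><i><a href=\"/wiki/Hum_Aapke_Hain_Kaun\""
        + " title=\"Hum Aapke Hain Kaun\">Hum Aapke Hain Kaun</a></i></td></tr>"
        + "<tr><td>2</td><td><i><a href=\"/wiki/Dilwale_Dulhaniya_Le_Jayenge\""
        + " title=\"Dilwale Dulhaniya Le Jayenge\">Dilwale Dulhaniya Le Jayenge</a></i></td></tr>"
        + "</tbody></table>";

    // Fills the ":?:" placeholder of a template with a concrete row index.
    static String forRow(String template, int row) {
        return template.replace(":?:", Integer.toString(row));
    }

    // Evaluates the template for rows 2, 3, ... and stops at the first row
    // for which the XPath yields no value.
    static List<String> collect(String xml, String template) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            List<String> values = new ArrayList<>();
            for (int row = 2; ; row++) {
                // For an attribute node the string value is the attribute value;
                // for an empty result it is the empty string.
                String value = xpath.evaluate(forRow(template, row), doc);
                if (value.isEmpty()) {
                    break;
                }
                values.add(value);
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + template, e);
        }
    }

    public static void main(String[] args) {
        List<String> titles = collect(SAMPLE_TABLE, "/table/tbody/tr[:?:]/td[2]/i/a/@title");
        List<String> links = collect(SAMPLE_TABLE, "/table/tbody/tr[:?:]/td[2]/i/a/@href");
        for (int i = 0; i < titles.size(); i++) {
            System.out.println(titles.get(i) + " -> " + links.get(i));
        }
    }
}
```

With HtmlUnit the same loop could call startPage.getByXPath on each generated path and stop when the returned list is empty, so the row numbers 2 and 3 would not need to be hard-coded.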

I hope this blog helped you.

