HtmlUnit Example for HTML Parsing in Java

Continuing from my earlier blog, HtmlUnit vs JSoup, in this blog I will show you how to write a simple web scraping sample using HtmlUnit. The example parses an HTML page and turns its unstructured web data into a structured format.

In this simple example, we will connect to Wikipedia and get a list of movies along with their Wikipedia source links. The page looks like this:

HtmlUnit: Screen awards movie list

As always, let us start by adding HtmlUnit as a Maven dependency in our pom.xml:

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.11</version>
</dependency>

Next, we write a simple JUnit test case:

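// Imports needed by the enclosing test class
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;

import org.junit.Test;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

@Test
@SuppressWarnings("unchecked") // getByXPath returns List<?>, so the casts below are unchecked
public void testBestMovieList() throws FailingHttpStatusCodeException, MalformedURLException, IOException {

    final WebClient webClient = new WebClient();
    final HtmlPage startPage = webClient.getPage("http://en.wikipedia.org/wiki/Screen_Award_for_Best_Film");

    // XPath templates; ":?:" marks where the table row index goes.
    // Splitting on ":" yields { prefix, "?", suffix }.
    String source = "/html/body/div[3]/div[3]/div[4]/table[2]/tbody/tr[:?:]/td[2]/i/a/@href";
    String[] sourceArr = source.split(":");

    String title = "/html/body/div[3]/div[3]/div[4]/table[2]/tbody/tr[:?:]/td[2]/i/a/@title";
    String[] titleArr = title.split(":");

    // Table row 2: the first movie in the list
    String titleData = titleArr[0] + 2 + titleArr[2];
    String sourceData = sourceArr[0] + 2 + sourceArr[2];
    List<DomNode> titleNodes = (List<DomNode>) startPage.getByXPath(titleData);
    assertTrue(titleNodes.size() > 0);
    List<DomNode> sourceNodes = (List<DomNode>) startPage.getByXPath(sourceData);
    assertTrue(sourceNodes.size() > 0);
    assertEquals("Hum Aapke Hain Kaun", titleNodes.get(0).getNodeValue());
    assertEquals("/wiki/Hum_Aapke_Hain_Kaun", sourceNodes.get(0).getNodeValue());

    // Table row 3: the second movie in the list
    titleData = titleArr[0] + 3 + titleArr[2];
    sourceData = sourceArr[0] + 3 + sourceArr[2];
    titleNodes = (List<DomNode>) startPage.getByXPath(titleData);
    assertTrue(titleNodes.size() > 0);
    sourceNodes = (List<DomNode>) startPage.getByXPath(sourceData);
    assertTrue(sourceNodes.size() > 0);
    assertEquals("Dilwale Dulhaniya Le Jayenge", titleNodes.get(0).getNodeValue());
    assertEquals("/wiki/Dilwale_Dulhaniya_Le_Jayenge", sourceNodes.get(0).getNodeValue());
}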
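
The test builds each row-specific XPath by splitting the template on ":" and gluing the pieces back together around the row number. A single placeholder replacement may be easier to read. The following is only a sketch of that idea: XPathTemplate and forRow are hypothetical names of my own, not part of HtmlUnit, and the ":?:" placeholder is the one already used in the test.

```java
// Hypothetical helper (not an HtmlUnit class): fills the ":?:" row
// placeholder of an XPath template with a concrete row index.
final class XPathTemplate {

    static final String ROW_PLACEHOLDER = ":?:";

    static String forRow(String template, int row) {
        if (!template.contains(ROW_PLACEHOLDER)) {
            throw new IllegalArgumentException("Template has no " + ROW_PLACEHOLDER + " placeholder");
        }
        // String.replace does a literal (non-regex) replacement
        return template.replace(ROW_PLACEHOLDER, Integer.toString(row));
    }

    public static void main(String[] args) {
        String title = "/html/body/div[3]/div[3]/div[4]/table[2]/tbody/tr[:?:]/td[2]/i/a/@title";
        System.out.println(forRow(title, 2));
        System.out.println(forRow(title, 3));
    }
}
```

The resulting string can then be passed to startPage.getByXPath(...) exactly as titleData and sourceData are in the test.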

As you can see, the test loads the page http://en.wikipedia.org/wiki/Screen_Award_for_Best_Film, shown in the screenshot above. It fetches the 1st and 2nd movies on the page (table rows 2 and 3), asserts their titles and links with JUnit, and the test succeeds. Note that the elements are accessed through XPath expressions such as /html/body/div[3]/div[3]/div[4]/table[2]/tbody/tr[2]/td[2]/i/a/@title. I extracted these XPaths with Firebug, as described in my earlier blog HtmlUnit vs JSoup.
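
To see how XPath expressions like these pick out the title and href attributes without hitting Wikipedia, here is a small offline sketch. It makes several assumptions: the table fragment is a made-up, well-formed stand-in for the real page, the JDK's javax.xml.xpath API is used instead of HtmlUnit so that it runs with no extra dependency, and the class and method names are hypothetical. Leaving the index off tr selects every matching row in one query instead of one row per query.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical offline sketch; uses the JDK XPath API, not HtmlUnit.
final class MovieTableXPathSketch {

    // Made-up stand-in for the Wikipedia table (assumption, not the real markup)
    static final String SAMPLE =
          "<table><tbody>"
        + "<tr><th>No.</th><th>Film</th></tr>"
        + "<tr><td>1</td><td><i><a href=\"/wiki/Hum_Aapke_Hain_Kaun\""
        + " title=\"Hum Aapke Hain Kaun\">Hum Aapke Hain Kaun</a></i></td></tr>"
        + "<tr><td>2</td><td><i><a href=\"/wiki/Dilwale_Dulhaniya_Le_Jayenge\""
        + " title=\"Dilwale Dulhaniya Le Jayenge\">Dilwale Dulhaniya Le Jayenge</a></i></td></tr>"
        + "</tbody></table>";

    // Returns title -> link for every row whose second cell holds an <i><a> link
    static Map<String, String> titlesToLinks(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // No index on tr: matches all rows; the header row has no td, so it is skipped
            NodeList anchors = (NodeList) xpath.evaluate(
                    "/table/tbody/tr/td[2]/i/a", doc, XPathConstants.NODESET);

            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                result.put(a.getAttribute("title"), a.getAttribute("href"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath on sample", e);
        }
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : titlesToLinks(SAMPLE).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

With HtmlUnit, a similar expression without the row index could be passed to startPage.getByXPath(...), though the exact path depends on the page's markup at the time you run it.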

I hope this blog helped you.
