Tag Archives: Latest in Java

HtmlUnit Example for html parsing in Java

Continuing from my earlier blog HtmlUnit vs JSoup, in this blog I will show you how to write a simple web scraping sample using HtmlUnit. The example parses HTML and turns unstructured web data into a structured format.

In this simple example, we will connect to Wikipedia and get the list of all movies and their Wikipedia source links. The page looks as below,

HtmlUnit: Screen awards movie list

As always let us start with a maven dependency entry in our pom.xml to include HtmlUnit as below,

<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>2.11</version>
</dependency>

Again we will start with a simple JUnit testcase as below,

@Test
public void testBestMovieList() throws FailingHttpStatusCodeException, MalformedURLException, IOException {

final WebClient webClient = new WebClient();
final HtmlPage startPage = webClient.getPage("http://en.wikipedia.org/wiki/Screen_Award_for_Best_Film");

String source = "/html/body/div[3]/div[3]/div[4]/table[2]/tbody/tr[:?:]/td[2]/i/a/@href";
String[] sourceArr = source.split(":");

String title = "/html/body/div[3]/div[3]/div[4]/table[2]/tbody/tr[:?:]/td[2]/i/a/@title";
String[] titleArr = title.split(":");

String titleData = titleArr[0] + 2 + titleArr[2];
String sourceData = sourceArr[0] + 2 + sourceArr[2];
List<DomNode> titleNodes = (List<DomNode>) startPage.getByXPath(titleData);
assertTrue(titleNodes.size() > 0);
List<DomNode> sourceNodes = (List<DomNode>) startPage.getByXPath(sourceData);
assertTrue(sourceNodes.size() > 0);
assertEquals("Hum Aapke Hain Kaun", titleNodes.get(0).getNodeValue());
assertEquals("/wiki/Hum_Aapke_Hain_Kaun", sourceNodes.get(0).getNodeValue());

titleData = titleArr[0] + 3 + titleArr[2];
sourceData = sourceArr[0] + 3 + sourceArr[2];
titleNodes = (List<DomNode>) startPage.getByXPath(titleData);
assertTrue(titleNodes.size() > 0);
sourceNodes = (List<DomNode>) startPage.getByXPath(sourceData);
assertTrue(sourceNodes.size() > 0);
assertEquals("Dilwale Dulhaniya Le Jayenge", titleNodes.get(0).getNodeValue());
assertEquals("/wiki/Dilwale_Dulhaniya_Le_Jayenge", sourceNodes.get(0).getNodeValue());
}

Notice that I am accessing the page http://en.wikipedia.org/wiki/Screen_Award_for_Best_Film, which is shown in the screenshot above. The test fetches the 1st and 2nd movies on the page (table rows tr[2] and tr[3]) and asserts their titles and links with JUnit, and the test succeeds. Also notice that I am using XPaths such as /html/body/div[3]/div[3]/div[4]/table[2]/tbody/tr[2]/td[2]/i/a/@title to access the elements; the ? placeholder in the template string is replaced with the row index. I extracted the XPath with Firebug, as described in the blog HtmlUnit vs JSoup.
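The row-index substitution can also be tried without a network connection. Below is a minimal sketch that uses only the JDK's javax.xml.xpath API instead of HtmlUnit. The class name XPathRowTemplate, the SAMPLE markup and the shortened XPath are assumptions made for illustration; the real Wikipedia page is far larger and its layout may change.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathRowTemplate {
    // Hypothetical, well-formed stand-in for the award table (row 1 is the header).
    static final String SAMPLE =
        "<table>"
      + "<tr><th>No</th><th>Film</th></tr>"
      + "<tr><td>1</td><td><i><a href='/wiki/Hum_Aapke_Hain_Kaun'"
      + " title='Hum Aapke Hain Kaun'>Hum Aapke Hain Kaun</a></i></td></tr>"
      + "<tr><td>2</td><td><i><a href='/wiki/Dilwale_Dulhaniya_Le_Jayenge'"
      + " title='Dilwale Dulhaniya Le Jayenge'>Dilwale Dulhaniya Le Jayenge</a></i></td></tr>"
      + "</table>";

    // %d is the row index, %s the attribute name (title or href).
    static final String TEMPLATE = "/table/tr[%d]/td[2]/i/a/@%s";

    // Fills in the template and evaluates it against the sample markup.
    static String attribute(int row, String name) {
        String expression = String.format(TEMPLATE, row, name);
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(SAMPLE)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Rows 2 and 3 hold the first two movies.
        for (int row = 2; row <= 3; row++) {
            System.out.println(attribute(row, "title") + " -> " + attribute(row, "href"));
        }
    }
}
```

A format string keeps the row index and the attribute name as explicit parameters, which is a little easier to read than splitting the template on ":".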

I hope this blog helped you.

HtmlUnit vs JSoup: html parsing in Java

Continuing from my earlier blog Jsoup: nice way to do HTML parsing in Java, in this blog I will compare JSoup with a similar framework, HtmlUnit. Both are good HTML parsing frameworks, and both can be used for web application unit testing and for web scraping. In this blog, I will explain why HtmlUnit is better suited to automated unit testing of web applications and JSoup is better suited to web scraping.

Web application unit testing automation typically means automating web tests in the JUnit framework. Web scraping is a way to extract unstructured information from the web into a structured format. I recently tried 2 decent web scraping tools, Webharvy and Mozenda.

For any good HTML parsing tool to click, it should support either XPath based or CSS Selector based element access. There are a lot of blogs comparing the two, like Why CSS Locators are the way to go vs XPath, and CSS Selectors And XPath Expressions.

HtmlUnit

HtmlUnit is a powerful framework that can simulate pretty much anything a browser can do, such as click events and submit events, and is ideal for automated unit testing of web applications.

XPath based parsing is simple and the most popular approach, and HtmlUnit relies on it heavily. In one of my applications, I wanted to extract information from the web in a structured way, and HtmlUnit worked out very well for this. But the problems start when you try to extract structured data from modern web applications that use JQuery and other Ajax features and use div tags extensively. In my experience, HtmlUnit and other XPath based HTML parsers did not work with these. There is also a JSoup version that supports XPath based on Jaxen. I tried this as well and, guess what, it was also not able to access the data from modern web applications like ebay.com.

Finally, my experience with HtmlUnit was that it is a bit buggy, or maybe I should call it unforgiving compared to a browser: if the target web application has missing JavaScript files, it throws exceptions. We can get around this, but it does not work out of the box.

JSoup

The latest version of JSoup deliberately does not support XPath and instead supports CSS Selectors very well. My experience was that it is excellent for extracting structured data from modern web applications. It is also far more forgiving if the web application has some missing JavaScript files.

Extracting XPath and CSS Selector data

In most browsers, if you point to an element, right click and choose “Inspect element”, you can extract the XPath information. I noticed that Firefox/Firebug can also extract the CSS Selector path, as shown below,

HtmlUnit vs JSoup: Extract CSS Path and XPath in FireBug

I hope this blog helped.

Jsoup: nice way to do HTML parsing in Java

Typically you do HTML parsing in Java for various reasons like JUnit testing, web crawling and others. I stumbled across JSoup and tried a few things to understand its capabilities. If you do some googling, you will come across a few good articles on Stackoverflow like What is a good java web crawler library? and JSoup vs HttpUnit.

I had already worked with HttpUnit extensively. I felt that JSoup is better than HttpUnit. Let me demonstrate a few of the capabilities of Jsoup in this blog,

Connecting to any website and parsing the data from that website into a DOM tree is as simple as,

URL url = new URL("http://gosmarter.net?query=cars");
Document doc = Jsoup.parse(url, 3000);

Here the integer passed to the parse method is the timeout in milliseconds; if connecting to or downloading from the site takes longer than that, the call gives up with an IOException instead of waiting.
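To make the meaning of the timeout concrete, here is a rough JDK-only sketch that sets the same kind of limit on a plain HttpURLConnection. This is not how Jsoup is implemented; the class name TimeoutSketch and the prepare helper are assumptions for illustration, and the sketch only configures the connection without opening it.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class TimeoutSketch {
    // Creates a connection object with the timeouts set; nothing is sent yet.
    static HttpURLConnection prepare(String address, int timeoutMillis) {
        try {
            HttpURLConnection connection =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
            // Give up if the TCP connection cannot be established in time.
            connection.setConnectTimeout(timeoutMillis);
            // Give up if the server stops sending data for this long.
            connection.setReadTimeout(timeoutMillis);
            return connection;
        } catch (IOException e) {
            throw new IllegalStateException("Could not prepare connection to " + address, e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection connection = prepare("http://gosmarter.net?query=cars", 3000);
        System.out.println("connect timeout: " + connection.getConnectTimeout() + " ms");
        System.out.println("read timeout: " + connection.getReadTimeout() + " ms");
    }
}
```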

If you want to retrieve a table or a div from the DOM tree you do as below,

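Iterator<Element> productList = doc.select("div[class=productList]").iterator();
// At least one matching div is expected
assertTrue(productList.hasNext());
while (productList.hasNext()) {
// next() advances the iterator; without it the loop never ends
Element product = productList.next();
//Do some processing
}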

If you want to extract a link URL, you do it this way,

Element productLink = product.select("a").first();
String href = productLink.attr("abs:href");

Note that in the above code “abs:href” returns the absolute URL even if the path in the page is relative. The Element class is a jsoup class with capabilities like the select method, which queries elements using jsoup's CSS-like selector syntax. It also has an attr method, which retrieves a specific attribute of a given element; in this example, we retrieve the href attribute of the “a” link tag. The first method always returns the 1st matching element, which helps when there are many “td”, “tr” or “li” tags.
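What "absolute" means here can be illustrated with the JDK alone. The sketch below resolves a link against the URL of the page it was found on using java.net.URI; it only shows the general resolution rule, not Jsoup's own code. The class name AbsoluteLinkSketch, the base URL path and the sample hrefs are assumptions.

```java
import java.net.URI;

class AbsoluteLinkSketch {
    // Resolves href against the URL of the page that contains it.
    static String absolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://gosmarter.net/products/list.html"; // hypothetical page URL
        System.out.println(absolute(page, "item/42"));              // relative to the page folder
        System.out.println(absolute(page, "/wiki/Foo"));            // relative to the site root
        System.out.println(absolute(page, "http://example.com/a")); // already absolute
    }
}
```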

You can also get a specific element in a “td” or a “tr” or a “li” html tag as below,

Element descLi = product.select( "li:eq(0)").first();

Note that the select query above requests the element at index 0, that is, the 1st element of the list. The syntax is “li:eq(0)”.

You can retrieve the text within a tag, for example, if you want to retrieve the text in the “a” link html tag, you do as below,

Element descA = product.select( "a").first();
String desc = descA.text();

Note text method is used to retrieve the text.

Finally, if you want to retrieve the entire HTML content of an element, you can do as below,

Element descA = product.select( "a").first();
String descHtmlData = descA.html();

Note that the html method retrieves the HTML content of an element. This is useful for debugging.

The jar is also available in the Maven repository with the dependency below,

<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.7.1</version>
</dependency>

I hope this blog helped you.

Enabling CLIENT-CERT based authorization on Tomcat

Continuing from my earlier blog Enabling SSL on Tomcat, in this blog I will go to the next step and enable CLIENT-CERT based authorization on Tomcat. Again, if you want to try out the code, go to my Github and download the code.

For this sample, I assume that you have tried my earlier SSL example on Tomcat and have that setup in place. As per the SSL example, I assume,

  • You have set up Tomcat 6.0
  • You have set the SSL Connector configuration in the Tomcat server.xml
  • You have started the Tomcat server and run the SecureHttpClient0Test test

In this blog, I will show you how to,

Setup MemoryRealm

In server.xml, comment out the existing Realm tag and replace it with the one below,

<Realm className="org.apache.catalina.realm.MemoryRealm" />

Setup user role

In <tomcat home>/conf/tomcat-users.xml, add the role and the user below,

<role rolename="secureconn"/>
<user username="CN=client1, OU=Application Development, O=GoSmarter, L=Bangalore, ST=KA, C=IN" password="null"  roles="secureconn"/>

Setup security-constraint

Add access control in the web.xml of the individual application as below,

<security-constraint>
<web-resource-collection>
<web-resource-name>Demo App</web-resource-name>
<url-pattern>/secure/*</url-pattern>
<http-method>GET</http-method>
</web-resource-collection>
<auth-constraint>
<role-name>secureconn</role-name>
</auth-constraint>
</security-constraint>
<login-config>
<auth-method>CLIENT-CERT</auth-method>
<realm-name>Demo App</realm-name>
</login-config>
<security-role>
<role-name>secureconn</role-name>
</security-role>
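To get a feel for which requests the constraint above covers, here is a small sketch of how I read the configuration: the url-pattern /secure/* is a path prefix pattern, and because http-method lists only GET, other methods are not covered by this constraint. The class ConstraintSketch and its methods are hypothetical; the real matching is done by the servlet container, so treat this as an approximation of the rules, not Tomcat's code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ConstraintSketch {
    // Values copied from the web.xml fragment above.
    static final String URL_PATTERN = "/secure/*";
    static final Set<String> METHODS = new HashSet<>(Arrays.asList("GET"));

    // Path prefix match: "/secure/*" covers "/secure" and everything below it.
    static boolean matchesPrefixPattern(String pattern, String path) {
        String prefix = pattern.substring(0, pattern.length() - 2); // strip "/*"
        return path.equals(prefix) || path.startsWith(prefix + "/");
    }

    // True when a request (path relative to the application) needs the secureconn role.
    static boolean requiresRole(String method, String path) {
        return METHODS.contains(method) && matchesPrefixPattern(URL_PATTERN, path);
    }

    public static void main(String[] args) {
        System.out.println(requiresRole("GET", "/secure/index.jsp"));  // covered
        System.out.println(requiresRole("GET", "/public/index.jsp"));  // different path
        System.out.println(requiresRole("POST", "/secure/index.jsp")); // method not listed
    }
}
```

If every HTTP method should be protected, one option is to drop the http-method element from the constraint.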

Run JUnit test

Open the src/test/java/com/goSmarter/test/SecureHttpClient1Test.java file and change the code below to point to your <tomcat home>/conf folder,

public static final String path = "D:/apache-tomcat-6.0.36/conf/";

Start Tomcat and run the JUnit test using “mvn test -Dtest=com.goSmarter.test.SecureHttpClient1Test”.

If you want to debug the Realm, increase the log level for the Realm in <tomcat-home>/conf/logging.properties as below,

org.apache.catalina.realm.level = ALL
org.apache.catalina.realm.useParentHandlers = true
org.apache.catalina.authenticator.level = ALL
org.apache.catalina.authenticator.useParentHandlers = true

Notice that there are 2 positive tests and 1 negative test. The negative test gets a 403 Forbidden status, based on the security-constraint, when the wrong certificate is sent. I hope this blog helped you.

Enabling SSL in Tomcat

For people in a hurry, get the latest code and follow the steps mentioned in Github.

There are lots of documents on the web on how to configure SSL in Tomcat. Tomcat Server/Client Self-Signed SSL Certificate and Mutual Authentication with CLIENT-CERT, Tomcat 6, and HttpClient stand out. But there is no simple example that demonstrates enabling SSL in Tomcat, and I spent days poring over documents and Googling before I got a working solution. In this blog I demonstrate using a simple Java Keystore to achieve a 2 way handshake. In my next blog I will show you how to use security-constraint to achieve CLIENT-CERT based access control.

This sample only works with Tomcat 6.0. Download the zip file and unzip it, then copy the 2 batch files client1cert.bat and client2cert.bat to the <tomcat-home>/conf location. Run both files in that order; they will create all the certificates required for the 2 way handshake.

Open server.xml and replace the <Connector> tag with the one below,

<Connector
clientAuth="true" port="8443" minSpareThreads="5" maxSpareThreads="75"
enableLookups="true" disableUploadTimeout="true"
acceptCount="100" maxThreads="200"
scheme="https" secure="true" SSLEnabled="true"
keystoreFile="${catalina.base}/conf/server.jks"
keystoreType="JKS" keystorePass="password"
truststoreFile="${catalina.base}/conf/server.jks"
truststoreType="JKS" truststorePass="password"
SSLVerifyClient="require" SSLEngine="on" SSLVerifyDepth="2" sslProtocol="TLS" />

Notice that clientAuth="true" is set, so the server requires a certificate from the client.
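On the client side, a 2 way handshake needs a key store (the client's own certificate and key) and a trust store (the certificates it accepts from the server). Below is a minimal JDK-only sketch of wiring both into an SSLContext. The class TwoWaySslSketch is hypothetical and is not the code from the Github project; to keep it runnable without any files it uses an empty in-memory keystore, where you would normally pass a FileInputStream for your own .jks file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

class TwoWaySslSketch {
    // Loads a JKS keystore; a null stream yields an empty in-memory store.
    static KeyStore loadJks(InputStream in, char[] password) {
        try {
            KeyStore store = KeyStore.getInstance("JKS");
            store.load(in, password);
            return store;
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException("Could not load keystore", e);
        }
    }

    // Builds a context that presents keys from 'keys' and trusts certificates in 'trusted'.
    static SSLContext clientContext(KeyStore keys, char[] keyPassword, KeyStore trusted) {
        try {
            KeyManagerFactory kmf =
                KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
            kmf.init(keys, keyPassword);
            TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            tmf.init(trusted);
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);
            return context;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not build SSL context", e);
        }
    }

    public static void main(String[] args) {
        char[] password = "password".toCharArray();
        // Placeholder: empty stores. A real client would load its .jks files here.
        KeyStore empty = loadJks(null, password);
        SSLContext context = clientContext(empty, password, empty);
        System.out.println("Protocol: " + context.getProtocol());
    }
}
```

With empty stores the context builds, but a real handshake against the server would fail, since there is no client certificate to present and no server certificate to trust.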

Copy the client0 folder to the <tomcat-home>/webapps directory. Finally start the server. Now, under the source code folder, go to client-cert-test, open the file src/main/java/com/goSmarter/test/SecureHttpClient0Test.java and change the line below to point to your <tomcat home>/conf location,


public static final String path = "D:/apache-tomcat-6.0.36/conf/";

Run “mvn test -Dtest=com.goSmarter.test.SecureHttpClient0Test”. You will notice that 1 test succeeded. If the test case passes, it means 2 way SSL is working correctly. Please look at the code and understand the flow. The JUnit test uses the HttpUnit API to access the secure web server. You will also notice that when you run the test, there are a lot of certificate related messages on the console. To make these appear, I turned on client side SSL debugging by putting the code below in the SecureHttpClient0Test.java class,

static {
System.setProperty("javax.net.debug", "ssl");
}

I hope this blog helped you.

Harnessing New Java Web Development stack: Play 2.0, Akka, Comet

For people in a hurry, here is the code and some steps to run a few demo samples.

Disclaimer: I am still learning Play 2.0, please point out to me if something is incorrect.

Play 2.0 is a web application stack that comes bundled with Netty for the HTTP server, Akka for loosely coupled backend processing and Comet / Websocket for asynchronous browser rendering. Play 2.0 itself does not do any session state management, but uses cookies to manage user sessions and flash data. Play 2.0 advocates a reactive model based on Iteratee IO. Please also see my blog on how Play 2.0 pits against Spring MVC.

In this blog, I will discuss some of these points and also discuss how Akka and Comet complement Play 2.0. The more I understand the Play 2.0 stack, the more I realize that Scala is better suited than Java to take advantage of its capabilities. There is a blog on how web developers view Play 2.0. To understand how Akka’s Actor pits against JMS, refer to this Stackoverflow writeup. Good documentation on Akka’s actors is here.

Play 2.0, Netty, Akka, Comet: How it fits

Play 2.0, Netty, Akka, Comet: How it fits

A servlet container like Tomcat ties up a thread for each request until the backend processing is complete. The Play 2.0 stack helps with use cases where, for example, you need to crawl the web and get product listings from various sources in a non-blocking and asynchronous way using a loosely coupled, message oriented architecture.

For example, the code below will not scale in the Play 2.0 stack, because Play serves requests on very few threads and the blocking code prevents other requests from being processed. In Play 2.0/Netty, the application instead registers a callback on the long running process, using frameworks like Akka, and the callback is invoked when the process completes, in a reactive pattern.

public static Result index() {
//Here is where you can put your long running blocking code like getting
// the product feed from various sources
return ok("Hello world");
}

The controller code to use Akka to work in a non-blocking way with async callback is as below,

public static Result index() {
return async(
future(new Callable<Integer>() {
public Integer call() {
//Here is where you can put your long running blocking code like getting
//the product feed from various sources

return 4;
}
}).map(new Function<Integer,Result>() {
public Result apply(Integer i) {

ObjectNode result = Json.newObject();

result.put("id", i);
return ok(result);
}
})
);
}
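The same "run the slow part elsewhere and map the result when it is ready" idea can be tried outside Play. Below is a minimal sketch that uses the JDK's CompletableFuture (Java 8 or later) as a stand-in for Play's future and async helpers. The class NonBlockingSketch and the fetchProductCount method are made up for illustration, and the JSON is built by hand instead of with Play's Json helper.

```java
import java.util.concurrent.CompletableFuture;

class NonBlockingSketch {
    // Stand-in for the long running blocking work (for example fetching product feeds).
    static int fetchProductCount() {
        return 4;
    }

    // Returns immediately; the result is mapped to a response body once it is available.
    static CompletableFuture<String> index() {
        return CompletableFuture
            .supplyAsync(NonBlockingSketch::fetchProductCount)   // runs on a pool thread
            .thenApply(count -> "{\"id\":" + count + "}");       // callback on completion
    }

    public static void main(String[] args) {
        // join() is used here only so that the demo waits before the JVM exits.
        index().thenAccept(System.out::println).join();
    }
}
```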

A cleaner and preferred way, using Akka’s Actor model, is as below,

public static Result sayHello(String data) {

Logger.debug("Got the request: " + data);

ActorSystem system = ActorSystem.create("MySystem");
ActorRef myActor = system.actorOf(new Props(MyUntypedActor.class), "myactor");

return async(
Akka.asPromise(ask(myActor, data, 1000)).map(
new Function<Object,Result>() {
public Result apply(Object response) {
ObjectNode result = Json.newObject();

result.put("message", response.toString());
return ok(result);
}
}
)
);
}

static public class MyUntypedActor extends UntypedActor {

public void onReceive(Object message) throws Exception {
if (message instanceof String){
Logger.debug("Received String message: " + message);

//Here is where you can put your long running blocking code like getting
//the product feed from various sources

getSender().tell("Hello world");
}
else {
unhandled(message);
}
}
}
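To see the ask pattern in isolation, here is a rough JDK-only sketch of the idea: messages go into a mailbox that is processed one at a time, and the caller gets a future for the reply with a timeout. It is not Akka and leaves out most of what Akka provides (supervision, addressing, remoting). The class MailboxSketch and its methods are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class MailboxSketch {
    // One shared "actor", created once rather than per request.
    static final MailboxSketch SHARED = new MailboxSketch();

    // A single daemon thread drains the mailbox, so messages are handled one at a time.
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor(runnable -> {
        Thread thread = new Thread(runnable, "mailbox");
        thread.setDaemon(true);
        return thread;
    });

    // Queues the message and returns a future that completes with the reply.
    CompletableFuture<Object> ask(Object message) {
        CompletableFuture<Object> reply = new CompletableFuture<>();
        mailbox.execute(() -> {
            try {
                reply.complete(onReceive(message));
            } catch (RuntimeException e) {
                reply.completeExceptionally(e);
            }
        });
        return reply;
    }

    // Message handler; only String messages are understood.
    private Object onReceive(Object message) {
        if (message instanceof String) {
            return "Hello world";
        }
        throw new IllegalArgumentException("Unhandled message: " + message);
    }

    // Asks the shared mailbox and waits at most 1000 ms for the reply.
    static String sayHello(String data) {
        try {
            return SHARED.ask(data).get(1000, TimeUnit.MILLISECONDS).toString();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for reply", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("No reply", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sayHello("cars"));
    }
}
```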

If you want to understand how to use Comet to asynchronously render data to the browser using Play, Akka and Comet, refer to the code in Github. Here is a good writeup comparing Comet and Websocket on Stackoverflow.

I hope this blog helped you.

Calling a Webservice from Java using Maven

For people in a hurry, get the latest code from Github and run “mvn test”.

Introduction

Typically, when you call a Webservice from Java, you need to create a stub class and use that stub class to call the Webservice. The stub class does the marshalling of the data and sends that data to the server. As best practices,

  • The stub class should be generated by a tool
  • You should build a JUnit test that exercises the generated stub, so that you notice, for example, if a method signature has changed
  • The stub class is always generated and is never checked into version control, so that your code quality tool does not measure the generated code
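One possible way to check a stub's method signature explicitly is reflection. The sketch below is only an illustration under assumptions: ConvertorSoap is a hand-written stand-in for a generated interface, and the hasMethod helper is hypothetical. A test that simply calls the generated method also catches signature changes, because it stops compiling.

```java
import java.lang.reflect.Method;

class StubSignatureCheck {
    // Hypothetical stand-in for a generated port interface.
    interface ConvertorSoap {
        double conversionRate(String from, String to);
    }

    // True when the type has a public method with this name, parameter list and return type.
    static boolean hasMethod(Class<?> type, String name,
                             Class<?> returnType, Class<?>... parameterTypes) {
        try {
            Method method = type.getMethod(name, parameterTypes);
            return method.getReturnType().equals(returnType);
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        boolean ok = hasMethod(ConvertorSoap.class, "conversionRate",
                               double.class, String.class, String.class);
        System.out.println("conversionRate(String, String) present: " + ok);
    }
}
```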

Details: Webservices with Maven

In this blog I will show you how to call a Webservice from Java by using the JAXWS plugin in Maven to generate the stubs at build time. I am calling a standard Currency Conversion Webservice (http://www.webservicex.net/CurrencyConvertor.asmx?WSDL), where you pass a source currency and a target currency and it returns the exchange rate in real time. The Maven config is as below,

<build>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>jaxws-maven-plugin</artifactId>
<version>1.9</version>
<executions>
<execution>
<goals>
<goal>wsimport</goal>
</goals>
<phase>generate-sources</phase>
</execution>
</executions>
<configuration>
<wsdlUrls>
<wsdlUrl>
http://www.webservicex.net/CurrencyConvertor.asmx?WSDL
</wsdlUrl>
</wsdlUrls>
</configuration>
</plugin>
</plugins>
</build>

With the above jaxws plugin in place, running any Maven command like test or package generates the stubs in the target/jaxws/wsimport/java folder. The Jaxws plugin can be configured to create the stubs in any folder.

The JUnit test looks as below,

@Test
public void test() {
CurrencyConvertor currencyConvertor = new CurrencyConvertor();

assertTrue(currencyConvertor.getCurrencyConvertorSoap().conversionRate(Currency.INR, Currency.USD) >  0.0D);
}

If you want to test this in the STS IDE, import the project as a Maven project and add target/jaxws/wsimport/java to the build path as a source folder. Once you have done that, you can use Run as -> JUnit test, and the test will be successful.

I hope you like it.