Monthly Archives: December 2012

HtmlUnit Example for html parsing in Java

Continuing from my earlier blog, HtmlUnit vs JSoup, this blog shows how to write a simple web scraping sample using HtmlUnit. The example parses HTML and turns unstructured web data into a structured format.

In this simple example, we will connect to Wikipedia and get a list of movies together with their Wikipedia source links. The page looks as below,

HtmlUnit: Screen awards movie list

As always, let us start by adding HtmlUnit as a Maven dependency in our pom.xml as below,

<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>2.11</version>
</dependency>

Next, we write a simple JUnit test case as below,

@Test
public void testBestMovieList() throws FailingHttpStatusCodeException, MalformedURLException, IOException {

final WebClient webClient = new WebClient();
final HtmlPage startPage = webClient.getPage("http://en.wikipedia.org/wiki/Screen_Award_for_Best_Film");

String source = "/html/body/div[3]/div[3]/div[4]/table[2]/tbody/tr[:?:]/td[2]/i/a/@href";
String[] sourceArr = source.split(":");

String title = "/html/body/div[3]/div[3]/div[4]/table[2]/tbody/tr[:?:]/td[2]/i/a/@title";
String[] titleArr = title.split(":");

String titleData = titleArr[0] + 2 + titleArr[2];
String sourceData = sourceArr[0] + 2 + sourceArr[2];
List<DomNode> titleNodes = (List<DomNode>) startPage.getByXPath(titleData);
assertTrue(titleNodes.size() > 0);
List<DomNode> sourceNodes = (List<DomNode>) startPage.getByXPath(sourceData);
assertTrue(sourceNodes.size() > 0);
assertEquals("Hum Aapke Hain Kaun", titleNodes.get(0).getNodeValue());
assertEquals("/wiki/Hum_Aapke_Hain_Kaun", sourceNodes.get(0).getNodeValue());

titleData = titleArr[0] + 3 + titleArr[2];
sourceData = sourceArr[0] + 3 + sourceArr[2];
titleNodes = (List<DomNode>) startPage.getByXPath(titleData);
assertTrue(titleNodes.size() > 0);
sourceNodes = (List<DomNode>) startPage.getByXPath(sourceData);
assertTrue(sourceNodes.size() > 0);
assertEquals("Dilwale Dulhaniya Le Jayenge", titleNodes.get(0).getNodeValue());
assertEquals("/wiki/Dilwale_Dulhaniya_Le_Jayenge", sourceNodes.get(0).getNodeValue());
}

Note that the test loads http://en.wikipedia.org/wiki/Screen_Award_for_Best_Film, the page shown in the screenshot above. It fetches the 1st and 2nd movies on the page (table rows 2 and 3) and asserts their titles and links, and the test succeeds. The elements are addressed with XPath expressions such as /html/body/div[3]/div[3]/div[4]/table[2]/tbody/tr[2]/td[2]/i/a/@title; the tr[:?:] placeholder in the template strings is replaced with the row number. I extracted the XPath using Firebug, as described in the blog HtmlUnit vs JSoup.
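The row-number substitution in the test can also be written with a format placeholder instead of splitting on ":". The sketch below is only an illustration of that idea: it uses the XPath engine that ships with the JDK rather than HtmlUnit, and the shortened template and the tiny sample page are assumptions made up for the example, not the real Wikipedia markup.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class RowXPathSketch {

    // Hypothetical, shortened template; %d stands for the 1-based table row.
    static final String TITLE_TEMPLATE = "/html/body/table/tr[%d]/td[2]/i/a/@title";

    // Hypothetical stand-in for the movie table (row 1 is a header row).
    static final String SAMPLE_PAGE =
        "<html><body><table>"
        + "<tr><th>No</th><th>Film</th></tr>"
        + "<tr><td>1</td><td><i><a href='/wiki/Hum_Aapke_Hain_Kaun'"
        + " title='Hum Aapke Hain Kaun'>link</a></i></td></tr>"
        + "<tr><td>2</td><td><i><a href='/wiki/Dilwale_Dulhaniya_Le_Jayenge'"
        + " title='Dilwale Dulhaniya Le Jayenge'>link</a></i></td></tr>"
        + "</table></body></html>";

    // Fills the row number into the template.
    static String xpathForRow(String template, int row) {
        return String.format(template, row);
    }

    // Evaluates the expression against well-formed markup and returns the
    // string value of the first match, or "" when nothing matches.
    static String evaluate(String markup, String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate(expression, new InputSource(new StringReader(markup)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or markup: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate(SAMPLE_PAGE, xpathForRow(TITLE_TEMPLATE, 2)));
        System.out.println(evaluate(SAMPLE_PAGE, xpathForRow(TITLE_TEMPLATE, 3)));
    }
}
```

With HtmlUnit, the string returned by xpathForRow would be passed to startPage.getByXPath in the same way as titleData in the test above.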

I hope this blog helped you.

HtmlUnit vs JSoup: html parsing in Java

Continuing from my earlier blog, Jsoup: nice way to do HTML parsing in Java, in this blog I will compare JSoup with a similar framework, HtmlUnit. Both are good HTML parsing frameworks, and both can be used for web application unit testing and for web scraping. I will explain why HtmlUnit is better suited to automated unit testing of web applications and JSoup is better suited to web scraping.

Web application unit testing automation is a way to automate web testing within the JUnit framework. Web scraping is a way to extract unstructured information from the web into a structured format. I recently tried 2 decent web scraping tools, Webharvy and Mozenda.

For an HTML parsing tool to be useful, it should support element access based on either XPath or CSS selectors. There are a lot of blogs comparing the two, such as Why CSS Locators are the way to go vs XPath, and CSS Selectors And XPath Expressions.

HtmlUnit

HtmlUnit is a powerful framework that can simulate pretty much anything a browser can do, such as click events and submit events, which makes it ideal for automated unit testing of web applications.

XPath-based parsing is simple and the most popular approach, and HtmlUnit relies heavily on it. In one of my applications, I wanted to extract information from the web in a structured way, and HtmlUnit worked out very well for this. The problems start when you try to extract structured data from modern web applications that use jQuery and other Ajax features and use div tags extensively. In my experience, HtmlUnit and other XPath-based HTML parsers did not work with these. There is also a JSoup version that supports XPath, based on Jaxen. I tried it as well, and it too was unable to access the data from modern web applications like ebay.com.

Finally, in my experience HtmlUnit was a bit buggy, or perhaps I should call it unforgiving compared to a browser: if the target web application has missing JavaScript files, it throws exceptions. We can get around this, but it will not work out of the box.

JSoup

The latest version of JSoup deliberately does not support XPath and instead supports CSS selectors very well. In my experience it is excellent for extracting structured data from modern web applications. It is also far more forgiving if the web application has some missing JavaScript files.

Extracting XPath and CSS Selector data

In most browsers, if you right-click an element and choose “Inspect element”, you can extract its XPath. I noticed that Firefox with Firebug can also extract the CSS selector path, as shown below,

HtmlUnit vs JSoup: Extract CSS Path and XPath in FireBug

I hope this blog helped.

Spring Integration FakeFtpServer example

For people in a hurry, the latest code and the steps are on GitHub.

To run the JUnit test, run “mvn test” and follow the test flow.

Introduction: FakeFtpServer

In this Spring Integration FakeFtpServer example, I will demonstrate using FakeFtpServer (from the MockFtpServer project) to JUnit test a Spring Integration flow. This is an interesting topic, and there are a few articles, such as Unit testing file transfers, that give some insight into it.

In this blog, we will test a Spring Integration flow that checks for a list of files, applies a splitter to separate each file, and downloads them to a local directory. Once the download is complete, it deletes the files on the FTP server. In my next blog, I will show how to JUnit test a Spring Integration flow with an SFTP server.

Spring Integration flow

Spring Integration FakeFtpServer example

To use FakeFtpServer, we need the Maven dependency below,


<dependency>
<groupId>org.mockftpserver</groupId>
<artifactId>MockFtpServer</artifactId>
<version>2.3</version>
<scope>test</scope>
</dependency>

The first step is to create and start a FakeFtpServer before every test runs, and to stop it afterwards, as below,


@Before
public void setUp() throws Exception {
fakeFtpServer = new FakeFtpServer();
fakeFtpServer.setServerControlPort(9999); // use any free port
FileSystem fileSystem = new UnixFakeFileSystem();
fileSystem.add(new FileEntry(FILE, CONTENTS));
fakeFtpServer.setFileSystem(fileSystem);
UserAccount userAccount = new UserAccount("user", "password", HOME_DIR);
fakeFtpServer.addUserAccount(userAccount);
fakeFtpServer.start();
}

@After
public void tearDown() throws Exception {
fakeFtpServer.stop();
}
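The setup above hard-codes port 9999, which fails if something else on the build machine already uses it. One common workaround, sketched here with only the JDK, is to ask the operating system for a free port first. The class and method names are made up for this illustration, and there is a small race: another process could grab the port between closing the probe socket and starting the server.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;

final class FreePortSketch {

    // Binds to port 0 so the OS picks a free port, then releases it again.
    static int findFreePort() {
        try (ServerSocket probe = new ServerSocket(0)) {
            return probe.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Free port: " + findFreePort());
    }
}
```

The returned value would then be passed to fakeFtpServer.setServerControlPort and reused in the client.connect call of the test, instead of the literal 9999.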

Finally run the JUnit test case as below,

@Autowired
private FileDownloadUtil downloadUtil;

@Test
public void testFtpDownload() throws Exception {
File file = new File("src/test/resources/output");
delete(file);
FTPClient client = new FTPClient();
client.connect("localhost", 9999);
client.login("user", "password");
String files[] = client.listNames("/dir");
client.help();
logger.debug("Before delete" + files[0]);
assertEquals(1, files.length);
downloadUtil.downloadFilesFromRemoteDirectory();
logger.debug("After delete");
files = client.listNames("/dir");
client.help();
assertEquals(0, files.length);
assertEquals(1, file.list().length);
}
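The test calls a delete(file) helper that is not shown in this blog. A minimal sketch of what such a helper might look like is below; this is an assumption about its behaviour (recursively removing the output directory), not the actual code from the repository.

```java
import java.io.File;

final class DeleteSketch {

    // Hypothetical stand-in for the delete(file) helper used in the test:
    // removes a file, or a directory together with everything under it.
    // Returns true when the path no longer exists afterwards.
    static boolean delete(File file) {
        File[] children = file.listFiles(); // null for plain files and missing paths
        if (children != null) {
            for (File child : children) {
                delete(child);
            }
        }
        return !file.exists() || file.delete();
    }

    public static void main(String[] args) {
        File root = new File(System.getProperty("java.io.tmpdir"), "delete-sketch-demo");
        new File(root, "a/b").mkdirs();
        System.out.println("Deleted: " + delete(root));
    }
}
```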

I hope this blog helped.

Spring Integration Mock SftpServer example

Continuing from my earlier blog, Spring Integration FakeFtpServer example, in this example I will show how to test a Spring Integration flow using a mock SFTP server. There are a few good write-ups on the net, including the Stack Overflow write-up Using Apache Mina as a Mock/In Memory SFTP Server for Unit Testing. The code for this blog is at Spring Integration flow to test Ftp/Sftp server.

To run the JUnit test, run “mvn test” and follow the test flow.

Using the same Spring Integration flow as in my earlier blog, I will write a test against an SFTP server,

Spring Integration Mock SftpServer example

The Maven dependencies for this are as below,


<dependency>
<groupId>org.apache.sshd</groupId>
<artifactId>sshd-core</artifactId>
<version>0.5.0</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>com.jcraft</groupId>
<artifactId>jsch</artifactId>
<version>0.1.49</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.2</version>
</dependency>

The JUnit setup and teardown are as below,


@Before
public void beforeTestSetup() throws Exception {
sshd = SshServer.setUpDefaultServer();
sshd.setPort(22999);

sshd.setKeyPairProvider(new SimpleGeneratorHostKeyProvider("hostkey.ser"));
sshd.setPasswordAuthenticator(new PasswordAuthenticator() {

public boolean authenticate(String username, String password, ServerSession session) {
// Accept any username and password; only acceptable for a test server
return true;
}
});

CommandFactory myCommandFactory = new CommandFactory() {

public Command createCommand(String command) {
System.out.println("Command: " + command);
return null;
}
};
sshd.setCommandFactory(new ScpCommandFactory(myCommandFactory));

List<NamedFactory<Command>> namedFactoryList = new ArrayList<NamedFactory<Command>>();
namedFactoryList.add(new SftpSubsystem.Factory());
sshd.setSubsystemFactories(namedFactoryList);
sshd.start();
}

@After
public void teardown() throws Exception {
sshd.stop();
}

The JUnit test is as below,

@Test
public void testPutAndGetFile() throws Exception {
JSch jsch = new JSch();

Hashtable config = new Hashtable();
config.put("StrictHostKeyChecking", "no");
JSch.setConfig(config);

Session session = jsch.getSession("remote-username", "localhost", 22999);
session.setPassword("remote-password");

session.connect();

Channel channel = session.openChannel("sftp");
channel.connect();

ChannelSftp sftpChannel = (ChannelSftp) channel;

final String testFileContents = "some file contents";

String uploadedFileName = "uploadFile";
sftpChannel.put(new ByteArrayInputStream(testFileContents.getBytes()), uploadedFileName);

String downloadedFileName = "downLoadFile";
sftpChannel.get(uploadedFileName, downloadedFileName);

File downloadedFile = new File(downloadedFileName);
assertTrue(downloadedFile.exists());

String fileData = getFileContents(downloadedFile);

assertEquals(testFileContents, fileData);

if (sftpChannel.isConnected()) {
sftpChannel.exit();
logger.debug("Disconnected channel");
}

if (session.isConnected()) {
session.disconnect();
logger.debug("Disconnected session");
}

}
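The test relies on a getFileContents helper that is not shown above. A possible implementation is sketched below; it is an assumption, not the code from the repository. It reads the file as UTF-8, whereas the test writes with the platform default charset via getBytes(), which is the same for plain ASCII content like “some file contents”. The writeTempFile method exists only to make the sketch runnable on its own.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

final class FileContentsSketch {

    // Hypothetical stand-in for getFileContents: reads the whole file as UTF-8 text.
    static String getFileContents(File file) {
        try {
            return new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper only: writes the given text to a new temporary file.
    static File writeTempFile(String contents) {
        try {
            File file = File.createTempFile("sftp-sketch", ".txt");
            Files.write(file.toPath(), contents.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File file = writeTempFile("some file contents");
        System.out.println(getFileContents(file));
        file.delete();
    }
}
```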

I hope this blog helped you.

Container based Security and Spring Security

One of the resources on the internet that discusses the differences between container-based security and Spring Security is the Spring Security FAQ. It lays out the strengths of Spring Security. Spring MVC applications and other Spring-based applications can take advantage of Spring Security.

Authentication establishes the user's identity, so that the application knows who has logged into the system. Authorization determines who can access which part of the application.
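To make the distinction concrete, here is a toy sketch in plain Java. It is neither how a container Realm nor how Spring Security implements these checks; the class is made up for illustration, and it stores plain-text passwords only to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

final class AuthSketch {

    private final Map<String, String> passwordByUser = new HashMap<String, String>();
    private final Map<String, String> roleByUser = new HashMap<String, String>();

    AuthSketch addUser(String name, String password, String role) {
        passwordByUser.put(name, password);
        roleByUser.put(name, role);
        return this;
    }

    // Authentication: is the caller really the user they claim to be?
    boolean authenticate(String name, String password) {
        String expected = passwordByUser.get(name);
        return expected != null && expected.equals(password);
    }

    // Authorization: may this user access a resource that requires the given role?
    boolean isAuthorized(String name, String requiredRole) {
        return requiredRole.equals(roleByUser.get(name));
    }

    public static void main(String[] args) {
        AuthSketch auth = new AuthSketch().addUser("client", "password", "secureconn");
        System.out.println("authenticated: " + auth.authenticate("client", "password"));
        System.out.println("authorized: " + auth.isAuthorized("client", "secureconn"));
    }
}
```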

As mentioned in the FAQ, a container offers Realm-based authentication that is configured in the container's server.xml, whereas Spring Security offers Authentication Providers that are configured in the application's own config file.

A simple Realm is Tomcat’s MemoryRealm, which relies on tomcat-users.xml to configure the users. In reality we would use LDAP or a database to store user information.

<!-- in the server.xml -->
<Realm className="org.apache.catalina.realm.MemoryRealm" />

<!-- in the tomcat-users.xml -->
<role rolename="secureconn"/>
<user username="client" password="password" roles="secureconn"/>
<role rolename="secureconn1"/>
<user username="client1" password="password" roles="secureconn1"/>

A simple Authentication Provider looks as below, and it is not specific to a container. Again, in reality the user information would be in LDAP or a database.

<authentication-manager>
<authentication-provider>
<!-- <password-encoder ref="encoder"/>-->
<user-service id="accountService">
<user name="client" password="" authorities="secureconn" />
<user name="client1" password="" authorities="secureconn1" />
</user-service>
</authentication-provider>
</authentication-manager>

In container-based security, authorization can be achieved with a simple security constraint mechanism in web.xml as below,

<security-constraint>
<web-resource-collection>
<web-resource-name>Demo App</web-resource-name>
<url-pattern>/secure/*</url-pattern>
<http-method>GET</http-method>
</web-resource-collection>
<auth-constraint>
<role-name>secureconn</role-name>
</auth-constraint>
</security-constraint>

In Spring we can achieve authorization as below, in the Spring Security context. Authorization in container-based security is limited to virtual folders within the container. With Spring Security we can provide URL patterns (Ant-style by default, with regular expressions also supported) and secure specific portions of the application.

<http use-expressions="true">
<!-- Rules are matched in order, so the most specific patterns come first -->
<intercept-url pattern="/secure1/**" access="hasRole('supervisor')"/>
<intercept-url pattern="/secure/**" access="isAuthenticated()" />
<intercept-url pattern="/**" access="permitAll" requires-channel="https"/>
</http>
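Spring Security checks the intercept-url rules in the order they are declared and applies the first one that matches, so a catch-all /** pattern belongs at the end. The sketch below illustrates that first-match behaviour in plain Java. It is a simplification, not Spring Security's matcher: the hypothetical matches method only understands exact paths and patterns ending in /**.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FirstMatchSketch {

    // Simplified matcher: exact paths, or a prefix match for patterns ending in "/**".
    static boolean matches(String pattern, String path) {
        if (pattern.endsWith("/**")) {
            String prefix = pattern.substring(0, pattern.length() - 3);
            return path.equals(prefix) || path.startsWith(prefix + "/");
        }
        return pattern.equals(path);
    }

    // Builds an insertion-ordered rule table from pattern/access pairs.
    static Map<String, String> rules(String... patternAccessPairs) {
        Map<String, String> ordered = new LinkedHashMap<String, String>();
        for (int i = 0; i + 1 < patternAccessPairs.length; i += 2) {
            ordered.put(patternAccessPairs[i], patternAccessPairs[i + 1]);
        }
        return ordered;
    }

    // Returns the access expression of the first matching rule, or null if none matches.
    static String accessFor(Map<String, String> orderedRules, String path) {
        for (Map.Entry<String, String> rule : orderedRules.entrySet()) {
            if (matches(rule.getKey(), path)) {
                return rule.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> specificFirst = rules(
            "/secure1/**", "hasRole('supervisor')",
            "/secure/**", "isAuthenticated()",
            "/**", "permitAll");
        System.out.println(accessFor(specificFirst, "/secure1/report"));
    }
}
```

If the /** rule were listed first, every path, including /secure1/report, would resolve to the catch-all rule and the more specific rules would never be reached.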

Spring also extends security to the service layer using access control lists (ACLs); refer to the Spring Security documentation.

For simple authentication and authorization needs, container-based security is enough, but for more complex, enterprise service-level security it makes sense to consider Spring Security.

I hope this blog helped.

Spring Security Certificate Authentication Authorization Example

For an introduction to Spring Security, refer to this blog.

Continuing from my earlier blog, Container based Security and Spring Security, in this blog I will demonstrate how to achieve certificate-based authentication and authorization in Spring Security. As with all my blogs, the sample code for this is on GitHub.

As mentioned in Enabling CLIENT-CERT based authorization on Tomcat,

  • You need to create the keystore information
  • You need to change the connector configuration in Tomcat to work with SSL
  • We don't need to configure MemoryRealm, because we use Spring Security for authentication

Then follow the steps below,

  • Go to the spring-mvc-client3 folder in the sample codebase and run “mvn clean package -DskipTests”. This creates a war file in the target folder; copy the war file into the Tomcat webapps folder.
  • Go to spring-mvc-client4, repeat the Maven command and copy that war file into the Tomcat webapps folder as well.
  • Start the Tomcat server by running <tomcat installed folder>/bin/startup.bat
  • Go to the spring-mvc-client3 folder, run “mvn test” and notice that all the tests are successful

In web.xml you need to add the following configuration,

<filter>
<filter-name>springSecurityFilterChain</filter-name>
<filter-class>org.springframework.web.filter.DelegatingFilterProxy</filter-class>
</filter>

<filter-mapping>
<filter-name>springSecurityFilterChain</filter-name>
<url-pattern>/*</url-pattern>
</filter-mapping>

<listener>
<listener-class>org.springframework.security.web.session.HttpSessionEventPublisher</listener-class>
</listener>

To understand what is going on, open spring-mvc-client3/src/main/webapp/WEB-INF/applicationContext-security.xml, where you will notice the configuration below,

<!-- No security is required for static resources such as css and js files, or for the loggedout.jsp page -->
<http pattern="/static/**" security="none"/>
<http pattern="/loggedout.jsp" security="none"/>

<http use-expressions="true">
<!-- Only a supervisor can access the secure1 web resources and everything under them, and only if the request is https -->
<intercept-url pattern="/secure1/**" access="hasRole('supervisor')" requires-channel="https"/>
<!-- You have to be logged in to access the secure folder and all resources under it -->
<intercept-url pattern="/secure/**" access="isAuthenticated()" requires-channel="https"/>
<!-- Everything else is open to any user, but still only over https -->
<intercept-url pattern="/**" access="permitAll" requires-channel="https"/>

<!-- Here is where you provide a regular expression to extract the user identity from the certificate and pass it to an authentication provider. In this example
there is a dummy authentication provider as below; in a real example the auth provider would be something like LDAP -->
<x509 subject-principal-regex="CN=(.*?)," user-service-ref="accountService" />
</http>

<authentication-manager>
<authentication-provider>
<!-- Dummy authentication provider -->
<user-service id="accountService">
<user name="client1" password="" authorities="supervisor" />
<user name="client2" password="" authorities="user" />
</user-service>
</authentication-provider>
</authentication-manager>
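The subject-principal-regex in the configuration above picks the user name out of the certificate's subject DN: the text captured by the first group becomes the principal that is looked up in accountService. The sketch below shows what that regex captures, using only java.util.regex. It is an illustration, not Spring Security's extractor; the sample DNs are made up, and returning null on a failed match is a choice made for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SubjectPrincipalSketch {

    // Same expression as the subject-principal-regex attribute in the configuration.
    static final Pattern SUBJECT_PRINCIPAL = Pattern.compile("CN=(.*?),");

    // Returns the first capture group, or null when the DN does not match.
    static String extractPrincipal(String subjectDn) {
        Matcher matcher = SUBJECT_PRINCIPAL.matcher(subjectDn);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical subject DN of a client certificate.
        System.out.println(extractPrincipal("CN=client1, OU=Dev, O=Example, C=US"));
    }
}
```

Note that this particular expression requires a comma after the CN value, so a DN consisting only of “CN=client1” would not match.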

Open the JUnit test at spring-mvc-client3/src/test/java/com/goSmarter/springsecurity/SecureHttpClient1Test.java. Based on the above configuration, I have 2 positive test cases and 1 negative test case. In testSecurePage, user “client1” can access the “secure1” folder because he is a “supervisor”, so it returns HTTP OK (200). In testSecurePageNegativeCase, user “client2” cannot access the “secure1” folder because he is only a “user”, so it returns HTTP Forbidden (403).

I hope this blog helped you.

Jsoup: nice way to do HTML parsing in Java

You typically parse HTML in Java for reasons such as JUnit testing, web crawling and others. I stumbled across JSoup and tried a few things to understand its capabilities. If you do some googling you will come across a few good discussions on Stack Overflow, such as What is a good java web crawler library? and JSoup vs HttpUnit.

I had already worked with HttpUnit extensively, and I felt that JSoup is better than HttpUnit. Let me demonstrate a few of the capabilities of Jsoup in this blog,

Connecting to a website and parsing its data into a DOM tree is as simple as,

URL url = new URL("http://gosmarter.net?query=cars");
Document doc = Jsoup.parse(url, 3000);

The integer passed to the parse method is the timeout in milliseconds; if fetching the page takes longer than that, the call gives up with an exception instead of waiting.

If you want to retrieve a table or a div from the DOM tree, you do as below,

Iterator<Element> productList = doc.select("div[class=productList]").iterator();
assertTrue(productList.hasNext());
while (productList.hasNext()) {
Element product = productList.next();
// Do some processing on product
}

If you want to extract the URL of a link, you do it this way,

Element productLink = product.select("a").first();
String href = productLink.attr("abs:href");

Note that in the above code, “abs:href” returns the absolute URL even if the href in the page is relative. The Element class is a jsoup class. Its select method queries the element using jsoup's CSS-like selector syntax, and its attr method retrieves a specific attribute of the element; in this example, we retrieve the href attribute of the “a” link tag. The first method always returns the first matching element when there are several, for example many “td”, “tr” or “li” tags.
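To see what resolving a relative href to an absolute URL means, here is a sketch using only java.net.URI. It does not show how jsoup implements “abs:href”; it only illustrates the kind of resolution against the page's base URL that the prefix asks for. The example.com addresses are placeholders.

```java
import java.net.URI;

final class AbsUrlSketch {

    // Resolves an href found in a page against the URL the page was loaded from.
    static String resolve(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/products/list.html";
        System.out.println(resolve(base, "item/42"));
        System.out.println(resolve(base, "/cart"));
        System.out.println(resolve(base, "http://other.example/page"));
    }
}
```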

You can also get a specific “td”, “tr” or “li” element by its position, as below,

Element descLi = product.select( "li:eq(0)").first();

Note that the select query above uses the syntax “li:eq(0)”, which matches “li” elements whose sibling index is 0, that is, an “li” that is the first child of its parent. The index is zero-based.

You can retrieve the text within a tag. For example, to retrieve the text in the “a” link tag, you do as below,

Element descA = product.select( "a").first();
String desc = descA.text();

Note that the text method is used to retrieve the text.

Finally, if you want to retrieve the entire HTML content of an element, you can do as below,

Element descA = product.select( "a").first();
String descHtmlData = descA.html();

Note that the html method returns the HTML content inside an element. This is useful for debugging.

JSoup is also available from the Maven Central repository as below,

<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.7.1</version>
</dependency>

I hope this blog helped you.