Tag Archives: Latest in Software

SPARQL and dbpedia: Getting structured data from wikipedia

I have always wondered whether we can extract structured data from Wikipedia. I stumbled upon DBPedia and SPARQL. DBPedia stores Wikipedia data as a dataset, and it can be queried using SPARQL. Let me demonstrate this with an example.

DBPedia has a SPARQL endpoint, and you can use SNORQL for exploring DBPedia. Let us execute the SPARQL query below in SNORQL and look at the result set that is returned,

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
# Declared explicitly so the query does not rely on prefixes predefined by the endpoint
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>

SELECT DISTINCT ?film_title ?star_name
WHERE {
  ?film_title rdf:type <http://dbpedia.org/ontology/Film> .
  ?film_title foaf:name ?film_name .
  ?film_title rdfs:comment ?film_abstract .
  ?film_title dbpedia-owl:starring ?star .
  ?star dbpprop:name ?star_name
}
LIMIT 5

I get the results as below,

SPARQL results from DBPedia
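The same kind of query can also be sent from code instead of SNORQL. Below is a minimal sketch, assuming the public endpoint lives at https://dbpedia.org/sparql and accepts a "format" parameter for JSON results; both are assumptions on my side, while "query" is the parameter name defined by the SPARQL protocol. The helper name buildSparqlUrl is made up for this example.

```javascript
// Sketch: build a GET URL for a SPARQL endpoint.
// ASSUMPTION: the endpoint URL and the "format" parameter are illustrative;
// "query" is the parameter name defined by the SPARQL protocol.
function buildSparqlUrl(endpoint, query) {
  const params = new URLSearchParams({
    query: query,
    format: "application/sparql-results+json",
  });
  return endpoint + "?" + params.toString();
}

const exampleQuery =
  "SELECT DISTINCT ?film WHERE { ?film a <http://dbpedia.org/ontology/Film> } LIMIT 5";
const exampleUrl = buildSparqlUrl("https://dbpedia.org/sparql", exampleQuery);

// In a browser or Node 18+, the URL could then be fetched, for example:
// fetch(exampleUrl)
//   .then((res) => res.json())
//   .then((json) => console.log(json.results.bindings));
```

URLSearchParams takes care of encoding the angle brackets, spaces and question marks in the query, which is the part that is easy to get wrong by hand.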

A good place to learn SPARQL is http://answers.semanticweb.com/.

I hope this article helps you.

GATE, NLTK: Basic components of Machine Learning (ML) System

Machine Learning Components

I am currently building a Machine Learning system. In this blog I want to capture the elements of a machine learning system.

My definition of a Machine Learning System is one that takes voice or text input from a user and provides relevant information. Over a period of time, it learns the user's behavior and provides better information. Let us hold on to this definition and dissect each element.

In the example below, we will consider only text input. Let us also assume that the text input will be free-flowing English text.

  • As a 1st step, when someone enters free-flowing text such as “I want a Phone”, we need to understand what the noun is, what the verb is, what the subject is and what the predicate is. For this we need a Parts of Speech (POS) analyzer. POS is one of the components of Natural Language Processing (NLP).
  • For associating a relationship between a noun and a number, like “Phone greater than 20 dollars”, we need to run the sentence through a rule engine. The terminology used for this is Semantic Rule Engine.
  • The 3rd aspect is the Ontology, wherein each noun needs to translate to a specific product or a place. For example, if someone says “I want a Bike”, it should translate to “I want a Bicycle”, and the system should know that the companies that manufacture bicycles are BSA or Trac. We typically need to build a Product Ontology.
  • Finally, if you have the buying patterns of a user and his friends in the system, we need a Recommendation Engine to give the user a proper recommendation.
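To make the rule engine and ontology steps above a little more concrete, here is a toy sketch. It is not how NLTK or GATE work; the ontology entries, the single rule and the function name parseConstraint are all made up for illustration, and a real system would use a POS tagger instead of a regular expression.

```javascript
// Toy sketch only: a real system would use a POS tagger and a rule engine
// instead of a lookup table and a regular expression.

// Hypothetical mini ontology mapping everyday words to product names.
const ontology = new Map([
  ["bike", "bicycle"],
  ["phone", "phone"],
]);

// Hypothetical rule: "<noun> greater|less than <number> dollars"
const constraintRule = /^(\w+)\s+(greater|less)\s+than\s+(\d+(?:\.\d+)?)\s+dollars?$/i;

function parseConstraint(text) {
  const match = constraintRule.exec(text.trim());
  if (!match) return null; // the sentence does not fit the rule
  const noun = match[1].toLowerCase();
  return {
    product: ontology.get(noun) || noun, // fall back to the raw noun
    operator: match[2].toLowerCase() === "greater" ? ">" : "<",
    amount: Number(match[3]),
  };
}

console.log(parseConstraint("Phone greater than 20 dollars"));
console.log(parseConstraint("Bike less than 150 dollars"));
```

The output is a small structured object (product, operator, amount) that a search or recommendation component could consume, which is the whole point of running free text through these steps.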

Two programming languages which have good support for these capabilities are Python and Java. Python has a framework called Natural Language Tool Kit (NLTK). To understand more about NLTK, please refer to the Cookbook. Java has frameworks such as GATE, Stanford NLP and OpenNLP.

For this discussion, we will consider GATE and related frameworks.

In my subsequent blogs I will talk about this in more detail.

Latest in JQuery, Javascript frameworks for Web development

Responsive Web Design is a useful paradigm wherein you build a web application once and it runs on any device, adjusting its layout to fit. There are a few JQuery frameworks which enable this.

For the past few days, I have been working with a few good JQuery/Javascript frameworks and understanding how they work. In this blog, I will show how JQuery works in a typical AJAX based environment and which aspects of JQuery help in writing better code. I will also show how some of the latest Javascript frameworks address these and what the future of web development looks like.

What is AJAX, and how do Javascript frameworks help in improving your application?

In the simplest terms, AJAX helps in partial rendering of a page, without loading the entire webpage, when a click event happens.

In a typical web application, to retrieve data from the server you do a server post back, rebuild the page on the server side and return it to the client, which reloads it. The disadvantage of this approach is that you will have business logic in the view.

With JQuery you can separate the concerns as below,

<div>
<ul id="productCatalog">
</ul>
</div>
<div>
<!--Right side filter-->
<legend>Smart Wall</legend>
<form>
<textarea id="searchkey" rows="2" cols="20" ></textarea>
<br/>
<br/>
<button type="submit" id="submitButton">
<i></i>Search
</button>
</form>

</div>

<script>
// Fetch matching products when the Search button is clicked
$("#submitButton").click(function (event) {
  event.preventDefault(); // stop the form from reloading the page
  var key = $("#searchkey").val();
  var url = "https://www.googleapis.com/shopping/search/v1/public/products?key=AIzaSyDTkzeQAoJDJb5Yvy3-bzIBXywZBK4kjkA&country=US&q=" + encodeURIComponent(key) + "&alt=json";
  $.getJSON(url, function (data) {
    // Render each returned item with the flickrTemplate template
    $("#flickrTemplate").tmpl(data.items).appendTo("#productCatalog");
  });
});
</script>

<include src="./datatemplates/productcatalog.html" />

productcatalog.html looks as below,

<script id="flickrTemplate" type="text/x-jquery-tmpl">
<li>
<div>
<a href="${product.link}" target="_blank">
<img src="${product.images[0].link}" alt="" width="150" border="0">
</a>
<h5>${product.title}</h5>
<p>Price: $ ${product.inventories[0].price} &nbsp;&nbsp;<a href="${product.link}" target="_blank">Buy</a></p>
</div>
</li>
</script>

How do you achieve separation of concerns in Javascript frameworks?

As you can see in the above example, there is no business logic in the html view. The magic is achieved by including jquery.tmpl.min.js in the html page as below,

<script src="assets/js/jquery.js"></script>
<script type="text/javascript" src="Scripts/jquery-ui-1.8.20.js"></script>
<script type="text/javascript" src="js/jquery.tmpl.min.js"></script>

The data binding in the template happens when you use the ${product.link} syntax. This is the JQuery Templates (jquery.tmpl) syntax. There are other good templating options that work alongside JQuery, Handlebars and Mustache to name a few.
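To get a feel for what such a template engine does with ${...} placeholders, here is a minimal sketch. It is not the jquery.tmpl implementation; the function names and the sample item are made up, and unlike a real engine it does no HTML escaping.

```javascript
// Minimal sketch of "${path}" substitution; NOT the jquery.tmpl implementation.
function resolvePath(data, path) {
  // "product.images[0].link" -> ["product", "images", "0", "link"]
  const keys = path.match(/[^.\[\]]+/g) || [];
  return keys.reduce(
    (value, key) => (value == null ? undefined : value[key]),
    data
  );
}

function renderTemplate(template, data) {
  return template.replace(/\$\{([^}]+)\}/g, function (whole, path) {
    const value = resolvePath(data, path.trim());
    return value == null ? "" : String(value); // missing values render as empty
  });
}

// Hypothetical item shaped like the fields used in productcatalog.html
const item = {
  product: {
    title: "USB drive",
    link: "http://example.com/p/1",
    images: [{ link: "http://example.com/p/1.png" }],
  },
};
const html = renderTemplate(
  '<a href="${product.link}"><img src="${product.images[0].link}"></a><h5>${product.title}</h5>',
  item
);
console.log(html);
```

The view only names the fields it wants; where the data comes from and how it is fetched stays in the script, which is the separation of concerns discussed above.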

How do advanced frameworks take this further?

There are a lot of great Javascript frameworks, and there is a good article comparing them.

My personal favorite is Angular.js. As you will notice, it offers much higher-level abstractions, like object binding and dependency injection on the client side, to name a few. Angular.js has good documentation and tutorials to learn from.

Meet Amazon EC2: Bigdata on the Cloud

I have been exploring Amazon EC2 as a Cloud based alternative for a mid-sized software product company. I stumbled across this Slideshow Presentation from Netflix, a pioneer in running its entire IT on Amazon’s EC2 platform. According to Netflix, they don’t have any data center. Amazing, isn’t it?

So I started exploring Amazon EC2 and how a company can run its entire IT in Amazon EC2. I was more interested in it from the technology standpoint.

For a start, in Amazon you can create various Linux instances, including Ubuntu, for free. They are elastic servers, where you can increase the RAM and processing power on demand. Once you set up the instance, you can ssh into the machine and do pretty much whatever you want. Refer to this youtube link for how to set up Amazon.

According to Amazon, you get 750 hours of free server usage per month; in simple words, that is plenty for testing your business idea. There are standard Amazon Machine Images (AMI) which come with various pre-configured stacks, including LAMP. Developing a decent Web application and exposing it to users is easy.

The interesting thing I noticed is that it has good Hadoop and MapReduce support. For more details on how to set up Hadoop in Amazon, refer to this youtube link. There are a few command-line interface (CLI) tools to manage Elastic MapReduce (EMR).

In Amazon, the equivalent of HDFS is S3, and the equivalent of Hadoop is Elastic MapReduce.

Hummingbird – real time data analytics

Recently I came across an interesting company called Gilt, which is one of the leading eCommerce companies. What is interesting about it is that the whole company runs on real time analytics. Instead of me explaining what real time analytics is, click on the Hummingbird Slideshare presentation, Hummingbird Vimeo link, Hummingbird website and Hummingbird Demo to understand more. The good news is they donated the code that runs this to Git here. Setting up Hummingbird is described in the Git location. By default, the Hummingbird tracking pixel server starts on port 8000 and the Monitor server starts on port 8080.

About Hummingbird

Hummingbird is built with Node.js and MongoDb. It works on the tracking pixel concept: the framework exposes a 1×1 tracking pixel, and you can embed this pixel in your html pages as below,

<img id="tracking-pixel" src="http://localhost:8000/tracking_pixel.gif?events=scAdd&productId=10&ip=12.232.43.121" alt="data analytics" />

In the query string of the image, pass whatever values you want to analyse, for example if you want to know how many times a product has been purchased and from which location. The data is stored in MongoDb and can later be analysed over a time range. The overall architecture is as below,

data analytics
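Since everything Hummingbird sees arrives in the query string of the pixel, building that URL correctly is most of the client-side work. Here is a small sketch, assuming the base URL and parameter names from the img tag above; the helper names are made up and are not part of Hummingbird.

```javascript
// Sketch: build and parse a tracking-pixel URL.
// The base URL and parameter names come from the <img> example above;
// the helper names are hypothetical and not part of Hummingbird.
function trackingPixelUrl(baseUrl, event) {
  return baseUrl + "?" + new URLSearchParams(event).toString();
}

// What a server could recover from the request URL of the pixel.
function parseTrackingEvent(pixelUrl) {
  return Object.fromEntries(new URL(pixelUrl).searchParams.entries());
}

const pixelUrl = trackingPixelUrl("http://localhost:8000/tracking_pixel.gif", {
  events: "scAdd",
  productId: "10",
  ip: "12.232.43.121",
});
console.log(pixelUrl);
console.log(parseTrackingEvent(pixelUrl));
```

In a page, the resulting URL would simply be assigned to the src of the tracking image.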

Understanding document based MongoDb NoSQL

Introduction: MongoDb NoSQL

This article touches upon various aspects of RDBMS and where RDBMS hits a roadblock in the enterprise world. It also talks about how we can overcome these roadblocks by relaxing a few key aspects of RDBMS, how a NoSQL based solution fits into this, and MongoDb as a popular NoSQL solution.

RDBMS related to enterprise scalabilities, performance and flexibilities

RDBMS evolved out of strong roots in mathematics, like relational and set theory. Some of its aspects are schema validation, normalized data to avoid duplication, atomicity, locking, concurrency, high availability and one version of the truth.

While these aspects have a lot of benefits in terms of data storage and retrieval, they can impact enterprise scalability, performance and flexibility. Let us consider a typical purchase order example. In the RDBMS world we will have 2 tables with a one-to-many relationship as below,

Consider that we need to store a huge number of purchase orders and we start partitioning. One way to partition is to have the OrderHeader table in one Db instance and the LineItem information in another. If you want to insert or update an order, you need to update both tables atomically, and you need a transaction manager to ensure atomicity. If you want to scale this further in terms of processing and data storage, you can only increase hard disk space and RAM.

The easy way to achieve scaling in RDBMS is Vertical Scaling.

Let us consider another situation: because of a change in our business, we added a new column called LineDesc to the LineItem table. Imagine that this application was running in production. To deploy this change, we need to bring down the server for some time for the change to take effect.

Achieving enterprise scalabilities, performance and flexibility

Fundamental requirements of modern enterprise systems are,

  1. Flexibility in scaling the database, so that multiple instances of the database can process the information in parallel
  2. Flexibility in absorbing changes to the database without long server downtimes
  3. The application/middle tier should not have to handle the object-relational impedance mismatch. Can we get away from it using techniques like JSON?

Let us go back to our PurchaseOrder example, relax some of the aspects of RDBMS like normalization (to avoid joins across lots of rows) and atomicity, and see if we can achieve some of the above objectives.

Below is an example of how we can store the PurchaseOrder (there are other, better ways of storing the information).

orderheader:{
orderdescription:"Krishna's Orders",
date:"Sat Jul 24 2010 19:47:11 GMT-0700(PDT)",
lineitems:[
{linename:"pendrive", quantity:"5"}, {linename:"harddisk", quantity:"10"}
]
}

If you look carefully, the purchase order is stored in a JSON document-like structure. You will also notice that we don’t need multiple tables, relationships and normalization, and hence there is no need to join. And since the schema qualifiers are within the document, there is no table definition.

You can store them as a collection of objects/documents. Hypothetically, if we need to store several million PurchaseOrders, we can chunk them into groups and store them in several instances.

If you want to retrieve PurchaseOrders based on specific criteria, for example all the purchase orders in which one of the line items is a “pendrive”, we can ask all the individual instances to search in “parallel” based on the same criteria, and one of them can consolidate the list and return the information to the client. This is the concept of Horizontal Scaling.
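The scatter-and-consolidate idea can be simulated in memory. The sketch below is only an illustration under my own assumptions: the shards are plain arrays rather than database instances, the placement rule (id modulo shard count) is made up, and the per-shard searches run one after another here, whereas real instances would work in parallel.

```javascript
// In-memory simulation of horizontal scaling; real shards would be separate
// database instances queried over the network in parallel.
function shardFor(orderId, shardCount) {
  return orderId % shardCount; // hypothetical placement rule
}

function distribute(orders, shardCount) {
  const shards = Array.from({ length: shardCount }, () => []);
  for (const order of orders) {
    shards[shardFor(order.id, shardCount)].push(order);
  }
  return shards;
}

// Send the same criteria to every shard, then consolidate the partial lists.
function scatterGather(shards, predicate) {
  const partials = shards.map((shard) => shard.filter(predicate));
  return partials.flat();
}

const orders = [
  { id: 1, lineitems: [{ linename: "pendrive", quantity: 5 }] },
  { id: 2, lineitems: [{ linename: "harddisk", quantity: 10 }] },
  { id: 3, lineitems: [{ linename: "pendrive", quantity: 1 }, { linename: "harddisk", quantity: 2 }] },
];
const shards = distribute(orders, 2);
const hasPendrive = (order) =>
  order.lineitems.some((li) => li.linename === "pendrive");
console.log(scatterGather(shards, hasPendrive).map((o) => o.id));
```

Because every order carries its line items inside the document, each shard can answer the question on its own without a join across instances.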

Because there is no separate table schema and the schema definition is included in the JSON object, we can change the document structure and store and retrieve it with just a change in the application layer. This does not need a database restart.

Finally, since the object structure is JSON, we can present it directly to the web tier or a mobile device, and they can render it.

NoSQL is a class of databases designed with the above aspects in mind.
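As a small illustration of absorbing such a change in the application layer, the sketch below reads line items with and without a description side by side. The field name linedesc and the function name are my own, chosen to mirror the LineDesc column from the earlier example.

```javascript
// Older documents have no linedesc field; newer ones do. Both shapes can sit
// in the same collection, and the application decides how to read them.
const lineItems = [
  { linename: "pendrive", quantity: 5 },
  { linename: "harddisk", quantity: 10, linedesc: "2 TB external drive" },
];

function describeLineItem(item) {
  const description =
    item.linedesc !== undefined ? item.linedesc : "(no description)";
  return item.linename + " x" + item.quantity + ": " + description;
}

lineItems.forEach((item) => console.log(describeLineItem(item)));
```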

MongoDb: Document based NoSQL

The MongoDb NoSQL database is document based and uses some of the above techniques to store and retrieve data. There are also NoSQL databases that are key-value based, like Redis, or column-family based, like Cassandra, which take similar approaches but with simpler data models.

To give an RDBMS analogy, Collections in MongoDb are similar to Tables, and Documents are similar to Rows. Internally, MongoDb stores the information as binary-serialized JSON objects called BSON. MongoDb supports a JavaScript style query syntax to retrieve BSON objects.

A typical document looks as below,

> post={
author:"Hergé",
date:new Date(),
text:"Destination Moon",
tags:["comic","adventure"] }

> db.posts.save(post)

------------

> db.posts.find()
{
_id:ObjectId("4c4ba5c0672c685e5e8aabf3"),
author:"Hergé",
date:"Sat Jul 24 2010 19:47:11 GMT-0700(PDT)",
text:"Destination Moon",
tags:["comic","adventure"]
}

In MongoDb, atomicity is guaranteed within a document. If you have to achieve atomicity outside of the document, it has to be managed at the application level. Below is an example,

Many to many:

products:{
_id:ObjectId("10"),
name:"DestinationMoon",
category_ids:[ObjectId("20"),ObjectId("30")]}
categories:{
_id:ObjectId("20"),
name:"adventure"}

//All products for a given category

>db.products.find({category_ids:ObjectId("20")})

//All categories for a given product
//findOne returns a single document, so category_ids can be read directly
>product=db.products.findOne({_id:some_id})

>db.categories.find({

_id:{$in:product.category_ids}
})
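The two lookups can be mimicked with plain arrays to see what the application is doing on behalf of the database. This is only an in-memory sketch: ids are plain strings instead of ObjectIds, the second category ("comic") is made up, and the function names are mine.

```javascript
// In-memory stand-ins for the products and categories collections.
const products = [
  { _id: "10", name: "DestinationMoon", category_ids: ["20", "30"] },
];
const categories = [
  { _id: "20", name: "adventure" },
  { _id: "30", name: "comic" }, // hypothetical second category
];

// All products for a given category: match on the embedded id array.
function productsInCategory(categoryId) {
  return products.filter((p) => p.category_ids.includes(categoryId));
}

// All categories for a given product: first load the product,
// then look up every category whose id is in its category_ids.
function categoriesOfProduct(productId) {
  const product = products.find((p) => p._id === productId);
  if (!product) return [];
  return categories.filter((c) => product.category_ids.includes(c._id));
}

console.log(productsInCategory("20").map((p) => p.name));
console.log(categoriesOfProduct("10").map((c) => c.name));
```

The second function makes two reads that are not atomic as a pair, which is exactly the part that has to be managed at the application level.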

In a typical stack that uses MongoDb, it makes a lot of sense to use a JavaScript based framework. As a web framework, we use the Express/Node.js/MongoDb stack. A good example of how to use this stack is here.

MongoDb NoSQL also supports sharding, which enables parallel processing/horizontal scaling. For more details on how a typical BigData system handles parallel processing/horizontal scaling, refer to Ricky Ho’s link.

Typical use cases for MongoDb include event logging, realtime analytics, content management and ecommerce. Use cases where it is not a good fit are transaction based banking systems and non-realtime data warehousing.

Cloudfoundry and MongoDB NoSQL sample application

Introduction: MongoDb NoSQL

Download the source code here.

Like a lot of you folks, I also had a question: what is cloud computing? I started googling, downloaded a few tools, played with them and understood a few concepts.

In this section I will discuss one of the key concepts of Cloud computing: Platform as a Service, a.k.a. PaaS, a.k.a. Cloud Platform. Basically, a Cloud Platform provides development support in a local environment and deployment into a remote environment, and the cloud platform will “introspect” the application to work out which webserver and which database it has to run with. Let me illustrate this with a diagram.

MongoDb NoSQL

There are a few leading players in this space: Salesforce.com (Heroku, Force.com) and Microsoft (Azure). Recently VMware entered this space with their own Cloudfoundry. Coming from a Java background, I find it a good tool for understanding the details of a Cloud Platform. It is well integrated with the STS IDE; it is a bit buggy, but there are workarounds.

This section discusses,

  1. How the Cloudfoundry Cloud Platform supports the webserver and database
  2. How the Cloudfoundry Cloud Platform helps in database seeding
  3. Deploying to the Cloud Platform
  4. The ‘glue’ to inform the Cloud Platform about the webserver and database
  5. The ‘glue’ to generate the dbschema and the seed data

Details

I will walk you through a simple example using Spring MVC and MongoDB, where you do basic CRUD operations on a person table. MongoDB is a document based database which is used to store large amounts of data; it also has Map/Reduce capabilities similar to Hadoop.

To quickly start on this,

Cloudfoundry support:

WebServer and database support: Cloudfoundry supports SpringSource tc Server; it also supports Jetty if used with Maven. On the database side, it supports MySQL, vPostgres and MongoDB. It can introspect the Spring context file and work out which type of database the application uses: if you have a bean of type “org.apache.commons.dbcp.BasicDataSource”, it can bind it to the respective database. If it is MongoDB, it needs a mongo-db-factory as shown in this example.

Database seed data population: Typically, if you have a “jdbc:initialize-database” configuration in your application, Cloudfoundry will execute that script in its bound database.
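Behind this binding, Cloud Foundry hands the credentials of bound services to the application through the VCAP_SERVICES environment variable, as a JSON document. The sketch below shows how a Node.js application might read it; it is not what the Spring integration does internally. The sample JSON is made up for illustration (its layout and every value in it are assumptions), and so are the function name and the local fallback URL.

```javascript
// ASSUMPTION: the JSON layout below is an illustrative sample, not captured
// from a real Cloudfoundry instance; every value in it is made up.
const sampleVcapServices = JSON.stringify({
  "mongodb-1.8": [
    {
      name: "mongodb-sample",
      label: "mongodb-1.8",
      credentials: {
        hostname: "10.0.0.5",
        port: 25001,
        username: "user",
        password: "secret",
        db: "db",
      },
    },
  ],
});

// Build a MongoDB connection URL from the first bound mongodb service.
function mongoUrlFromVcap(vcapJson) {
  const services = JSON.parse(vcapJson || "{}");
  for (const label of Object.keys(services)) {
    if (label.startsWith("mongodb") && services[label].length > 0) {
      const c = services[label][0].credentials;
      return "mongodb://" + c.username + ":" + c.password +
        "@" + c.hostname + ":" + c.port + "/" + c.db;
    }
  }
  return "mongodb://localhost:27017/db"; // hypothetical local fallback
}

console.log(mongoUrlFromVcap(sampleVcapServices));
// On the platform this would be: mongoUrlFromVcap(process.env.VCAP_SERVICES)
```

The point is that the application never hard-codes the database host; the same code runs locally and in the cloud, and the platform supplies the connection details.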

Configuration in the application:

Deploying to Cloudfoundry: There are 2 ways of deploying the code to Cloudfoundry: the command line and the STS IDE.

If you want to deploy and run the application using the VMC command line, you need to do the following,

Move to the target folder
vmc target http://api.{instancename}.cloudfoundry.me

vmc push

give the application name as ‘spring-mongodb’

Bind to ‘mongodb’

Save configuration

Now open the browser and type ‘http://spring-mongodb.{instancename}.cloudfoundry.me/’

If you want to deploy and run the application from STS IDE, you need to do the following,

  • For setting up your STS to work with Cloudfoundry, refer to this link.
  • Import the maven project into your STS
  • Create a new Cloudfoundry Server, add the spring-mongodb application, publish the application war to the Cloudfoundry host and see your changes.
  • You can also access the Remote System for error logs and files, as mentioned in the SpringSource Blog mentioned above

Glue to inform the Cloudfoundry about the database:

The key changes you have to make in your application for it to work with Cloudfoundry are,

  1. Maven changes
<!-- CloudFoundry -->
<dependency>
  <groupId>org.cloudfoundry</groupId>
  <artifactId>cloudfoundry-runtime</artifactId>
  <version>${org.cloudfoundry-version}</version>
</dependency>

  2. Mongodb configuration

Glue to generate the dbschema and the Seed data in Cloudfoundry:

You can create a Bean called InitService with a method init, and add all the seed data that is needed for the application as below,

import java.util.UUID;

public class InitService {
  private MongoTemplate mongoTemplate;

  public MongoTemplate getMongoTemplate() {
    return mongoTemplate;
  }

  public void setMongoTemplate(MongoTemplate mongoTemplate) {
    this.mongoTemplate = mongoTemplate;
  }

  // Inserts the seed data the application needs
  private void init() {
    Person p = new Person();
    p.setId(UUID.randomUUID().toString());
    p.setFirstName("John");
    p.setAge(25);
    mongoTemplate.save(p);
    p = new Person();
    p.setId(UUID.randomUUID().toString());
    p.setFirstName("Jane");
    p.setAge(20);
    mongoTemplate.save(p);
    // ..
  }
}

In the bean context, define a bean as below,


Other than these changes, everything else is the same as in any other Spring MVC application.

Going forward, you can get the Cloudfoundry samples. Using the Git utility you can clone them to your local machine. There is a simple hello-spring-mysql sample; using it you can quickly understand how a MySQL based application works.