Datawarehouse implementation using Hadoop+Hbase+Hive+SpringBatch – Part 1

In continuation of my earlier article, I proceeded a little more in understanding of what are these technologies and where are these used and a technical implementation of these. I want present this material into 2 parts,

  1. Introduction to some of the terminologies and explaining the use case we are going to implement
  2. Implementation details of the use case

Hadoop is primarily a distributed file system (DFS) where you can store terabytes/petabytes of data on low end commodity computers. This is similar to how companies like Yahoo and Google store their page feeds.

Hbase is a BigTable implementation which can either use file system or Hadoop DFS to store the data. It is similar to RDBMS like Oracle. There are 2 main differences between,

  1. It is designed to store sparse data, use case is where there are lots of columns and there are no data stored in majority of these columns
  2. It has a powerful versioning capabilities, a typical use case it to track the trajectory of a rocket over a period of time

Hive is built on the Map Reduce concept and is built on top of Hadoop. It supports SQL like query language. Its main purpose is datawarehouse analysis and adhoc querying. Hive can also be built on top of Hbase.

Spring Batch is a Spring framework for managing a batch workflow.

Usecase:

Let us say you have a website which has lot of traffic. We need to build an application which analyses the traffic like number of times the user has come to the website. Which location the website is accessed most. Below is the data flow,

Log file feed structure,

192.168.45.129 07:45
192.168.45.126 07:46
192.168.45.127 07:48
192.168.45.129 07:49

There will be a lookup table in Hbase as below,

192.168.45.129 kp Boston
192.168.45.126 kk Bangalore
192.168.45.127 ak Bangalore

Now we write 2 applications:

  1. Spring batch will run a command line to pull the log feed and push the ip address feed to hbase
  2. Java commandline will pull the hbase information using hive to display the metrics as follows

presentation useranalysis
presentation locationanalysis

Now that you know the technology pieces and high level solution, in my next article I will write how we can implement this with working code.

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s