Datawarehouse implementation using Hadoop+Hbase+Hive+SpringBatch – Part 2

The SVN codebase for this article is here.

In continuation of Part 1, this section covers:

  1. Set up Hadoop, HBase, and Hive on a new Ubuntu VM
  2. Run Hadoop, HBase, and Hive as services
  3. Set up the Spring Batch project and run the tests
  4. Some useful commands/tips

To begin with, let me explain the choice of Hive: the goal was not to use Hive as a JDBC equivalent, but to understand how to use it as a powerful data warehouse analytics engine.

Set up Hadoop, HBase, and Hive on a new Ubuntu VM

Download the latest Hadoop, HBase, and Hive from the Apache websites. You can also go to the Cloudera website, get the Cloudera Ubuntu VM, and use apt-get to install hadoop, hbase, and hive; that did not work for me, but if you are adventurous you can try it. You can also try MapR’s VMs. Both Cloudera and MapR have good documentation and tutorials.

Unzip the archives in your home directory, then edit the .profile file and add the bin directories to the PATH as below:

export HADOOP_HOME=<HADOOP HOME>
export HBASE_HOME=<HBASE HOME>
export HIVE_HOME=<HIVE HOME>
export PATH=$PATH:$HADOOP_HOME/bin:$HBASE_HOME/bin:$HIVE_HOME/bin
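
After saving, reload the profile so the new PATH takes effect:

source ~/.profile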

Next, create the Hadoop temp directory, make your login user its owner, and format the namenode:

sudo mkdir -p /app/hadoop/tmp
sudo chown <login user>:<group> /app/hadoop/tmp
hadoop namenode -format
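
The /app/hadoop/tmp directory is picked up through the hadoop.tmp.dir property in conf/core-site.xml. A minimal single-node sketch is below; note that this article mentions both port 54310 and port 9000 for HDFS, so pick one and use it consistently in fs.default.name and in the Spring configuration later:

<!-- conf/core-site.xml: minimal single-node sketch, adjust to your setup -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>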

With the HADOOP_HOME, HBASE_HOME, and HIVE_HOME environment variables set, run ifconfig and note the machine's IP address; it will be something like 192.168.45.129.

Edit the /etc/hosts file and add an entry like:

192.168.45.129 <machine name>

Run Hadoop, HBase, and Hive as services

Go to the Hadoop root folder and run the command below:

start-all.sh
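
Before opening a browser, you can quickly check that all the daemons came up with jps; on a healthy single-node setup it lists NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker:

jps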

Open a browser and access http://localhost:50070/; it will open the Hadoop admin console (the NameNode web UI). If there are issues, execute the command below and check for exceptions:

tail -f $HADOOP_HOME/logs/hadoop-<login username>-namenode-<machine name>.log

Hadoop (HDFS) runs on port 54310 in this setup.
Go to the HBase root folder and run the commands below:

start-hbase.sh
tail -f $HBASE_HOME/logs/hbase-<login username>-master-<machine name>.log

See if there are any errors. HBase runs on port 60000 by default.
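
You can also confirm HBase is up from its own shell; start it with hbase shell and type status at the hbase> prompt to see the number of live servers:

hbase shell
status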
Go to the Hive root folder and run the command below:

hive --service hiveserver -hiveconf hbase.master=localhost:60000 -hiveconf mapred.job.tracker=local

Notice that by passing the HBase master reference we have integrated Hive with HBase. Also, the Hive server's default port is 10000. Now run Hive as a command-line client as follows:

hive -h localhost

Create the seed tables as below:

  CREATE TABLE weblogs(key int, client_ip string, day string, month string, year string, hour string,
    minute string, second string, user string, loc string) row format delimited fields terminated by '\t';

  CREATE TABLE hbase_weblogs_1(key int, client_ip string, day string, month string, year string, hour string,
    minute string, second string, user string, loc string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:client_ip,cf2:day,cf3:month,cf4:year,cf5:hour,cf6:minute,cf7:second,cf8:user,cf9:loc")
    TBLPROPERTIES ("hbase.table.name" = "hbase_weblog");

  LOAD DATA LOCAL INPATH '/home/hduser/batch-wordcount/weblogs_parse1.txt' OVERWRITE INTO TABLE weblogs;

  INSERT OVERWRITE TABLE hbase_weblogs_1 SELECT * FROM weblogs;
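
At this point the web log data is stored in HBase and queryable from Hive. A couple of illustrative sanity checks (COUNT(1) is used for compatibility with older Hive versions):

  SELECT COUNT(1) FROM hbase_weblogs_1;
  SELECT * FROM hbase_weblogs_1 LIMIT 5;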

Set up the Spring Batch project and run the tests
To set up this project, get the latest code from the SVN repository mentioned at the beginning. Download Gradle and set up its path in .profile. Now run the command below to load the data:

  gradle -Dtest=org.springframework.data.hadoop.samples.DataloadWorkflowTests test

Run the JUnit test below to get the analysis data:

  gradle -Dtest=org.springframework.data.hadoop.samples.AnalyzeWorkflowTests test
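
For reference, here is a minimal sketch of what such a workflow test can look like; the real test classes are in the SVN repo, and the context file name below is an assumption:

import static org.junit.Assert.assertEquals;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration("/launch-context.xml") // assumed context file name
public class DataloadWorkflowTests {

    @Autowired
    private JobLauncher jobLauncher;

    @Autowired
    private Job job1; // the batch job defined in the Spring configuration below

    @Test
    public void runDataLoadJob() throws Exception {
        // Launch the Spring Batch job that executes the Hive load script
        JobExecution execution = jobLauncher.run(job1, new JobParameters());
        assertEquals(BatchStatus.COMPLETED, execution.getStatus());
    }
}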

The hadoopVersion is 1.0.2. The build.gradle file looks as below:

repositories {
  // Public Spring artefacts
  mavenCentral()
  maven { url "http://repo.springsource.org/libs-release" }
  maven { url "http://repo.springsource.org/libs-milestone" }
  maven { url "http://repo.springsource.org/libs-snapshot" }
  maven { url "http://www.datanucleus.org/downloads/maven2/" }
  maven { url "http://oss.sonatype.org/content/repositories/snapshots" }
  maven { url "http://people.apache.org/~rawson/repo" }
  maven { url "https://repository.cloudera.com/artifactory/cloudera-repos/"}
}

dependencies {
  compile("org.springframework.data:spring-data-hadoop:$version") {
    exclude group: 'org.apache.thrift', module: 'thrift'
  }
  compile "org.apache.hadoop:hadoop-examples:$hadoopVersion"
  compile "org.springframework.batch:spring-batch-core:$springBatchVersion"
  // update the version that comes with Batch
  compile "org.springframework:spring-tx:$springVersion"
  compile "org.apache.hive:hive-service:0.9.0"
  compile "org.apache.hive:hive-builtins:0.9.0"
  compile "org.apache.thrift:libthrift:0.8.0"
  runtime "org.codehaus.groovy:groovy:$groovyVersion"
  // see HADOOP-7461
  runtime "org.codehaus.jackson:jackson-mapper-asl:$jacksonVersion"
  testCompile "junit:junit:$junitVersion"
  testCompile "org.springframework:spring-test:$springVersion"
}
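
The $version-style placeholders are resolved from gradle.properties. An illustrative sketch is below; only hadoopVersion is stated in this article, the other values are assumptions, so treat the file in the repo as authoritative:

# gradle.properties: illustrative values, only hadoopVersion is given above
hadoopVersion = 1.0.2
version = 1.0.0.M1
springVersion = 3.1.1.RELEASE
springBatchVersion = 2.1.8.RELEASE
groovyVersion = 1.8.5
jacksonVersion = 1.8.8
junitVersion = 4.8.2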

The Spring Data Hadoop configuration looks as below:

<configuration>
  <!-- The value after the question mark is the default value if another value for hd.fs is not provided -->
  fs.default.name=${hd.fs:hdfs://localhost:9000}
  mapred.job.tracker=local
</configuration>

<hive-client host="localhost" port="10000" />

The Spring Batch job looks as below:

<batch:job id="job1">
  <batch:step id="import">
    <batch:tasklet ref="hive-script"/>
  </batch:step>
</batch:job>

The Spring Data Hive script for loading the data is as below:

<hive-tasklet id="hive-script">
  <script>
    LOAD DATA LOCAL INPATH '/home/hduser/batch-analysis/weblogs_parse.txt' OVERWRITE INTO TABLE weblogs;
    INSERT OVERWRITE TABLE hbase_weblogs_1 SELECT * FROM weblogs;
  </script>
</hive-tasklet>

The Spring Data Hive script for analyzing the data is as below:

<hive-tasklet id="hive-script">
  <script>
    SELECT client_ip, count(user) FROM hbase_weblogs_1 GROUP BY client_ip;
  </script>
</hive-tasklet>

Some useful commands/tips

For querying the Hadoop DFS you can use the familiar file-oriented Unix-style commands, like:

hadoop dfs -ls /
hadoop dfs -mkdir /hbase

If Hadoop has entered safe mode and is not starting up, you can execute the command below:

hadoop dfsadmin -safemode leave

If you want to find errors in the Hadoop filesystem, you can execute the command below:

hadoop fsck /
