High Performance Scientific Computing on the Cloud?

What is this blog about?

Scientific computing has long been dominated by elite scientists and engineers from premier universities and large engineering and aeronautical companies. There was little scope for students and enthusiasts to understand and learn it.

With the advent of Amazon, Rackspace and Penguin as cloud leaders, has this changed? Let us explore in this blog. This blog targets geeks like me who want to set up an open-source solver on the HPC cloud.

Background on what I have been doing:

For the past year, I have been working with one of the world's leading aeronautical companies, automating their aircraft analysis and design workflow processes. There were a lot of security restrictions on accessing any of their solvers or understanding how their High Performance Computing servers work. This is very close to my background in mechanical engineering, and since I had done a project in CFD and Finite Element Analysis using Ansys a while ago, I wanted to understand what is going on. Being a geek, I was curious to understand the nuts and bolts of High Performance Scientific Computing. Below is what I have understood so far.

Nuts and Bolts of High Performance Scientific Computing

Solvers:

There are a lot of open source solvers, and there is a good discussion of them in "Why isn’t open source CFD solution for everyone?". One of the leading ones is the Stanford University Unstructured (SU2) solver. SU2 is the backbone of some cloud-based tools like SimScale. Interestingly, the source code is on GitHub, so let us dive into it. It takes hardly a couple of hours to set up this solver on Ubuntu Linux, run a simple airfoil mesh generation, and view the airfoil in an open-source plot viewer, ParaView.

For a quick start, you need g++ and the make utility. Git clone the SU2_EDU source code and run SU2_EDU first; the instructions are provided in the link itself. Before running, go to the bin directory and make a few tweaks to the configuration file ConfigFile_RANS.cfg, as below.

# Change OUTPUT_FORMAT= TECPLOT to OUTPUT_FORMAT= PARAVIEW
OUTPUT_FORMAT= PARAVIEW

# Run the command ./SU2_EDU
# Select option 1
# Select the file airfoil_rae2822_lednicer.dat
# It will run for 2000 iterations and create the flow.vtk and solver_flow.vtk files.

Once your run is complete, open the ParaView GUI and load the flow.vtk and solver_flow.vtk files to see the airfoil mesh and volume grids.
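If you would rather script the configuration tweak than edit the file by hand, something like the following could work. This is only a minimal sketch: the helper name `set_config_option` and the sample text are my own, and it assumes the option sits on its own line in the `KEY= VALUE` form shown above.

```python
import re


def set_config_option(config_text: str, key: str, value: str) -> str:
    """Return config_text with the line that sets `key` rewritten to `value`.

    Assumes one option per line in the form `KEY= VALUE`.
    """
    pattern = re.compile(rf"^\s*{re.escape(key)}\s*=")
    patched_lines = []
    replaced = False
    for line in config_text.splitlines():
        if pattern.match(line):
            patched_lines.append(f"{key}= {value}")
            replaced = True
        else:
            patched_lines.append(line)
    if not replaced:
        raise KeyError(f"{key} not found in configuration")
    return "\n".join(patched_lines) + "\n"


# Hypothetical sample text standing in for ConfigFile_RANS.cfg.
sample = "SOME_OTHER_OPTION= 1\nOUTPUT_FORMAT= TECPLOT\n"
patched = set_config_option(sample, "OUTPUT_FORMAT", "PARAVIEW")
print(patched)
```

To apply it to the real file, read ConfigFile_RANS.cfg into a string, pass it through the helper and write the result back.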

Cloud based HPC:

A quick Google search will show that there are 2 leaders in HPC on the cloud, Rackspace and Penguin Computing. Penguin has a free plan to run a 5-minute HPC job. You just need to register and run the job to experience how to run a solver on HPC.

Integrating Solver with HPC:

If you want to run the solver on a cloud-based HPC server, the key is to rebuild SU2 with OpenMPI capabilities, as below:

./configure --with-MPI=mpicxx --with-Metis-lib=/usr/local/metis-4.0.3 --with-Metis-include=/usr/local/metis-4.0.3/Lib

Once you build it, you can upload the binaries to the cloud and run the solver on as many CPUs as you can; the response time will be considerably better. One way to check the details of the job is:

qstat -w <the PBS id returned>

Final Verdict

It is absolutely possible to set up a decent open-source solver on the HPC cloud and run a few solutions. I hope this blog was helpful.

Physics Engine and HTML5 Canvas

Intent of this blog space: The intent of this blog space is to build a testbed for testing various concepts of a Physics Engine, modelling computer graphics elements, and visually testing them on any device using a Physics Engine and HTML5 Canvas. For people in a hurry:

  • Click here to see the progress of the testbed
  • Get the latest code from my GitHub and follow the instructions to set up the application and run it locally.

In this blog I will discuss what a Physics Engine is and the various frameworks that support it. I will also discuss the technologies that let a Physics Engine work with HTML5 Canvas.

What is a Physics Engine?

A Physics Engine is a computer simulation tool that is used extensively in game development, education and various other applications. A quick YouTube search will yield various links showing applications of Box2D. This tool helps in applying physical laws like gravity, friction and force to objects displayed on the screen. There are a lot of different tools and languages that support Physics Engines. The Box2D API is one of the standard frameworks, and it is supported in various programming languages like C++, Java, JavaScript, Clojure and Python. All of these also provide a testbed to write applications using the Physics Engine and test them in a GUI. There is also an IDE built around the Physics Engine, called iforce2d RUBE.

What is HTML5 Canvas?

HTML5 Canvas is set to become the de facto standard for displaying graphics on a web page. Its closest competitors are Flash and WebGL. In the next few blogs I will discuss the architecture of how I built the application, so stay tuned.
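To make the idea of a physics engine applying gravity to an on-screen object a little more concrete, here is a minimal sketch of a single simulation step. It does not use Box2D or any real engine API; the `Body` class, the `step` function and the constants are made up for illustration, and only gravity and a simple bounce are modelled (no friction or other forces).

```python
from dataclasses import dataclass

GRAVITY = -9.81  # m/s^2, negative because it points down


@dataclass
class Body:
    y: float                   # height above the ground, in metres
    vy: float                  # vertical velocity, in m/s
    restitution: float = 0.5   # fraction of speed kept after a bounce


def step(body: Body, dt: float) -> None:
    """Advance the body by dt seconds (semi-implicit Euler integration)."""
    body.vy += GRAVITY * dt    # gravity changes the velocity first
    body.y += body.vy * dt     # the new velocity then moves the body
    if body.y < 0.0:           # the body went through the ground: bounce
        body.y = 0.0
        body.vy = -body.vy * body.restitution


ball = Body(y=10.0, vy=0.0)
for _ in range(600):           # 10 seconds at 60 steps per second
    step(ball, 1.0 / 60.0)
print(f"height after 10 s: {ball.y:.3f} m")
```

A real engine runs a loop like this for every body on every frame, adds collision detection between bodies, and hands the resulting positions to the drawing layer, which in this testbed would be the HTML5 Canvas.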

Java to JavaScript journey

Being a Java guy, I started off by spending time understanding which web frameworks are good for someone with a strong Java background. In the beginning of my journey I hated JavaScript, as I hate seeing the famous “undefined” errors in the browser, but that somewhat changed in the end. For most of my learning, I wanted to stick with current best practices and tools like Bootstrap, HTML5, Responsive Web Design (RWD), Single Page Application (SPA) and Model-View-ViewModel (MVVM).

Let me start with the Java world:

In the Java world, there are 2 competing frameworks with which you can pretty much follow all the best practices of web development like HTML5, Bootstrap, RWD and SPA: Vaadin (www.vaadin.com) and ZKOSS (www.zkoss.org). They take different approaches. Vaadin is built heavily on GWT (http://www.gwtproject.org/) and is 100% Java code, including the frontend markup. Vaadin 7, the latest edition, has poor support for CSS. I liked ZKOSS a lot:

  • it has a nice blend of markup and Java support,
  • it has good support for CSS (SCSS),
  • it has a Bootstrap theme out of the box,
  • it encourages SPA a lot, and
  • it has decent MVVM support, but you have to pay for it; I would rate Knockout.js much better than this.

I ended up building a decent Excel-based application on Heroku. As I proceeded to do complex things, I realized that in Java we end up with a lot of boilerplate code. I stumbled upon http://projectlombok.org/ to reduce some of the boilerplate, but there was still a lot of it. This made me wonder whether this is the reason people move towards modern programming languages like Scala, Groovy, and my latest favorite, Clojure.

Clojure in particular is interesting because it is JVM based, follows the Lisp way of working, lets you write efficient code and avoid boilerplate, and has ClojureScript for web development. I spent close to a month exploring whether it is good for web development; the good thing is that I learnt a new language. The big disadvantage is that it has a steep learning curve and is not for the faint-hearted: you need to be good at Emacs, and some good frameworks like http://hoplon.io/ only work on UNIX platforms.

JavaScript world:

In essence, I realized that if you want to be a good web developer you cannot avoid JavaScript; love it or hate it, embrace it. My journey started with understanding Ember.js. It is a good framework, but it is heavyweight and its documentation is poor. One of my friends pointed me to Knockout.js (it has strong MVVM support), and I loved it. I was also looking for an SPA framework and stumbled upon http://durandaljs.com/; in my opinion this is the right direction for SPA.

If you want a good IDE for JavaScript, NetBeans IDE is one of the best. I am spoilt by the STS IDE; features like IntelliSense, code formatting and running the app from within the IDE are a must. Luckily NetBeans supports all of these and more: it is tightly integrated with the Chrome browser and has a plugin within Chrome for running and debugging the frontend (much better than Firebug etc.).

Finally, for the server side, there were 2 candidates I was exploring: Node.js and Grails. Again, I have this love/hate relationship with JavaScript, and I am still not sold on Node.js. I am settling on Grails because it has strong STS IDE support. I am trying to rewrite my “excel on the web” application, which I deployed on Heroku, with durandaljs and Knockout.js on the client side and Grails/Groovy on the server side, with strong JSON contracts. I will share some of these learnings.

Conclusion:

As I mentioned earlier, embrace JavaScript; there are a lot of mature frameworks in JS (Ember.js, Angular.js, Backbone.js and Knockout.js are popular ones). If you have a good IDE you can actually be productive. Also, on the server side there are alternatives to Java like Groovy and Node.js. My preference is Groovy because I come from a Spring background and I love the STS IDE.

An insight on Big data

OLTP: Refers to Online Transaction Processing

It is a class of programs that help manage transaction-oriented applications (insert, update, delete and get). Most OLTP applications are fast because the database is designed using 3NF. OLTP systems are vertically scalable.

OLAP:  Refers to Online Analytical Processing

OLAP is used for business analytics and data warehousing workloads. Data processing is comparatively slow because it lets users interactively analyze multidimensional data from multiple data sources. The database schema used in OLAP applications is the STAR schema (facts and dimensions; normalization is given a pass, and redundancy is the order of the day). OLAP systems scale horizontally.

OLTP vs. OLAP: There are some major differences between OLTP and OLAP.
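To illustrate what a star schema query looks like, here is a small sketch using Python's built-in sqlite3 module. The table and column names (`fact_sales`, `dim_product`, `dim_date`) and the data are invented for the example; a real warehouse would be far larger and would typically not run on SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe the "who/what/when" of each fact.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);

    -- The fact table holds the measurements and points at the dimensions.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );

    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO dim_date VALUES (1, 2012), (2, 2013);
    INSERT INTO fact_sales VALUES (1, 1, 10.0), (1, 2, 15.0), (2, 2, 40.0);
    """
)

# A typical analytical query: total sales per category per year.
rows = conn.execute(
    """
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
    """
).fetchall()

for category, year, total in rows:
    print(category, year, total)
```

The query joins the fact table to its dimensions and aggregates over many rows, which is the access pattern OLAP systems are built for, as opposed to the single-row inserts and updates typical of OLTP.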

Data Sharding:

Traditional database architectures rely on vertical scaling, often combined with vertical partitioning, which means splitting a table into groups of columns and keeping them separately, either physically or in a logical grouping (tree structure). This leads to performance trouble as the data grows: we need to increase the memory, CPU and disk space every time we hit a performance problem.

To eliminate this problem, the Data Sharding or shared-nothing concept evolved. Here the database is scaled horizontally instead of vertically, using a master/slave architecture: the database is broken into shards, and the shards are spread across a number of servers, each of which can still be scaled vertically.

The Data Sharding concept is discussed in detail in this link.
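As a rough illustration of how records can be routed to shards, here is a minimal sketch of hash-based sharding. The `ShardedStore` class is hypothetical and each "server" is just an in-memory dictionary; it shows only the routing idea, not the master/slave replication mentioned above.

```python
import hashlib


class ShardedStore:
    """Toy key-value store that spreads keys across in-memory shards."""

    def __init__(self, shard_count: int) -> None:
        # Each dict stands in for a separate database server.
        self.shards = [{} for _ in range(shard_count)]

    def shard_index(self, key: str) -> int:
        # A stable hash, so the same key always lands on the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key: str, value: object) -> None:
        self.shards[self.shard_index(key)][key] = value

    def get(self, key: str) -> object:
        return self.shards[self.shard_index(key)][key]


store = ShardedStore(shard_count=4)
for user_id in range(100):
    store.put(f"user:{user_id}", {"id": user_id})

shard_sizes = [len(shard) for shard in store.shards]
print(shard_sizes)               # how the 100 keys were spread out
print(store.get("user:42"))
```

One design caveat: with plain modulo hashing, changing the number of shards remaps most keys, which is why consistent hashing is commonly used when shards are added or removed often.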

MPP (Massively Parallel Processing systems): Refers to the use of a large number of processors (or separate computers) to perform a set of coordinated computations in parallel.

MPP is also known as cluster computing or the shared-nothing architecture discussed above.

Examples of MPP systems are Teradata and Greenplum.

Vertical and Horizontal Scaling:

Horizontal scaling means that you scale by adding more machines to your pool of resources, whereas vertical scaling means that you scale by adding more power (CPU, RAM) to your existing machine.

In the database world, horizontal scaling is often based on partitioning the data, i.e. each node contains only part of the data. In vertical scaling the data resides on a single node, and scaling is done through multi-core, i.e. spreading the load between the CPU and RAM resources of that machine.

With horizontal scaling it is often easier to scale dynamically by adding more machines to the existing pool, whereas vertical scaling is often limited to the capacity of a single machine.

Below are the differences between vertical and horizontal scaling.

CAP Theorem: The CAP Theorem is described in this link.

Some insight on CAP: http://www.slideshare.net/ssachin7/scalability-design-principles-internal-session

Greenplum:  Greenplum Database is a massively parallel processing (MPP) database server based on PostgreSQL open-source technology. MPP (also known as shared nothing architecture) refers to systems with two or more processors which cooperate to carry out an operation – each processor with its own memory, operating system and disks.

A high-level overview of Greenplum is given in this link.

Some Interesting Links:

http://www.youtube.com/watch?v=ph4bFhzqBKU

http://www.youtube.com/watch?v=-7CpIrGUQjo

HBase: HBase is the Hadoop database. It is a distributed, scalable big data store. It provides real-time read and write access to large datasets and uses a cluster (master/slave) architecture to store and retrieve data.

  • Hadoop is open source software developed by Apache to store large amounts of data.
    • MapReduce is used for distributed processing of large data sets on clusters (master/slave). The framework takes care of scheduling tasks, monitoring them and re-executing any failed tasks. The primary objective of MapReduce is to split the input data set into independent chunks and send them to the cluster. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.
    • The Hadoop Distributed File System (HDFS): the primary objective of HDFS is to store data reliably even in the presence of failures. HDFS uses a master/slave architecture in which one device (the master) controls one or more other devices (the slaves). An HDFS cluster consists of one or more slaves, which actually hold the file data, and a master server, which manages the file system namespace and regulates access to files.

HBase documentation is provided in this link.
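The map, sort and reduce flow described above can be sketched in a few lines of plain Python. This is a single-process toy, not the Hadoop API: the function names are my own, and the "chunks" are just strings in a list rather than blocks distributed across a cluster.

```python
from collections import defaultdict


def map_phase(chunk: str) -> list:
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: list) -> dict:
    """Shuffle/sort: group the mapped values by key, in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key: str, values: list) -> tuple:
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks: list) -> dict:
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big", "data store"])
print(counts)  # {'big': 2, 'data': 2, 'store': 1}
```

In a real cluster each call to `map_phase` would run on the node that holds the chunk, and the framework would handle the grouping, scheduling and retries.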

Some Interesting Links:

http://www.youtube.com/watch?v=UkbULonrP2o

http://www.youtube.com/watch?v=1Qx3uvmjIyU

http://www-01.ibm.com/software/data/infosphere/hadoop/hbase/

1. Does Greenplum achieve MPP? Does Greenplum use the Hadoop file system?

A) Greenplum vs. MPP: Greenplum follows the MPP (Massively Parallel Processing) architecture. The architecture is discussed with a use case here.

B) Greenplum vs. Hadoop: Yes, Greenplum can use Hadoop internally. The use case is discussed in detail in this link.

2. What are the different storage methodologies? Compare them.

Data storage methodologies: There are three different methodologies:

  • Row based storage
  • Column based storage
    • The differences between row and column based databases are described in this link.
  • NoSQL: described in detail in this link. It has three different concepts:
    • Key Value
    • Document Store
    • Column Store
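To show the difference between the first two layouts, here is a small sketch that holds the same records row-wise and column-wise. The records and the `to_columns` helper are invented for illustration; real databases store these layouts on disk with compression and indexing that the sketch ignores.

```python
# Row-based layout: each record is stored together, which suits
# reading or updating one whole transaction at a time.
rows = [
    {"id": 1, "customer": "alice", "amount": 10.0},
    {"id": 2, "customer": "bob", "amount": 15.0},
    {"id": 3, "customer": "alice", "amount": 40.0},
]


def to_columns(records: list) -> dict:
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-based layout: each column is stored together, so an
# aggregate over one column does not touch the other columns.
columns = to_columns(rows)

total_from_rows = sum(record["amount"] for record in rows)
total_from_columns = sum(columns["amount"])
print(total_from_rows, total_from_columns)
```

Both totals are the same; the difference is how much data has to be read to compute them, which is why column stores tend to do well on analytical queries over large data.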

Some key points on the different storage methodologies are:

| Storage Methodology | Description | Common Use Case | Strength | Weakness | Size of DB | Key Players |
| --- | --- | --- | --- | --- | --- | --- |
| Row-based | Data structured or stored in rows. | Used in transaction processing and interactive transaction applications. | Robust, proven technology to capture intermediate transactions. | Scalability and query processing time for huge data. | (not given) | Sybase, Oracle, MySQL, DB2 |
| Column-based | Data is vertically partitioned and stored in columns. | Historical data analysis, data warehousing and business intelligence. | Faster queries (especially ad-hoc queries) on large data. | Not suitable for transactions; import/export speed and heavy computing resource utilization. | Several GB to 50 TB. | Infobright, Aster Data, Vertica, Sybase IQ, ParAccel |
| NoSQL: Key Value Store | Data stored in memory with some persistent backup. | Used as a cache for storing frequently requested data in applications. | Scalable, faster retrieval of data, supports unstructured and partially structured data. | All data should fit in memory; does not support complex queries. | Several GBs to several TBs. | Amazon S3, Memcached, Redis, Voldemort |
| NoSQL: Document Store | Persistent storage of unstructured or semi-structured data along with some SQL querying functionality. | Web applications or any application which needs better performance and scalability without defining columns in an RDBMS. | Persistent store with scalability and better query support than a key-value store. | Lack of sophisticated query capabilities. | Several TBs to PBs. | MongoDB, CouchDB, SimpleDB |
| NoSQL: Column Store | Very large data store that supports MapReduce. | Real-time data logging in finance and web analytics. | Very high throughput for big data, strong partitioning support, random read-write access. | Complex queries, availability of APIs, response time. | Several TBs to PBs. | HBase, Bigtable, Cassandra |