OpenStack meets Hadoop

There have been some good announcements this week.

OpenStack lets us build a cloud inside the corporate walls on a solid foundation of standards, set up worldwide through the collaboration of many good companies.

Now Mirantis has announced Savanna, a project which lets you provision Hadoop-based clusters easily within your organization. It uses Ambari as its background engine to do all this magic.

Below I quote some of the information and the architecture from the official website.

Read more at

http://savanna.mirantis.com/

http://www.mirantis.com/blog/project-savanna-moves-ahead-red-hat-and-hortonworks-commit-to-work-on-hadoop-as-a-service-on-openstack/

_images/openstack-interop.png

Summary of the components above

The Savanna product communicates with the following OpenStack components:

  • Horizon - provides GUI with ability to use all of Savanna’s features;
  • Keystone - authenticates users and provides security token that is used to work with the OpenStack, hence limiting user abilities in Savanna to his OpenStack privileges;
  • Nova - is used to provision VMs for Hadoop Cluster;
  • Glance - Hadoop VM images are stored there, each image containing an installed OS and Hadoop; the pre-installed Hadoop should give us good handicap on node start-up;
  • Swift - can be used as a storage for data that will be processed by Hadoop jobs.

The main advantages include:

  • you can choose which Hadoop version to run;
  • everything runs against your existing infrastructure;
  • fast provisioning of Hadoop clusters on OpenStack for Dev and QA;
  • utilization of unused compute power from a general-purpose OpenStack IaaS cloud.

Read the links above for the full details.

Thanks for reading.

Please leave your comments below, or connect with me via LinkedIn.

Add LaTeX and MathML to Blogger

To insert math formulas in Blogger, you can use the MathJax library.

Add the following to your template just before the closing head tag:
 
<script
src='http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'
type='text/javascript'>
</script>

</head>

 


Test by copying the following into a post and saving the page:

 

When \(a \ne 0\), there are two solutions to \(ax^2 + bx + c = 0\) and they are
$$x = {-b \pm \sqrt{b^2-4ac} \over 2a}.$$

 

Example 2

$$\begin{pmatrix}
2 & 3 \\
5 & 4
\end{pmatrix}$$

 

Tip: writing LaTeX by hand can be confusing, so go to this website, write your equation, and paste the generated code into your page.

 



 

Eclipse Python Development Setup

To write Python code in Eclipse, you can download the PyDev plugin from

http://pydev.org/download.html

Eclipse Plugin

Add the following as the update site URL in Eclipse:

http://pydev.org/updates

You can also get zip setup files from

http://sourceforge.net/projects/pydev/files/pydev/

 

Configuration and Testing

 

After installing, tell Eclipse where Python is installed so that it can be used as the interpreter.

Go to: Window > Preferences > PyDev > Interpreter - (Python/Jython/IronPython).

For other OS see http://pydev.org/manual_101_interpreter.html
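
A quick way to test the configuration is to run a tiny script from inside Eclipse and check which interpreter it reports. The following is only a minimal sketch; the function name interpreter_summary is my own and not something PyDev provides.

```python
import platform
import sys


def interpreter_summary():
    """Return a one-line description of the interpreter running this script."""
    return "%s %s at %s" % (
        platform.python_implementation(),  # e.g. CPython, Jython, IronPython
        platform.python_version(),         # version string of the interpreter
        sys.executable,                    # path of the interpreter binary
    )


if __name__ == "__main__":
    print(interpreter_summary())
```

Save it in a PyDev project and run it with Run As > Python Run. If the path printed is not the Python installation you expected, revisit the interpreter preferences above.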

 

The following tutorial gives a very good quick intro to Python and Eclipse:

http://www.vogella.com/articles/Python/article.html

I also suggest you read the quick Python intro post I wrote.

 

Also, if you come from a Java background, you may want to check:

http://python4java.necaiseweb.org/Main/TableOfContents

http://www.daimi.au.dk/~chili/CSS/pythonForJavaProgrammers.htm

http://pythonconquerstheuniverse.wordpress.com/2009/10/03/python-java-a-side-by-side-comparison/

 

Python and Maven

http://blog.berczuk.com/2009/12/continuous-integration-of-python-code.html

 

// TODO Python Maven setup and Integration for project layouts

 

A good step-by-step Python tutorial website

http://learnpythonthehardway.org/book/intro.html

Official website for documentation

http://docs.python.org/2/index.html

Unit Testing

http://pyunit.sourceforge.net/
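
PyUnit ships with Python as the standard unittest module, so nothing extra needs to be installed. Here is a minimal sketch of what a test case looks like; word_count is a made-up function used purely for illustration.

```python
import unittest


def word_count(text):
    """Count whitespace-separated words (hypothetical function under test)."""
    return len(text.split())


class WordCountTest(unittest.TestCase):
    def test_empty_string_has_no_words(self):
        self.assertEqual(word_count(""), 0)

    def test_counts_words_separated_by_spaces(self):
        self.assertEqual(word_count("hello big data"), 3)

    def test_ignores_repeated_whitespace(self):
        self.assertEqual(word_count("  hello   world \n"), 2)


def run_tests():
    """Load the test case above and run it with a plain text runner."""
    suite = unittest.TestLoader().loadTestsFromTestCase(WordCountTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

Each method whose name starts with test_ is picked up automatically, and the result object returned by the runner tells you how many tests ran and whether they all passed.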

Life after Google Reader

Okay, Google Reader is still there for a few more days before it vanishes into a black hole.

http://googlereader.blogspot.com.au/2013/03/powering-down-google-reader.html

I tried various online readers after that and have now ended up using mostly Feedly and Netvibes.

It is a pretty good tool which provides various functions, such as adding RSS feeds and tracking what I have read.

Besides this, it can also fetch news and other items for me on any topic I am interested in researching.

I would recommend you try it once.

Feedly has better usability than Netvibes, but Feedly is slow in getting data for new feeds.

What did you start using?

Cloudera Certified Specialist in Apache HBase (CCSHB) exam topics and syllabus

Test Name: Cloudera Certified Specialist in Apache HBase
Current Version: CCB-400
Number of Questions: 45
Time Limit: 90 minutes
Passing Score: 69%
Languages: English, Japanese

 

Core HBase Concepts
Recognize the fundamental characteristics of Apache HBase and its role in a big data ecosystem. Identify differences between Apache HBase and a traditional RDBMS. Describe the relationship between Apache HBase and HDFS. Given a scenario, identify application characteristics that make the scenario an appropriate application for Apache HBase.

Data Model
Describe how an Apache HBase table is physically stored on disk. Identify the differences between a Column Family and a Column Qualifier. Given a data loading scenario, identify how Apache HBase will version the rows. Describe how Apache HBase cells store data. Detail what happens to data when it is deleted.

Architecture
Identify the major components of an Apache HBase cluster. Recognize how regions work and their benefits under various scenarios. Describe how a client finds a row in an HBase table. Understand the function and purpose of minor and major compactions. Given a region server crash scenario, describe how Apache HBase fails over to another region server. Describe RegionServer splits.

Schema Design
Describe the factors to be considered with creating Column Families. Given an access pattern, define the row keys for optimal read performance. Given an access pattern, define the row keys for locality.

API
Describe the functions and purpose of the HBaseAdmin class. Given a table and rowkey, use the get() operation to return specific versions of that row. Describe the behavior of the checkAndPut() method.
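
To picture the compare-then-write behaviour of checkAndPut(), here is a toy in-memory model. It is not the HBase client API: ToyTable and check_and_put are names I made up, and a dictionary guarded by a lock stands in for the region server's atomic row operation.

```python
import threading


class ToyTable(object):
    """In-memory stand-in for a table; NOT the real HBase client."""

    def __init__(self):
        self._cells = {}  # maps (row, family, qualifier) to a value
        self._lock = threading.Lock()

    def put(self, row, family, qualifier, value):
        with self._lock:
            self._cells[(row, family, qualifier)] = value

    def get(self, row, family, qualifier):
        with self._lock:
            return self._cells.get((row, family, qualifier))

    def check_and_put(self, row, family, qualifier, expected, new_cells):
        """Apply new_cells to the row only if the checked cell equals expected.

        Returns True when the put was applied and False when it was skipped.
        The check and the write happen under one lock, so no other writer
        can slip in between them.
        """
        with self._lock:
            if self._cells.get((row, family, qualifier)) != expected:
                return False
            for (fam, qual), value in new_cells.items():
                self._cells[(row, fam, qual)] = value
            return True


table = ToyTable()
table.put("rowkey", "colfam", "qualifier", "barvalue")
change = {("colfam", "qualifier"): "newvalue"}

# The cell holds "barvalue", so the check passes and the put is applied.
first = table.check_and_put("rowkey", "colfam", "qualifier", "barvalue", change)
# The cell no longer holds "barvalue", so nothing is written this time.
second = table.check_and_put("rowkey", "colfam", "qualifier", "barvalue", change)
```

The point to take away is that the return value reports whether the put happened, which is what makes the operation usable as an optimistic lock.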

Administration
Recognize how to create, describe, and access data in tables from the shell. Describe how to bulk load data into Apache HBase. Recognize the benefits of managed region splits.

 

Sample Questions

Question 1

You want to store the comments from a blog post in HBase. Your data consists of the following:

a. the blog post id
b. the name of the comment author
c. the body of the comment
d. the timestamp for each comment

Which rowkey would you use if you wanted to retrieve the comments from a scan with the most recent first?

A. <(Long)timestamp>
B. <blog_post_id><Long.MAX_VALUE – (Long)timestamp>
C. <timestamp><Long.MAX_VALUE>
D. <Long.MAX_VALUE><timestamp>

Question 2

Your application needs to retrieve 200 to 300 non-sequential rows from a table with one billion rows. You know the rowkey of each of the rows you need to retrieve. Which does your application need to implement?

A. Scan without range
B. Scan with start and stop row
C. HTable.get(Get get)
D. HTable.get(List<Get> gets)

Question 3

You perform a check and put operation from within an HBase application using the following:

table.checkAndPut(Bytes.toBytes("rowkey"),
Bytes.toBytes("colfam"),
Bytes.toBytes("qualifier"),
Bytes.toBytes("barvalue"), newrow);

Which describes this check and put operation?

A. Check if rowkey/colfam/qualifier exists and the cell value "barvalue" is equal to newrow. Then return “true”.
B. Check if rowkey/colfam/qualifier exists and the cell value "barvalue" is NOT equal to newrow. Then return “true”.
C. Check if rowkey/colfam/qualifier exists and has the cell value "barvalue". If so, put the values in newrow and return “false”.
D. Check if rowkey/colfam/qualifier exists and has the cell value "barvalue". If so, put the values in newrow and return “true”.

Question 4

What is the advantage of using the bulk load API over doing individual Puts for bulk insert operations?

A. Writes bypass the HLog/MemStore reducing load on the RegionServer.
B. Users doing bulk Writes may disable writing to the WAL which results in possible data loss.
C. HFiles created by the bulk load API are guaranteed to be co-located with the RegionServer hosting the region.
D. HFiles written out via the bulk load API are more space efficient than those written out of RegionServers.

Question 5

You have a “WebLog” table in HBase. The Row Keys are the IP Addresses. You want to retrieve all entries that have an IP Address of 75.67.12.146. The shell command you would use is:

A. get 'WebLog', '75.67.12.146'
B. scan 'WebLog', '75.67.12.146'
C. get 'WebLog', {FILTER => '75.67.12.146'}
D. scan 'WebLog', {COLFAM => 'IP', FILTER => '75.67.12.146'}

Answers

Question 1: B
Question 2: D
Question 3: D
Question 4: A
Question 5: A
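
Why is B the answer to Question 1? HBase sorts rows by the raw bytes of the row key, so subtracting the timestamp from Long.MAX_VALUE makes newer comments sort first within each blog post. The sketch below imitates that ordering in plain Python. It assumes fixed-width post ids and millisecond timestamps, and the helper names comment_rowkey and scan_order are mine, not HBase's.

```python
import struct

LONG_MAX = 2 ** 63 - 1  # same value as Java's Long.MAX_VALUE


def comment_rowkey(blog_post_id, timestamp_ms):
    """Build <blog_post_id><Long.MAX_VALUE - timestamp> as bytes.

    The reversed timestamp is packed as an 8-byte big-endian long so that
    byte-wise ordering matches numeric ordering for non-negative values.
    """
    reversed_ts = LONG_MAX - timestamp_ms
    return blog_post_id.encode("utf-8") + struct.pack(">q", reversed_ts)


def scan_order(comments):
    """Return timestamps in the order a scan would meet them (sorted by key)."""
    ordered = sorted(comments, key=lambda c: comment_rowkey(c[0], c[1]))
    return [timestamp for _, timestamp in ordered]


# Three comments on the same post, inserted in no particular order.
comments = [("post0001", 1000), ("post0001", 3000), ("post0001", 2000)]
newest_first = scan_order(comments)
```

Here newest_first comes out as 3000, 2000, 1000. Keeping the blog post id as the prefix also means all comments for one post sit next to each other, so a single scan with a start and stop row reads them in one pass.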

Install Boost library

Boost provides free peer-reviewed portable C++ source libraries.

Download the latest version from


Unzip the Boost archive in some directory, for example:

unzip boost_1_53_0.zip

Move it to /usr/local so that everyone can use it

sudo mv boost_1_53_0 /usr/local/

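To use Boost from your code, include whichever headers you need:

#include <boost/some.hpp>

All the headers are in the boost subdirectory, so add /usr/local/boost_1_53_0 to your compiler's include path (for example with -I/usr/local/boost_1_53_0).

Most Boost libraries are header-only, so no separate compilation is required to use them.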

If you would rather install via RPM, you can download the packages from

http://rpmfind.net/linux/rpm2html/search.php?query=boost

Also search for the boost-devel RPM, as you need it for the header files.