Upgrading Large Hadoop Cluster
Using R and Hadoop Bigdata
1. rmr2
2. rhdfs
3. HadoopStreaming
4. Rhipe
5. h2o
6. SparkR
The links to documentation and tutorials for each of them are below.
These packages build on Hadoop Streaming to run the work on the cluster instead of on a single R node. If you are new to Hadoop, read the basics of Hadoop Streaming at https://hadoop.apache.org/docs/stable2/hadoop-streaming/HadoopStreaming.html and a short tutorial on writing streaming jobs in Python: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
Package name and useful tutorials / readings:

rmr2
- Web links: https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md
- Book: R in a Nutshell, 2nd edition (Chapter 26), http://shop.oreilly.com/product/0636920022008.do

rhdfs
- Wiki: https://github.com/RevolutionAnalytics/RHadoop/wiki/user%3Erhdfs%3EHome

HadoopStreaming
- CRAN package documentation: https://cran.r-project.org/web/packages/HadoopStreaming/HadoopStreaming.pdf

Rhipe
- Web links: http://tessera.io/docs-RHIPE/#install-and-push

h2o
- Documentation on using h2o from R: http://h2o-release.s3.amazonaws.com/h2o/rel-slater/1/docs-website/h2o-docs/index.html#%E2%80%A6%20From%20R
- R h2o package documentation (~140 pages): http://h2o-release.s3.amazonaws.com/h2o/rel-slater/1/docs-website/h2o-r/h2o_package.pdf

SparkR
- API documentation: https://spark.apache.org/docs/latest/api/R/index.html
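To give a feel for what running R code through Hadoop Streaming looks like at the command line, here is a minimal sketch. The streaming jar path, the input/output paths, and the mapper.R / reducer.R scripts are illustrative assumptions; they vary with your distribution and job.

# mapper.R and reducer.R are your own Rscript programs that read stdin and write stdout
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -files mapper.R,reducer.R \
  -mapper "Rscript mapper.R" \
  -reducer "Rscript reducer.R" \
  -input /data/input \
  -output /data/output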
Install and configure the HttpFS Hadoop service
Clone the Hadoop code
git clone https://github.com/apache/hadoop.git
cd hadoop/hadoop-hdfs-project/hadoop-hdfs-httpfs/
mvn package -Pdist -DskipTests
cd target/
tar -czf hadoop-hdfs-httpfs-2.6.0.tar.gz hadoop-hdfs-httpfs-2.6.0
Go to the server where you want to set up the HttpFS service.
Extract the tar file:
tar -xvf hadoop-hdfs-httpfs-2.6.0.tar.gz
cd hadoop-hdfs-httpfs-2.6.0
In your cluster manager (e.g. Ambari, Cloudera Manager, etc.),
make a change to core-site.xml so that the httpfs user can act as a proxy user:
<property>
<name>hadoop.proxyuser.#HTTPFSUSER#.hosts</name>
<value>httpfs-host.foo.com</value>
</property>
<property>
<name>hadoop.proxyuser.#HTTPFSUSER#.groups</name>
<value>*</value>
</property>
Change #HTTPFSUSER# to the exact user who will be starting the HttpFS service.
Restart cluster services
Copy core-site.xml and hdfs-site.xml from your cluster to /etc/hadoop directory
Start Httpfs
./sbin/httpfs.sh start
Check
curl -i "http://<HTTPFSHOSTNAME>:14000?user.name=YourHadoopusername&op=homedir"
Output should be
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
{"homeDir":"http:\/\/<HTTPFS_HOST>:14000\/user\/YourHadoopusername"}
Operating system level tuning for Hadoop
1)
Transparent huge pages (THP) should be disabled:
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
Add the above to /etc/rc.local so it survives reboots.
2)
Mount the data disks with the noatime option to speed up reads.
3)
Raise the open file limit (ulimit -n) to a high number such as 32000.
4)
Turn off caching on the disk controller.
5)
Increase the socket listen backlog: net.core.somaxconn=1024
6)
Disable SELinux: SELINUX=disabled
7)
Set umask 022
8)
In sysctl.conf:
vm.swappiness=0
vm.overcommit_memory = 1
vm.overcommit_ratio = 100
9)
Disable IPv6.
10)
Make sure host DNS is set up properly.
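A sketch of how a few of these settings can be applied from the shell on a RHEL-style node. The mount point /data01 and the hdfs user are illustrative; adapt them to your layout and make the changes persistent in the proper config files.

sysctl -w net.core.somaxconn=1024
sysctl -w vm.swappiness=0
sysctl -w vm.overcommit_memory=1
sysctl -w vm.overcommit_ratio=100
echo "hdfs - nofile 32000" >> /etc/security/limits.conf   # open file limit
mount -o remount,noatime /data01                          # repeat for every Hadoop data disk
setenforce 0    # SELinux off until reboot; also set SELINUX=disabled in /etc/selinux/config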
References
http://www.slideshare.net/vgogate/hadoop-configuration-performance-tuning
Hadoop Yarn tuning calculator
I just started an Excel sheet where we can plug in a node's number of cores, disks, and RAM,
and it will give us the values for various YARN properties.
It is based on Hortonworks' suggestions.
You can see it at
https://goo.gl/gzSz27
Hortonworks also has a Python script:
https://github.com/hortonworks/hdp-configuration-utils
But keep in mind that the Python script does not take into consideration things
which are already running on the machine.
Those we can configure manually in the Excel sheet.
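For reference, the Hortonworks-style manual calculation the sheet is based on can be sketched in a few lines of shell. The input numbers below are illustrative; the minimum container size and the memory you reserve for the OS and other services depend on your nodes.

#!/bin/bash
CORES=14            # cores per datanode
DISKS=8             # data disks per datanode
AVAIL_MB=81920      # RAM left for YARN after OS / Hadoop / monitoring overhead
MIN_CONTAINER_MB=2048

BY_CORE=$(( 2 * CORES ))
BY_DISK=$(( 18 * DISKS / 10 ))            # ~1.8 * disks
BY_MEM=$(( AVAIL_MB / MIN_CONTAINER_MB ))
CONTAINERS=$(printf '%s\n' "$BY_CORE" "$BY_DISK" "$BY_MEM" | sort -n | head -1)
CONTAINER_MB=$(( AVAIL_MB / CONTAINERS ))

echo "yarn.nodemanager.resource.memory-mb  = $(( CONTAINERS * CONTAINER_MB ))"
echo "yarn.scheduler.minimum-allocation-mb = ${CONTAINER_MB}"
echo "mapreduce.map.memory.mb              = ${CONTAINER_MB}"
echo "mapreduce.reduce.memory.mb           = $(( 2 * CONTAINER_MB ))"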
Why is my Hive / Sqoop job failing?
Find out a few basics about the cluster configuration from your administrator. A sample conversation could be:
Question: How many nodes does the cluster have, and what is their configuration?
Answer:
Each node has 120 GB RAM. Out of that, the memory we can ask for our jobs is about 80 GB.
Each datanode has 14 CPU cores (we have 5 datanodes); right now the maximum we can ask for processing from each datanode is 8 cores,
leaving the rest for other processes like the OS, Hadoop daemons, and monitoring services.
When we run any job in the MapReduce world, a task gets a minimum of 2 GB RAM, and the maximum any task can ask for is 80 GB (see the capacity above).
Given that we are running big jobs for one-off loads, tell the system to give your job more RAM (request RAM in increments of 1024 MB).
Besides RAM you can also specify how many CPU cores you want.
The maximum cores a given node can provide for processing is 8.
This can be controlled via the following parameters:
Hive jobs
mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
mapreduce.map.java.opts
mapreduce.reduce.java.opts
mapreduce.map.cpu.vcores
mapreduce.reduce.cpu.vcores
Typically, if your job fails while inserting into a Hive table, check whether you need to tune any memory parameter; Hive insert jobs run as reduce tasks.
Since you are inserting a large amount of data in one go, you may hit memory overruns.
Always look at the logs; the reason the job is failing is always mentioned there.
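As an illustration (the table names and the exact numbers are placeholders), these parameters can be overridden per session when running a heavy insert; the -Xmx values are kept to roughly 80% of the container sizes:

hive -e "
SET mapreduce.map.memory.mb=4096;
SET mapreduce.map.java.opts=-Xmx3276m;
SET mapreduce.reduce.memory.mb=8192;
SET mapreduce.reduce.java.opts=-Xmx6553m;
INSERT OVERWRITE TABLE target_table SELECT * FROM staging_table;
"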
Sqoop jobs
Sqoop jobs spawn only map tasks.
So if a Sqoop job is not moving along, the usual indicator is a memory issue on our side.
Just add the following parameters to the command:
-Dmapreduce.map.memory.mb=5120 -Dmapreduce.map.speculative=false
Tune the 5120 value above based on your needs.
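A hedged example of where those -D options go in a Sqoop command (the JDBC URL, credentials, table, and paths are illustrative; the generic -D options must come before the tool-specific arguments):

sqoop import \
  -Dmapreduce.map.memory.mb=5120 \
  -Dmapreduce.map.speculative=false \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table transactions \
  --target-dir /data/raw/transactions \
  --num-mappers 8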
Where to see logs and job status
You can see the status of your job and its logs in the Resource Manager UI:
http://resource:8088/cluster
You can also log in to Ambari to see what value has been set as the default for a given property:
http://ambari:8080
Ask your administrator for a username and password with read-only access.
Find out what the current default values are, for example:
mapreduce.map.java.opts=-Xmx5012m
mapreduce.reduce.java.opts=-Xmx6144m
mapreduce.map.memory.mb=4096
mapreduce.reduce.memory.mb=8192
What if my job is not even accepted by the cluster? :)
You are asking for resources which the cluster doesn't have, i.e. the request crosses the maximum limit of the cluster. So check what your job is really asking for versus what the cluster can provide.
Why is my job being killed?
If your job crosses the resource limit it originally asked the ResourceManager for, YARN will kill your job.
You will see something like below in logs
Killing container....
Remember, Google is your friend
The moment you see your job has failed, look at the error in the logs above and search for it on Google; try to find which parameters people have suggested changing in the job.
Resources
https://altiscale.zendesk.com/hc/en-us/articles/200801519-Configuring-Memory-for-Mappers-and-Reducers-in-Hadoop-2
Hadoop conf files in Pivotal Hadoop
Design and Architecture considerations for storing time series data
Know your access patterns in advance
- Are we going to analyze a full day of data, or data for just one hour? Having advance notes on the use cases with which the data will be consumed is highly recommended.
- The granularity of information required by the client application helps decide the underlying data model for storing the information.
- Frequency with which data is generated
- Identify the speed with which data is being produced by the source system. Do you produce multiple data points every second?
Though we might need to persist all the time series data, more often than not we don't need to store each data point as a separate record in the database.
Most time series problems are similar in nature, and the predominant issues come when we need to scale the system. Making the system evolve with a changing schema is another dimension which adds to the complexity. All these problems show similar patterns, with variations only in the data model.
If we can define a time window and store all the readings for that period of time as an array, then
we can significantly cut the number of actual records persisted in the database, improving the performance of the system.
For example
Stock tick information is generated once a second for each stock, i.e. roughly 86K ticks per stock per day.
If we store that many records as separate rows, the time complexity to access this information would be huge, so we can instead group 5 minutes, 1 hour, or one day worth of readings into a single vector record.
The benefit of storing information in larger chunks is obvious: you do far fewer lookups into the
NoSQL store to fetch the information for a specific period of time. The other point to remember is that
if your window size is very small you will be doing a lot of read/write operations, and if it is
too big then durability becomes a concern, since you can lose information in the event of a system failure.
So you need to balance both forces.
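A quick back-of-the-envelope check of the effect of the window size, using the one-tick-per-second example above:

echo $(( 24 * 60 * 60 ))        # 86400 rows per stock per day with one record per tick
echo $(( 86400 / (5 * 60) ))    # 288 rows per day with 5-minute buckets stored as arrays
echo $(( 86400 / 3600 ))        # 24 rows per day with 1-hour buckets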
There is no single size that fits all; each time series problem is different. Fine-tune the system based on the requirements and the access patterns.
If the access pattern changes in the future, you might have to re-index or re-calculate the array size to optimize your queries.
So each time-series application is very custom-made: you can apply the best practices, but you cannot just import the data modelling templates from a different time-series problem.
Links
http://www.infoq.com/articles/nosql-time-series-data-management
https://blog.twitter.com/ko/node/4403
Migrating Large Hadoop Cluster
- Testing of sample production jobs on the new cluster
- Total automation and setup of the new cluster via the Cloudera Manager API (see the sketch after this list)
- Code migration of over 300 production Oozie jobs which pump data into the cluster and produce scores
- Trickle-feed resumption of the jobs in batches
- A roster of shifts in which people will work, to complete the activity within one weekend
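As a hedged illustration of the automation piece, the Cloudera Manager REST API can be driven with plain curl; the host, credentials, API version, and cluster name below are assumptions to adapt to your environment:

# List clusters, then the services of one cluster
curl -u admin:admin "http://cm-host:7180/api/v10/clusters"
curl -u admin:admin "http://cm-host:7180/api/v10/clusters/cluster1/services"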
Architecture pattern for real time processing in Bigdata
The flow is as follows:
1) Real-time data is ingested into the system via stream ingestion tools like Flume, Kafka, or Samza. The choice among these depends on factors such as whether we need guaranteed event delivery and whether we care about the ordering of events; that is a topic for another post.
2) The required processing is applied to the incoming data via the Spark Streaming API, making it ready to be consumed by third-party apps.
3) Using a REST gateway job server for Spark, third-party apps trigger further Spark jobs to process the data and return results back to the applications (see the sketch below).
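A sketch of step 3, assuming something like the open-source spark-jobserver as the REST gateway (host, port, jar, and class names are illustrative):

# Upload the application jar once, then trigger jobs over REST
curl --data-binary @scoring-job.jar http://jobserver-host:8090/jars/scoring
curl -d "input.path=/data/stream/latest" \
  "http://jobserver-host:8090/jobs?appName=scoring&classPath=com.example.ScoringJob"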
Does not contain a valid host:port authority: logicaljt
Making Oozie HBase actions work with a Kerberos-enabled cluster
Add an HBase credential to your workflow definition:
<credentials>
<credential name='hbaseauth' type='hbase'>
</credential>
</credentials>
Within any action that needs HBase access, reference the credential:
<action name="process" cred="hbaseauth">
Also add details about hbase-site.xml
<job-xml>${hbaseSite}</job-xml>
<file>${hbaseSite}#hbase-site.xml</file>
Complete example
<action name="process" cred="hbaseauth">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>${hbaseSite}</job-xml>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<main-class>${process_classname}</main-class>
<file>${hbaseSite}#hbase-site.xml</file>
<capture-output/>
</java>
<ok to="success"/>
<error to="failed"/>
</action>
Hadoop 2.3 Centralized Cache feature comparison to Spark RDD
- Support for a heterogeneous storage hierarchy in HDFS (HDFS-2832)
- In-memory cache for data resident in HDFS via datanodes (HDFS-4949); the cacheadmin commands below show the user-facing side
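A minimal sketch of the hdfs cacheadmin command introduced with the centralized cache feature (pool and path names are illustrative):

hdfs cacheadmin -addPool hot-data
hdfs cacheadmin -addDirective -path /data/lookup_tables -pool hot-data
hdfs cacheadmin -listDirectives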
Teradata meets Hadoop SQL-H
Where is hadoop examples jar
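The location varies by distribution and version, so the paths below are only illustrative; find the jar and then run one of the bundled examples:

find / -name "hadoop*examples*.jar" 2>/dev/null
# e.g. /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar on many packaged installs
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100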
Build and compile Hadoop from source
apt-get -y install maven build-essential protobuf-compiler autoconf automake \
  libtool cmake zlib1g-dev pkg-config libssl-dev
Checkout code from repo
svn co http://svn.apache.org/repos/asf/hadoop/common/trunk/ hadoop
cd hadoop/
mvn compile
This gave me
[ERROR] Could not find goal 'protoc' in plugin org.apache.hadoop:hadoop-maven-plugins:3.0.0-SNAPSHOT among available goals -> [Help 1]
https://issues.apache.org/jira/browse/HADOOP-9383
Then, to debug, I re-ran with
mvn compile -X
I found in the log that I did not have version 2.5 of protobuf, which is required to build Hadoop.
Follow the steps here to install protobuf on your system
http://jugnu-life.blogspot.com/2013/09/install-protobuf-25-on-ubuntu.html
Check the installed version of protobuf; the requirement is also documented in
https://svn.apache.org/repos/asf/hadoop/common/trunk/BUILDING.txt
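A quick way to confirm which protobuf compiler the build will pick up:

protoc --version
# should print: libprotoc 2.5.0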
Okay, after fixing the above (if required), here is the final build command:
mvn package -Pdist -DskipTests -Dtar
Further reading
http://wiki.apache.org/hadoop/HowToContribute
Hadoop get filesize and directory from ls command
You can change the awk field numbers to extract different columns from the listing.
Example
hadoop fs -ls /my/folder | awk '{print $6, $8}' > only_date_directory.txt
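For example, to get the file size and path instead ($5 is the size column in hadoop fs -ls output):

hadoop fs -ls /my/folder | awk '{print $5, $8}' > only_size_directory.txt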
Hive Shark Impala Comparison
This post is nothing but a reproduction of the benchmark work done at AMPLab. If you want the latest and most detailed results, I suggest you go there.
The big data world is beautiful: research in this field is moving at such a fast pace that big data is no longer synonymous with long-running queries; it is becoming live data everywhere.
The work compares the computation time of Redshift, Hive, Impala, and Shark across different types of queries.
The performance of Shark in memory has been consistent across all four query types. It would be interesting to see the comparison when Hive 0.11 is used, as it adds several performance improvements driven by work at Hortonworks.
Open Stack meets Hadoop
There have been some good announcements this week.
OpenStack allows us to create a cloud inside the corporate walls, using the solid foundation of standards set up worldwide through the collaboration of many good companies.
Now Mirantis has announced a project, Savanna, which allows you to provision Hadoop-based clusters easily within an organization. It uses Ambari as its background engine to do all this magic.
Quoting some of the information and architecture from the official website.
Read more at
Summary of above components
The Savanna product communicates with the following OpenStack components:
- Horizon - provides GUI with ability to use all of Savanna’s features;
- Keystone - authenticates users and provides security token that is used to work with the OpenStack, hence limiting user abilities in Savanna to his OpenStack privileges;
- Nova - is used to provision VMs for Hadoop Cluster;
- Glance - Hadoop VM images are stored there, each image containing an installed OS and Hadoop; the pre-installed Hadoop should give us good handicap on node start-up;
- Swift - can be used as a storage for data that will be processed by Hadoop jobs.
The main advantages include
- You can choose which Hadoop version to run
- Run everything against existing infrastructure
- Fast provisioning of Hadoop clusters on OpenStack for dev and QA
- Utilization of unused compute power from a general-purpose OpenStack IaaS cloud
Read the link above for awesome details.
Thanks for reading.
Please leave your comments below, or connect with me via LinkedIn.
Cloudera Certified Specialist in Apache HBase CCSHB exam topics and syllabus
Test Name: Cloudera Certified Specialist in Apache HBase
Current Version: CCB-400
Number of Questions: 45
Time Limit: 90 minutes
Passing Score: 69%
Languages: English, Japanese
Core HBase Concepts
Recognize the fundamental characteristics of Apache HBase and its role in a big data ecosystem. Identify differences between Apache HBase and a traditional RDBMS. Describe the relationship between Apache HBase and HDFS. Given a scenario, identify application characteristics that make the scenario an appropriate application for Apache HBase.
Data Model
Describe how an Apache HBase table is physically stored on disk. Identify the differences between a Column Family and a Column Qualifier. Given a data loading scenario, identify how Apache HBase will version the rows. Describe how Apache HBase cells store data. Detail what happens to data when it is deleted.
Architecture
Identify the major components of an Apache HBase cluster. Recognize how regions work and their benefits under various scenarios. Describe how a client finds a row in an HBase table. Understand the function and purpose of minor and major compactions. Given a region server crash scenario, describe how Apache HBase fails over to another region server. Describe RegionServer splits.
Schema Design
Describe the factors to be considered with creating Column Families. Given an access pattern, define the row keys for optimal read performance. Given an access pattern, define the row keys for locality.
API
Describe the functions and purpose of the HBaseAdmin class. Given a table and rowkey, use the get() operation to return specific versions of that row. Describe the behavior of the checkAndPut() method.
Administration
Recognize how to create, describe, and access data in tables from the shell. Describe how to bulk load data into Apache HBase. Recognize the benefits of managed region splits.
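As a quick refresher for the shell part of this objective, here is a minimal session (the table name, column family, and values are illustrative):

hbase shell <<'EOF'
create 'weblog', 'cf'
put 'weblog', 'row1', 'cf:url', '/index.html'
get 'weblog', 'row1'
scan 'weblog', {LIMIT => 10}
describe 'weblog'
EOF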
Sample Questions
Question 1
You want to store the comments from a blog post in HBase. Your data consists of the following:
a. the blog post id
b. the name of the comment author
c. the body of the comment
d. the timestamp for each comment
Which rowkey would you use if you wanted to retrieve the comments from a scan with the most recent first?
A. <(Long)timestamp>
B. <blog_post_id><Long.MAX_VALUE – (Long)timestamp>
C. <timestamp><Long.MAX_VALUE>
D. <Long.MAX_VALUE><timestamp>
Question 2
Your application needs to retrieve 200 to 300 non-sequential rows from a table with one billion rows. You know the rowkey of each of the rows you need to retrieve. Which does your application need to implement?
A. Scan without range
B. Scan with start and stop row
C. HTable.get(Get get)
D. HTable.get(List<Get> gets)
Question 3
You perform a check and put operation from within an HBase application using the following:
table.checkAndPut(Bytes.toBytes("rowkey"),
Bytes.toBytes("colfam"),
Bytes.toBytes("qualifier"),
Bytes.toBytes("barvalue"), newrow);
Which describes this check and put operation?
A. Check if rowkey/colfam/qualifier exists and the cell value "barvalue" is equal to newrow. Then return “true”.
B. Check if rowkey/colfam/qualifier and the cell value "barvalue" is NOT equal to newrow. Then return “true”.
C. Check if rowkey/colfam/qualifier and has the cell value "barvalue". If so, put the values in newrow and return “false”.
D. Check if rowkey/colfam/qualifier and has the cell value "barvalue". If so, put the values in newrow and return “true”.
Question 4
What is the advantage of the using the bulk load API over doing individual Puts for bulk insert operations?
A. Writes bypass the HLog/MemStore reducing load on the RegionServer.
B. Users doing bulk Writes may disable writing to the WAL which results in possible data loss.
C. HFiles created by the bulk load API are guaranteed to be co-located with the RegionServer hosting the region.
D. HFiles written out via the bulk load API are more space efficient than those written out of RegionServers.
Question 5
You have a “WebLog” table in HBase. The Row Keys are the IP Addresses. You want to retrieve all entries that have an IP Address of 75.67.12.146. The shell command you would use is:
A. get 'WebLog', '75.67.21.146'
B. scan 'WebLog', '75.67.21.146'
C. get 'WebLog', {FILTER => '75.67.21.146'}
D. scan 'WebLog', {COLFAM => 'IP', FILTER => '75.67.12.146'}
Answers
Question 1: B
Question 2: D
Question 3: D
Question 4: A
Question 5: A
