Jugnu Life :-): April 2015

Why is my Hive or Sqoop job failing?

First, ask your administrator for a few basics about the cluster configuration.

A sample conversation might go like this.

How many nodes does the cluster have, and how are they configured?

The answer might be:

Each node has 120 GB of RAM. Of that, about 80 GB can be requested for our jobs.
We have 5 datanodes right now, each with 14 CPU cores. The maximum we can request for processing from each datanode is 8 cores.

The rest is left for other processes such as the OS, Hadoop daemons and monitoring services.

When you run a job in the MapReduce world, each task gets a minimum of 2 GB of RAM, and the maximum any task can ask for is 80 GB (see the capacity above).

Since we are running big jobs for one-off loads, tell the system to give your job more RAM. RAM is allocated in increments of 1024 MB.

Besides RAM, you can also ask for the number of CPU cores you want.

The maximum number of cores a given node can provide for processing is 8.

This can be controlled via the following parameters.

Hive jobs

mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
mapreduce.map.java.opts
mapreduce.reduce.java.opts
mapreduce.map.cpu.vcores
mapreduce.reduce.cpu.vcores

Typically, if your job fails while inserting into a Hive table, check whether you need to tune any memory parameter. Hive insert jobs are reduce jobs.

Since you are inserting a large amount of data in one go, you can run into memory overruns.

Always check the logs; they state why the job is failing.
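
As a rough sketch, the reduce-side parameters above can be generated as Hive "set" statements. The helper name hive_reduce_settings is made up for this example, and sizing the JVM heap at 80% of the container is a common rule of thumb assumed here, not a value taken from this cluster.

```shell
# Print Hive "set" statements for a given reduce container size in MB.
# Assumption: the JVM heap is sized at 80% of the container so that
# non-heap memory still fits inside the YARN container limit.
hive_reduce_settings() {
  local container_mb="$1"
  local heap_mb=$(( container_mb * 80 / 100 ))
  printf 'set mapreduce.reduce.memory.mb=%s;\n' "$container_mb"
  printf 'set mapreduce.reduce.java.opts=-Xmx%sm;\n' "$heap_mb"
}

hive_reduce_settings 10240
```

The two printed lines can be pasted at the top of a Hive script, before the insert statement, so they apply only to that session.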

Sqoop jobs

Sqoop jobs spawn only map tasks.

So if a Sqoop job is not making progress, that usually indicates a memory issue on our side.

In that case, add the following parameters to the Sqoop command

-Dmapreduce.map.memory.mb=5120 -Dmapreduce.map.speculative=false

Tune the 5120 value above based on need.
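
To show where those options go, here is a small sketch that prints (but does not run) a full Sqoop import command. The JDBC URL, table name and target directory are placeholders invented for the example. The one thing to keep is the ordering: generic -D options have to come right after the tool name and before tool-specific arguments such as --connect.

```shell
# Print a Sqoop import command with the given map container size in MB.
# The connection string, table and target directory are hypothetical.
build_sqoop_import() {
  local map_mb="$1"
  echo "sqoop import" \
    "-Dmapreduce.map.memory.mb=${map_mb}" \
    "-Dmapreduce.map.speculative=false" \
    "--connect jdbc:mysql://dbhost/sales" \
    "--table orders" \
    "--target-dir /tmp/orders"
}

build_sqoop_import 5120
```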

Where to see logs and job status

You can see the status of your job and its logs in the Resource Manager

http://resource:8088/cluster

You can also log in to Ambari to see the default value set for a given property

http://ambari:8080

Ask your administrator for a username and password with read-only access.

Find out what the current default values are, for example

mapreduce.map.java.opts=-Xmx5012m
mapreduce.reduce.java.opts=-Xmx6144m
mapreduce.map.memory.mb=4096
mapreduce.reduce.memory.mb=8192

What if my job is not even accepted by the cluster? :)

You are asking for resources the cluster doesn't have, meaning the request exceeds the cluster's maximum limit. Check what your job is really asking for and what the cluster can provide.
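
A quick sanity check of a memory request against the limits can be sketched like this. The numbers are the ones from the sample cluster described earlier (2 GB minimum, 80 GB maximum, 1024 MB increments); on a real cluster the corresponding limits are yarn.scheduler.minimum-allocation-mb and yarn.scheduler.maximum-allocation-mb, and the exact rounding depends on the scheduler in use, so treat this as an approximation.

```shell
# Limits from the sample cluster above; assumptions for any other cluster.
MIN_MB=2048      # smallest container a task gets
MAX_MB=81920     # 80 GB, the most a single task can ask for
STEP_MB=1024     # requests are rounded up to this increment

# Round a request up the way described above and compare it with the maximum.
check_request() {
  local req_mb="$1"
  if [ "$req_mb" -lt "$MIN_MB" ]; then
    req_mb="$MIN_MB"
  fi
  local rounded_mb=$(( (req_mb + STEP_MB - 1) / STEP_MB * STEP_MB ))
  if [ "$rounded_mb" -gt "$MAX_MB" ]; then
    echo "rejected: ${rounded_mb} MB exceeds the ${MAX_MB} MB maximum"
  else
    echo "ok: ${rounded_mb} MB"
  fi
}

check_request 5000
check_request 90000
```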

Why is my job being killed ?

If your job exceeds the resource limit it originally requested from the Resource Manager (RM), YARN will kill its containers.

You will see something like the following in the logs

Killing container....
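
The kill message names a container, while the log tools want an application ID. A small sketch for deriving one from the other is below; the container ID used is made up for the example, and the helper name app_id_from_container is not part of any Hadoop tool.

```shell
# Derive the YARN application ID from a container ID.
# Handles both the plain form (container_<ts>_<app>_<attempt>_<n>)
# and the form with an epoch prefix (container_e17_<ts>_...).
app_id_from_container() {
  echo "$1" | sed -E 's/^container_(e[0-9]+_)?([0-9]+)_([0-9]+)_.*/application_\2_\3/'
}

app_id_from_container container_1428487296152_0001_01_000001
```

With the application ID in hand, the full logs can usually be fetched with yarn logs -applicationId followed by that ID, provided log aggregation is enabled on the cluster.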

Remember, Google is your friend

The moment you see your job has failed, take the error from the logs above and search for it on Google. Try to find which parameters people have suggested changing in the job.

Resources

https://altiscale.zendesk.com/hc/en-us/articles/200801519-Configuring-Memory-for-Mappers-and-Reducers-in-Hadoop-2

Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/maven/cli/MavenCli : Unsupported major.minor version 51.0

I got the following error when running Maven

Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/maven/cli/MavenCli : Unsupported major.minor version 51.0

Solution

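Class file version 51.0 corresponds to Java 7, so this error means Maven was started with an older JVM. Set the Java home to a newer JDK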

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home
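
For reading this kind of error in future, the version number in the message can be translated into a Java release. A minimal sketch, using the fact that the class file major version is the Java release plus 44 (for Java 5 and later); the function name is invented for the example.

```shell
# Map a class file major version to the Java release that produces it.
# Holds for Java 5 and later: 49 -> 5, 50 -> 6, 51 -> 7, 52 -> 8.
java_release_for_major() {
  echo $(( $1 - 44 ))
}

java_release_for_major 51
```

The JVM reporting the error is older than the release printed here, so JAVA_HOME needs to point at that release or newer.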

Git clone via ssh tunnel

Follow the steps in the post below to configure an SSH tunnel
http://jugnu-life.blogspot.com/2015/04/ssh-tunnell.html
Then point git at the tunnel

git config --global http.proxy 'socks5://127.0.0.1:9999'

Git uses the http.proxy setting for both http:// and https:// remotes, so a separate https.proxy entry is not needed.

Now you can run git clone against an http(s) URL

SSH Tunnel

I spent a lot of time troubleshooting an SSH tunnel problem, so I am jotting down these notes for future reference.

A dynamic SSH tunnel can be set up with a simple command

ssh -vvvvv -D port username@remotehost

Example

ssh -vvvvv -D 9999 jagatsingh@10.20.30.40

Now in Firefox, set the SOCKS 5 proxy to

localhost 9999

Remember to clear all other proxy fields, e.g. HTTP, HTTPS etc. I wasted a lot of time on this.

Errors

channel 2: open failed: administratively prohibited: open failed
debug2: channel 2: zombie

Resolution steps

On the remote host, check in /etc/ssh/sshd_config that TCP forwarding is enabled

AllowTcpForwarding yes

The webpage below is a good reference
http://www.slashroot.in/ssh-port-forwarding-linux-configuration-and-examples

Parallel file download: a wget alternative

http://hortonassets.s3.amazonaws.com/2.2/Sandbox_HDP_2.2_VirtualBox.ova

On a Mac we can use

brew install aria2

aria2c -x 16 -s 16 http://hortonassets.s3.amazonaws.com/2.2/Sandbox_HDP_2.2_VirtualBox.ova

This will open up to 16 parallel connections

Source

http://stackoverflow.com/questions/3430810/wget-download-with-multiple-simultaneous-connections

Hadoop conf files in Pivotal Hadoop

Pivotal stores files in a different location than the default

The actual binaries can be found under the path

/usr/lib/gphd/

Example

/usr/lib/gphd/sqoop/

Conf files can be found under the path

/etc/gphd

Example

/etc/gphd/sqoop/conf

Open source job dependency tools

I was looking for open source alternatives for job dependency management.

A few things I found

Taskforest is a simple but expressive open-source job scheduler that allows you to chain jobs/tasks and create time dependencies. It uses text config files to specify task dependencies.

http://www.taskforest.com/

schedulix is the Open Source Enterprise Job Scheduling System, which meets the complex requirements of modern IT process automation.

http://www.schedulix.org/

Some other tools popular in the Hadoop world

https://github.com/mesos/chronos

Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows.
https://azkaban.github.io/

Luigi

Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
https://github.com/spotify/luigi

Luigi lets you run batch jobs with complex scheduling, defined in Python code.
By default it supports running Hadoop, MySQL, Scalding, Spark and other jobs.
You can see the list of available configuration options here
https://luigi.readthedocs.org/en/latest/configuration.html