Jugnu Life :-): April 2012

Eclipse SVN configuration for JavaHL

Failed to load JavaHL Library.
These are the errors that were encountered:
no libsvnjavahl-1 in java.library.path
no svnjavahl-1 in java.library.path
no svnjavahl in java.library.path
java.library.path = /usr/lib/jvm/java-6-openjdk/jre/lib/i386/client:/usr/lib/jvm/java-6-openjdk/jre/lib/i386::/usr/java/packages/lib/i386:/usr/lib/jni:/lib:/usr/lib

On Debian or Ubuntu you can install the missing library with
sudo apt-get install libsvn-java
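
To see why the load fails, it can help to check whether the library file is present in any directory listed in java.library.path. The following is a minimal sketch, not part of Subclipse or JavaHL; the class NativeLibraryFinder and its find method are names made up for this illustration.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of Subclipse or JavaHL.
final class NativeLibraryFinder {

    // Returns the first file named fileName found in a directory of libraryPath.
    static Optional<Path> find(String libraryPath, String fileName) {
        for (String dir : libraryPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // the path in the error above contains an empty entry ("::")
            }
            Path candidate = Paths.get(dir, fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // mapLibraryName adds the platform prefix and suffix,
        // for example "libsvnjavahl-1.so" on Linux.
        String fileName = System.mapLibraryName("svnjavahl-1");
        String libraryPath = System.getProperty("java.library.path");
        System.out.println(find(libraryPath, fileName)
                .map(p -> "Found " + p)
                .orElse(fileName + " is not on java.library.path"));
    }
}
```

Run it with the same JVM that Eclipse uses, since java.library.path differs between JVM installations.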

Mount HDFS using Fuse

The tutorials linked below explain how to install Hadoop fuse-dfs and use it to mount HDFS.

http://cloudblog.8kmiles.com/2012/01/09/hadoop-cdh3-mount-hdfs/

https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS

http://wiki.apache.org/hadoop/MountableHDFS

Namenode Version File Structure

The NameNode keeps the following files in

dfs.name.dir (or, under the newer parameter name, dfs.namenode.name.dir)

current (folder)
    VERSION
    fsimage
    fstime
    edits
image (folder)
in_use.lock

This post discusses only the VERSION file; the rest are topics for other posts.

The contents of my VERSION file are shown below.

#Thu Apr 12 22:10:30 IST 2012
namespaceID=36961221
cTime=0
storageType=NAME_NODE
layoutVersion=-18

The significance of each field is as follows.

storageType denotes the role this machine plays in the Hadoop cluster. It takes one of the values defined in HdfsConstants.NodeType, namely NAME_NODE or DATA_NODE.

The other three values are defined in StorageInfo.

cTime is the creation time of the NN storage.

namespaceID is the unique identifier that is generated when the NN is formatted.

layoutVersion is pretty interesting: it is a negative integer that identifies the version of the HDFS persistent data structures. Whenever the layout changes, this version is decremented. More on this when I write about cluster upgrades.
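
The VERSION file shown above is in the Java properties format, so its fields can be read with java.util.Properties. Here is a minimal sketch; the class VersionFileReader is a made-up name for this illustration and is not how Hadoop itself reads the file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical reader for illustration; Hadoop's own logic lives in its storage classes.
final class VersionFileReader {

    // Parses the text of a VERSION file into key/value pairs.
    static Properties parse(String contents) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(contents)); // lines starting with '#' are comments
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Same contents as the VERSION file shown in this post.
        String contents = String.join("\n",
                "#Thu Apr 12 22:10:30 IST 2012",
                "namespaceID=36961221",
                "cTime=0",
                "storageType=NAME_NODE",
                "layoutVersion=-18");
        Properties props = parse(contents);
        int layoutVersion = Integer.parseInt(props.getProperty("layoutVersion"));
        System.out.println("storageType   = " + props.getProperty("storageType"));
        System.out.println("namespaceID   = " + props.getProperty("namespaceID"));
        System.out.println("cTime         = " + props.getProperty("cTime"));
        System.out.println("layoutVersion = " + layoutVersion);
    }
}
```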

Some text from the official documentation:

Storage information file.

Local storage information is stored in a separate file VERSION. It contains type of the node, the storage layout version, the namespace id, and the fs state creation time.

Local storage can reside in multiple directories. Each directory should contain the same VERSION file as the others. During startup Hadoop servers (name-node and data-nodes) read their local storage information from them.

The servers hold a lock for each storage directory while they run so that other nodes were not able to startup sharing the same storage. The locks are released when the servers stop (normally or abnormally).

Hadoop Daemons

This post explains the functions of the different Hadoop daemons that run inside a Hadoop cluster.
Note: This covers the architecture of the Hadoop 1.x series and earlier. The newer architecture of the Hadoop 0.23 (2.x) series will get a separate post.

When we submit a job to a Hadoop cluster, it goes to the JobTracker, which assigns its tasks to the various TaskTrackers.
The TaskTrackers run the map and reduce tasks and complete the processing for the submitted job.
The JobTracker also keeps track of failures: if something fails on a TaskTracker (TT), it assigns that part of the job to another one.
There is only one JobTracker in the cluster.

The NameNode (NN) keeps track of the data blocks that make up each file. The blocks themselves are stored on DataNodes (DN), which periodically report to the NN the health of their blocks and which blocks they hold.
If a block is lost for some reason, the NN can have it replicated again from a copy stored on another machine.
In core-site.xml and mapred-site.xml we specify the hostname and port of the NameNode and the JobTracker, respectively.

Data integrity in hadoop

Hadoop maintains data integrity by regularly computing and verifying checksums of the data.

Starting with Hadoop 0.23.1, the default checksum algorithm has been changed to CRC32C, which is more efficient than the CRC32 used in previous versions. You can read more about this change in JIRA HDFS-2130.

Now, how it works.

The property io.bytes.per.checksum controls how many bytes of data each CRC checksum covers. In the new version of Hadoop this property has been renamed to dfs.bytes-per-checksum. Its default value is 512 bytes, and it should not be larger than dfs.stream-buffer-size.

A DataNode verifies the checksum of the data it receives and raises a ChecksumException if something is wrong.
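
The idea of one checksum per fixed-size chunk can be sketched as follows. This is an illustration only, not Hadoop's implementation: the class ChunkChecksums and its methods are made-up names, and it uses the plain CRC32 from java.util.zip rather than CRC32C (which is available as java.util.zip.CRC32C on Java 9 and later).

```java
import java.util.zip.CRC32;

// Hypothetical sketch of per-chunk checksums; not Hadoop's own code.
final class ChunkChecksums {

    // Mirrors the default of dfs.bytes-per-checksum mentioned above.
    static final int BYTES_PER_CHECKSUM = 512;

    // Computes one CRC32 value for every bytesPerChecksum bytes of data.
    static long[] compute(byte[] data, int bytesPerChecksum) {
        int chunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int offset = i * bytesPerChecksum;
            int length = Math.min(bytesPerChecksum, data.length - offset);
            crc.reset();
            crc.update(data, offset, length);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // Returns the index of the first chunk that does not match, or -1 if all match.
    static int firstCorruptChunk(byte[] data, long[] expected, int bytesPerChecksum) {
        long[] actual = compute(data, bytesPerChecksum);
        int n = Math.max(actual.length, expected.length);
        for (int i = 0; i < n; i++) {
            if (i >= actual.length || i >= expected.length || actual[i] != expected[i]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300]; // 3 chunks: 512 + 512 + 276 bytes
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 31);
        }
        long[] sums = compute(data, BYTES_PER_CHECKSUM);
        data[600] ^= 0x01; // simulate corruption inside the second chunk
        System.out.println("First corrupt chunk: "
                + firstCorruptChunk(data, sums, BYTES_PER_CHECKSUM));
    }
}
```

Because each checksum covers only a small chunk, a reader can tell which part of a block is damaged instead of only learning that the block as a whole is bad.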

Whenever a client reads data from a DN, the client verifies the checksum and reports back to the DN that it was correct. The DN then updates its own checksum log, which records when the checksum of each block was last verified.

Every DN also regularly runs DataBlockScanner, which verifies the health of the data stored on it.

Whenever a corrupted block is found, the NN starts replicating it again from a healthy copy so that the required replication factor (3 by default, meaning each block has 3 copies in HDFS) is restored.

While reading a file from HDFS using the FileSystem API, we can also say that we do not want the checksum to be verified (for some reason) during the transfer.

Hadoop API

/hadoop-0.23.1/share/doc/hadoop/api/org/apache/hadoop/fs/FileSystem.html#setVerifyChecksum(boolean)

void setVerifyChecksum(boolean verifyChecksum)
Set the verify checksum flag.

Hive index for performance

Presentation: Indexed Hive, by NikhilDeshpande

No output required from Hadoop Map Reduce

If for some reason you do not want your MapReduce job to produce any output, you can set NullOutputFormat as the output format class:

// NullOutputFormat here is org.apache.hadoop.mapreduce.lib.output.NullOutputFormat (new API)
job.setOutputFormatClass(NullOutputFormat.class);

Pig UDF Library Collection

On this page I will collect the Pig UDF libraries I find online.

1) http://sna-projects.com/datafu/

UDFs for dates, PageRank, bags, geo, hashes, numbers, sessions and stats

2) Amazon Pig UDF Library

http://aws.amazon.com/code/2730

If you find something interesting and would like to add it to this page, please comment below or write to me at jagatsingh [at] gmail dot com

Thanks

DataNode: Incompatible namespaceIDs

Error

DataNode: java.io.IOException: Incompatible namespaceIDs

The issue is related to JIRA https://issues.apache.org/jira/browse/HDFS-107. It occurs when we format the namenode while the data directories of the datanodes still hold the old references and the old namespaceID.

When you format the namenode, you have to take care of the datanode data directories:

Either remove their contents
or choose a new location for the datanode directories
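
Before deleting anything, you can confirm the mismatch by comparing the namespaceID entries of the two VERSION files. The sketch below is an illustration under assumptions: the class NamespaceIdCheck is a made-up name, and the two file paths are passed in by you (typically the current/VERSION file under the namenode and datanode storage directories).

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical diagnostic for illustration; not part of Hadoop.
final class NamespaceIdCheck {

    // Extracts the namespaceID entry from the text of a VERSION file (null if absent).
    static String namespaceId(String versionFileContents) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(versionFileContents));
        return props.getProperty("namespaceID");
    }

    // True only when both files carry the same namespaceID.
    static boolean compatible(String nameNodeVersion, String dataNodeVersion) throws IOException {
        String nn = namespaceId(nameNodeVersion);
        return nn != null && nn.equals(namespaceId(dataNodeVersion));
    }

    private static String read(String path) throws IOException {
        return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: NamespaceIdCheck <namenode VERSION> <datanode VERSION>");
            return;
        }
        System.out.println(compatible(read(args[0]), read(args[1]))
                ? "namespaceIDs match"
                : "Incompatible namespaceIDs");
    }
}
```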

Start Hive Thrift Server

The Hive Thrift server is very useful for integrating Hive with applications that can talk Thrift.

It is widely used by reporting tools such as Pentaho.

To start the Hive Thrift server, just type

# hive --service hiveserver

It shows a message like

Starting Hive Thrift Server

To check that the Hive server has started successfully, type

# netstat -nl | grep 10000

Something should be listening on that port (10000 is the default port of the Hive server).
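
The same check can be done from an application before it tries to talk to Hive. This is a minimal sketch: the class PortCheck is a made-up name, and unlike netstat it does not list sockets but simply tries to open a TCP connection to the port.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper for illustration; it only tests that a TCP port accepts connections.
final class PortCheck {

    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // connection refused, unreachable host, or timeout
        }
    }

    public static void main(String[] args) {
        // 10000 is the port used in the netstat check above.
        boolean up = isListening("localhost", 10000, 1000);
        System.out.println(up
                ? "Something is listening on port 10000"
                : "Nothing is listening on port 10000");
    }
}
```

A successful connection only shows that some process is listening on the port; it does not prove that the process is the Hive Thrift server.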

Hive does have compatibility issues with the latest versions of Hadoop, so if your Thrift server is not starting, check whether your Hadoop version is supported.