Merging small files in Hadoop

Before getting to the solution, I would like to summarize what the problem is and why it is a problem. If you already know the details, you can skip directly to the solution.

Problem
Each file that is smaller than the block size still gets its own block allocated in the namespace; the actual disk usage, however, is based on the size of the file, so don't be confused into thinking that a small file consumes a full block of storage.
Each block consumes some amount of Namenode RAM.
The Namenode can address only a fixed number of blocks, which depends on the amount of RAM it has.
Small files in the Hadoop cluster keep pushing the Namenode towards this memory limit.
So sooner or later we can reach a state where we can no longer add data to the cluster because the Namenode cannot address new blocks.
Let me summarize the problem with a simple example.
We have 1000 small files of 100 KB each.
Each file has one block mapped against it, so the Namenode tracks 1000 blocks.
The actual space consumption is only 1000 * 100 KB = 100 MB (assuming replication = 1).
Whereas if we had a single combined file of 1000 * 100 KB = 100 MB, it would have consumed just 1 block entry on the Namenode (it fits within the typical 128 MB block size).
Information about each block is stored in Namenode RAM, so instead of just 1 entry we are consuming 1000 entries of Namenode RAM.
Let's look at the solution.

Solution
The solution is straightforward: merge the small files into bigger ones so that they fill full blocks.
I am listing the approaches you can follow, depending on your preference.
1) HDFSConcat
If the files have the same block size and replication factor, you can use an HDFS-specific API (HDFSConcat) to concatenate two files into a single file without copying blocks.
http://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/HDFSConcat.java
https://issues.apache.org/jira/browse/HDFS-222
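For illustration, here is a minimal sketch (not the HDFSConcat tool itself) using the FileSystem#concat API that the tool wraps. The paths are hypothetical; concat is only implemented for HDFS (DistributedFileSystem), and the files must share block size and replication as noted above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcatExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical target file; the source blocks get appended to it.
    Path target = new Path("/data/merged/part-0");
    Path[] sources = {
        new Path("/data/small/part-1"),   // hypothetical source files
        new Path("/data/small/part-2")
    };

    // Moves the source blocks onto the target file without copying any data.
    fs.concat(target, sources);
  }
}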
2) IdentityMapper and IdentityReducer
Write an MR job with an identity mapper and identity reducer that spits out exactly what it reads, but into a smaller number of output files. You control this by configuring the number of reducers.
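A minimal sketch of such a job, assuming plain-text input. Since the default TextInputFormat key is the byte offset, the mapper below drops the key so the output lines match the input lines; note that routing everything through a reducer also sorts the lines.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeSmallFiles {

  // Near-identity mapper: passes every line through unchanged, dropping the byte-offset key.
  public static class LineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(value, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "merge-small-files");
    job.setJarByClass(MergeSmallFiles.class);
    job.setMapperClass(LineMapper.class);
    job.setReducerClass(Reducer.class);   // the base Reducer is an identity reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setNumReduceTasks(1);             // fewer reducers => fewer, larger output files

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}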
3) FileUtil.copyMerge
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileUtil.html#copyMerge
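A minimal sketch of calling it from Java (the paths are hypothetical). Note that copyMerge pulls the data through the client, unlike HDFSConcat, and that the helper was removed in newer Hadoop 3 releases, so check your version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CopyMergeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Concatenates every file under /data/small into the single file /data/merged.txt.
    FileUtil.copyMerge(
        fs, new Path("/data/small"),        // source filesystem and directory
        fs, new Path("/data/merged.txt"),   // destination filesystem and file
        false,                              // do not delete the source files
        conf,
        null);                              // no separator string between files
  }
}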
4) Hadoop File Crush
Crush - Crush small files in dfs to fewer, larger files
http://www.jointhegrid.com/hadoop_filecrush/index.jsp
Sample invocation
hadoop jar filecrush-2.0-SNAPSHOT.jar crush.Crush -Ddfs.block.size=134217728  \
  --input-format=text  \
  --output-format=text \
  --compress=none \
  /path/to/my/input /path/to/my/output/merge 20100221175612
Where 20100221175612 is a timestamp that you supply at the time of executing the script.
Read this document for more details. I would recommend going with this tool for solving this problem:
http://www.jointhegrid.com/svn/filecrush/trunk/src/main/resources/help.txt
 
The advantages of Filecrush are:
1) Automatically takes care of subfolders
2) Options to do in-folder or off-folder merges
3) Lets you decide the output format; supported formats are text and sequence files
4) Lets you compress the data
 
 
5)  Hive concatenate utility

If your Hive tables use the ORC format, then you can merge the small files partition by partition using Hive.

First, find the list of partitions created during the last 7 days by querying the Hive metastore database:

SELECT D.NAME, T.TBL_NAME, P.PART_NAME
FROM hive.PARTITIONS P, hive.DBS D, hive.TBLS T
WHERE T.DB_ID = D.DB_ID
  AND P.TBL_ID = T.TBL_ID
  AND DATEDIFF(NOW(), FROM_UNIXTIME(P.CREATE_TIME)) <= 7;

Then run the Hive ALTER ... CONCATENATE command for all the partitions that need it:

ALTER TABLE tbl_name PARTITION (part_spec) CONCATENATE;
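If there are many partitions, you can script the ALTER statements. Below is a hypothetical sketch over Hive JDBC; the HiveServer2 URL, credentials, table name, and partition specs are placeholders that you would fill in from the metastore query above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ConcatenatePartitions {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String url = "jdbc:hive2://hiveserver2-host:10000/default";      // hypothetical HiveServer2
    String[] partSpecs = { "dt='2016-01-01'", "dt='2016-01-02'" };   // from the metastore query

    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement()) {
      for (String spec : partSpecs) {
        // Each statement rewrites the partition's small ORC files into fewer, larger ones.
        stmt.execute("ALTER TABLE tbl_name PARTITION (" + spec + ") CONCATENATE");
      }
    }
  }
}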

Closing notes:
Many people also recommend HAR archives and sequence files; you might go that way as well if you are willing to change the format of your data.
Thanks for reading. Please also share your approach to solving this problem in the comments below.
More reading references
http://stackoverflow.com/questions/11368907/hadoop-block-size-and-file-size-issue
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
http://www.ibm.com/developerworks/web/library/wa-introhdfs/
http://amilaparanawithana.blogspot.com.au/2012/06/small-file-problem-in-hadoop.html
http://www.quora.com/How-can-I-merge-files-in-HDFS

Vowpal Wabbit Install Tutorial Steps

VW (Vowpal Wabbit) has two dependencies: the Boost library and libtool.

So before running the install you need to get them; otherwise you will get errors like the ones below.

fatal error: boost/program_options.hpp: No such file or directory

./autogen.sh: 8: ./autogen.sh: libtoolize: not found

Download Boost from

http://sourceforge.net/projects/boost/?source=dlp


libtoolize:

sudo apt-get install libtool


Let's do the VW install.

Download the latest zip from GitHub:

https://github.com/JohnLangford/vowpal_wabbit


# Unzip the code

unzip vowpal_wabbit-master.zip

# Move code to central place for install

sudo mv vowpal_wabbit-master /usr/local/

# Go to directory where code is

cd /usr/local/vowpal_wabbit-master

# Install VW (depending on the version, you may need to run ./autogen.sh and make
# before this step; the autogen.sh error shown above comes from that stage)
sudo make install

# Go to directory where it was installed
cd /usr/local/bin/

# Change permissions so that all can use vw
sudo chmod 755 vw

Check that the Vowpal Wabbit install succeeded by running the vw command from the command prompt.

Done

How HBase major compaction works

Compaction is the process in which HBase combines small files (HStoreFiles) into bigger ones.

It is of two types:

Minor: takes a FEW of the files that are placed together and merges them into one.

Major: takes all the files in a region's store and merges them into one.

This post covers major compaction.

If you want to read about minor compaction, please read the other post, How HBase minor compaction works. I suggest you read that first.

The following properties affect major compaction:

hbase.hregion.majorcompaction

The time (in milliseconds) between 'major' compactions of all HStoreFiles in a region. Set to 0 to disable automated major compactions.

    Default: 86400000 (1 day)

hbase.server.compactchecker.interval.multiplier This property determines how often (at what time interval) the region server checks whether a compaction is necessary.

The interval between checks is hbase.server.compactchecker.interval.multiplier multiplied by hbase.server.thread.wakefrequency. For example, a multiplier of 1000 combined with the default wake frequency of 10000 ms means a check roughly every 10,000 seconds, about 2.8 hours (a small sketch of this calculation follows the property list below).
hbase.server.thread.wakefrequency

Time to sleep in between searches for work (in milliseconds). Used as sleep interval by service threads such as log roller.

Default: 10000
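To make the multiplication concrete, here is a small illustrative sketch that reads the two settings from whatever HBase configuration is on the classpath and prints the resulting check interval; the fallback values passed to getLong are assumptions for illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CompactionCheckInterval {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();   // loads hbase-default.xml / hbase-site.xml
    long wakeFrequencyMs = conf.getLong("hbase.server.thread.wakefrequency", 10000L);
    long multiplier = conf.getLong("hbase.server.compactchecker.interval.multiplier", 1000L);
    // The region server checks for needed compactions roughly every (multiplier * wake frequency) ms.
    System.out.println("Compaction check interval (ms): " + multiplier * wakeFrequencyMs);
  }
}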

 

Quoting from the thread below (discussion-specific content has been removed):

http://apache-hbase.679495.n3.nabble.com/Major-Compaction-Concerns-tp3642142p3645444.html

Major compactions are triggered by 3 methods: user issued, timed, and size-based. 

Even if time-based major compactions are disabled in your config, you can still hit size-based major compactions. Minor compactions are issued on a size-based threshold. 

The algorithm sees if sum(file[0:i] * ratio) > file[i+1] and includes file[0:i+1]   if so. 

This is a reverse iteration, so the highest 'i' value is used.  If all files match, then you can remove delete markers [which is the difference between a major and minor compaction].  Major compactions aren't a bad or time-intensive thing, it's just delete marker removal.

Minor compactions will usually pick up a couple of the smaller adjacent StoreFiles and rewrite them as one. Minors do not drop deletes or expired cells, only major compactions do this.

Now that you have read what major and minor compactions are, the next step is optimizing the above parameters based on your cluster profile, which we will cover in another post.

Happy Hadooping :)

How HBase minor compaction works

Compaction is the process in which HBase combines small files (HStoreFiles) into bigger ones.

It is of two types:

Minor: takes a FEW of the files that are placed together and merges them into one.

Major: takes all the files in a region's store and merges them into one.

This post covers minor compaction.

If you want to read about major compaction, please read the other post, How HBase major compaction works. I suggest you read about minor compaction first.

Let's see what decides the term FEW in minor compaction.

The following properties affect minor compaction:

 

hbase.hstore.compaction.min : Minimum number of StoreFiles per Store to be selected for a compaction to occur (default 2).
hbase.hstore.compaction.max : Maximum number of StoreFiles to compact per minor compaction (default 10).
hbase.hstore.compaction.min.size : Any StoreFile smaller than this setting will automatically be a candidate for compaction.
hbase.hstore.compaction.max.size : Any StoreFile larger than this setting will automatically be excluded from compaction.
hbase.store.compaction.ratio : Ratio used in the compaction file selection algorithm.

 

The files used for a minor compaction are decided based on the following rule (note that the check is based on file sizes):

A file is selected for compaction when its size <= sum(smaller_files_size) * hbase.hstore.compaction.ratio. A simplified code sketch of this rule follows the worked example below.

Quoting an example from the official book:

 

Consider the following configuration settings:

    hbase.store.compaction.ratio = 1.0f
    hbase.hstore.compaction.min = 3 (files)
    hbase.hstore.compaction.max = 5 (files)
    hbase.hstore.compaction.min.size = 10 (bytes)
    hbase.hstore.compaction.max.size = 1000 (bytes)

The following StoreFiles exist: 100, 50, 23, 12, and 12 bytes apiece (oldest to newest). With the above parameters, the files that would be selected for minor compaction are 23, 12, and 12.

Why?

Remember the logic

selects a file for compaction when the file size <= sum(smaller_files_size) * hbase.hstore.compaction.ratio.

    100 --> No, because sum(50, 23, 12, 12) * 1.0 = 97.
    50 --> No, because sum(23, 12, 12) * 1.0 = 47.
    23 --> Yes, because sum(12, 12) * 1.0 = 24.
    12 --> Yes, because the previous file has been included, and because this does not exceed the max-file limit of 5.
    12 --> Yes, because the previous file had been included, and because this does not exceed the max-file limit of 5.
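To tie the rule and the walkthrough together, here is a simplified, illustrative sketch of the selection logic. It is not the actual HBase code path and ignores compaction.min, min.size, and max.size; it only shows the ratio check and the "take everything newer once a file qualifies" behaviour, capped at compaction.max.

import java.util.ArrayList;
import java.util.List;

public class MinorCompactionSelectionSketch {

  // sizesOldestToNewest: StoreFile sizes ordered from oldest to newest.
  static List<Long> select(long[] sizesOldestToNewest, double ratio, int maxFiles) {
    List<Long> selected = new ArrayList<>();
    for (int i = 0; i < sizesOldestToNewest.length; i++) {
      long sumOfNewer = 0;
      for (int j = i + 1; j < sizesOldestToNewest.length; j++) {
        sumOfNewer += sizesOldestToNewest[j];
      }
      // The ratio check: file size <= sum(sizes of the newer files) * ratio.
      if (sizesOldestToNewest[i] <= sumOfNewer * ratio) {
        // First qualifying file found: take it and everything newer, up to the max-file cap.
        for (int j = i; j < sizesOldestToNewest.length && selected.size() < maxFiles; j++) {
          selected.add(sizesOldestToNewest[j]);
        }
        break;
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    // The example from the book: 100, 50, 23, 12 and 12 bytes, ratio 1.0, max 5 files.
    System.out.println(select(new long[] {100, 50, 23, 12, 12}, 1.0, 5));
    // Prints [23, 12, 12], matching the walkthrough above.
  }
}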

Hope this helps in understanding HBase minor compaction

Happy Hadooping :)

HBase Default Configuration list

The page below explains the default configuration properties of HBase and what their purpose is.

The documentation says it is built from the source, so it should always be up to date.

http://hbase.apache.org/book/config.files.html

I know, you are a source-control guy and always want to see the trunk version.

So here is the direct link:

http://svn.apache.org/repos/asf/hbase/trunk/hbase-common/src/main/resources/hbase-default.xml