Problem
Each file smaller than the block size still gets a full block allocated to it in the HDFS namespace. However, actual disk usage is based on the size of the file, so don't be confused: a small file does not consume a full block of storage on disk.
Each block consumes some amount of Namenode RAM.
The Namenode can address only a fixed number of blocks, and that number depends on how much RAM it has.
Small files in a Hadoop cluster therefore keep pushing the Namenode towards its memory limit.
So sooner or later we hit the condition where we can no longer add data to the cluster, because the Namenode cannot address any new blocks.
Let me summarize the problem with a simple example
Say we have 1000 small files of 100 KB each.
Each file has at least one block mapped against it, so the Namenode will track 1000 blocks for them.
The actual disk consumption, however, is only 1000 * 100 KB, i.e. about 100 MB (assuming replication = 1).
Whereas if we had a single combined file of 1000 * 100 KB, it would fit in just 1 block (with the default 128 MB block size), so the Namenode would track just 1 block.
Information about each block is stored in Namenode RAM (roughly 150 bytes per object, as a rule of thumb), so instead of just 1 entry we are consuming 1000 entries in Namenode RAM.
Let's look at the solutions.
Solution
The solution is straightforward: merge the small files into bigger ones so that they fill full blocks.
I am listing the approaches you can follow, depending on your choice.
1) HDFSConcat
If the files have the same block size and replication factor, you can use an HDFS-specific API (HDFSConcat) to concatenate two files into a single file without copying blocks (see the sketch after the links below).
http://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/HDFSConcat.java
https://issues.apache.org/jira/browse/HDFS-222
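For reference, here is a minimal sketch of the same idea through the public FileSystem.concat() call, which HDFS implements. The class name and paths below are just placeholders, and the usual restrictions apply: it only works on HDFS, the sources must match the target's block size and replication factor, and the source files disappear once their blocks are moved into the target.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcatSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // must resolve to HDFS (DistributedFileSystem)

        // Target file that will absorb the blocks of the source files (placeholder path).
        Path target = new Path("/data/merged/part-0");
        // Source files; their blocks are moved into the target and the sources are removed.
        Path[] sources = {
            new Path("/data/small/file-1"),
            new Path("/data/small/file-2")
        };

        // Moves blocks only -- no data is copied. Fails if block size or
        // replication factor differ, or other concat preconditions are not met.
        fs.concat(target, sources);

        fs.close();
    }
}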
2) IdentityMapper and IdentityReducer
Write an MR job with an identity mapper and an identity reducer: it writes out exactly what it reads, but into a smaller number of output files. You control how many output files you get by configuring the number of reducers, as shown in the sketch below.
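A minimal driver sketch of that idea, using the newer MapReduce API where the base Mapper and Reducer classes already behave as identity (pass-through) implementations. The class name, the input/output paths and the reducer count are placeholders. One caveat: with the default TextInputFormat/TextOutputFormat the key is the byte offset of each line, so every output line gets prefixed with that offset and a tab, which may or may not be acceptable for your data.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeSmallFilesJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "merge-small-files");
        job.setJarByClass(MergeSmallFilesJob.class);

        // The base Mapper and Reducer classes simply pass records through (identity).
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // Default TextInputFormat produces (byte offset, line of text) pairs.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Fewer reducers => fewer, larger output files.
        job.setNumReduceTasks(5);

        FileInputFormat.addInputPath(job, new Path("/path/to/small/files"));
        FileOutputFormat.setOutputPath(job, new Path("/path/to/merged/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}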
3) FileUtil.copyMerge
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileUtil.html#copyMerge
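Here is a minimal sketch of calling it from Java, assuming a Hadoop 1.x/2.x release (the method was deprecated and later removed in newer versions, so check your distribution); the class name and paths are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CopyMergeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path srcDir  = new Path("/path/to/small/files");  // directory containing the small files
        Path dstFile = new Path("/path/to/merged/file");  // single merged output file

        // Copies the content of every file under srcDir into dstFile.
        // deleteSource = false keeps the originals; the last argument is a string
        // appended after each file's content (a newline here).
        boolean ok = FileUtil.copyMerge(fs, srcDir, fs, dstFile, false, conf, "\n");
        System.out.println("copyMerge finished: " + ok);

        fs.close();
    }
}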
4) Hadoop File Crush
Crush - Crush small files in dfs to fewer, larger files
http://www.jointhegrid.com/hadoop_filecrush/index.jsp
Sample invocation
hadoop jar filecrush-2.0-SNAPSHOT.jar crush.Crush -Ddfs.block.size=134217728 \
--input-format=text \
--output-format=text \
--compress=none \
/path/to/my/input /path/to/my/output/merge 20100221175612
Here 20100221175612 is a timestamp which you can generate at the time you execute the command.
Read this document for more details. I would recommend going with this tool to solve this problem:
http://www.jointhegrid.com/svn/filecrush/trunk/src/main/resources/help.txt
The advantages of FileCrush are:
1) It automatically takes care of subfolders.
2) It gives you options to merge files within their original folder or into a separate output folder.
3) It lets you decide the output format; Text and Sequence files are supported.
4) It lets you compress the data.
5) Hive concatenate utility
If your Hive tables use the ORC format, you can merge the small files of each partition with Hive's built-in concatenation.
First, find the list of partitions created during the last 7 days by querying the Hive metastore database:
SELECT D.NAME, T.TBL_NAME, P.PART_NAME
FROM hive.PARTITIONS P, hive.DBS D, hive.TBLS T
WHERE T.DB_ID = D.DB_ID
  AND P.TBL_ID = T.TBL_ID
  AND DATEDIFF(NOW(), FROM_UNIXTIME(P.CREATE_TIME)) <= 7;
Then run the Hive ALTER command to concatenate each of the partitions that need it:
ALTER TABLE tbl_name PARTITION (part_spec) CONCATENATE;
Closing notes:
Many people also recommend HAR files and Sequence files; you might go that way too if you want to change the format of your data.
Thanks for reading. Please also share your approach to solving this problem in the comments below.
More reading references
http://stackoverflow.com/questions/11368907/hadoop-block-size-and-file-size-issue
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
http://www.ibm.com/developerworks/web/library/wa-introhdfs/
http://amilaparanawithana.blogspot.com.au/2012/06/small-file-problem-in-hadoop.html
http://www.quora.com/How-can-I-merge-files-in-HDFS