Each file smaller than the block size still occupies a full block in the Namenode's namespace. Actual disk usage, however, is based on the file's size, so don't be confused into thinking that a small file consumes a full block of storage.
Each block consumes some amount of Namenode RAM, so the Namenode can address only a fixed number of blocks, depending on how much RAM it has.
Small files in a Hadoop cluster therefore keep pushing the Namenode toward its memory limit. Sooner or later we hit the condition where we can no longer add data to the cluster because the Namenode cannot address new blocks.
Let me summarize the problem with a simple example.
Suppose we have 1000 small files of 100 KB each.
Each file will have one block mapped against it, so the Namenode will create 1000 block entries.
Actual space consumption will be only 1000 × 100 KB (assuming replication = 1).
If we instead had a single combined file of 1000 × 100 KB (about 100 MB), it would have consumed just one block.
Information about each block is stored in Namenode RAM, so instead of just 1 entry we are consuming 1000 entries in Namenode RAM.
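To put rough numbers on this, here is a quick sketch. It assumes the widely cited heuristic of roughly 150 bytes of Namenode heap per block object; the exact figure varies by Hadoop version, so treat it as an illustration only:

```python
BLOCK_SIZE = 128 * 1024 * 1024    # default HDFS block size: 128 MB
FILE_SIZE = 100 * 1024            # each small file: 100 KB
NUM_FILES = 1000
BYTES_PER_BLOCK_META = 150        # rough heuristic, varies by version

# 1000 small files -> 1000 blocks -> 1000 metadata entries in Namenode RAM
small_file_blocks = NUM_FILES
small_file_meta = small_file_blocks * BYTES_PER_BLOCK_META

# One merged ~100 MB file fits inside a single 128 MB block
merged_size = NUM_FILES * FILE_SIZE            # 102,400,000 bytes
merged_blocks = -(-merged_size // BLOCK_SIZE)  # ceiling division -> 1
merged_meta = merged_blocks * BYTES_PER_BLOCK_META

print(small_file_blocks, merged_blocks)  # 1000 vs 1 block entries
print(small_file_meta, merged_meta)      # ~150000 vs ~150 bytes of heap
```

Same data on disk, but a 1000× difference in Namenode metadata.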
Let's look at the solution.
The solution is straightforward: merge the small files into bigger ones so that they fill full blocks.
I am listing the approaches; you can follow whichever suits your needs.
1) HDFSConcat
If the files have the same block size and replication factor, you can use an HDFS-specific API (HDFSConcat) to concatenate two files into a single file without copying blocks.
2) IdentityMapper and IdentityReducer
Write an MR job with an Identity mapper and an Identity reducer, which emits exactly what it reads but with fewer output files. You control the number of output files by configuring the number of reducers.
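The idea can be illustrated with a small local sketch (this is not Hadoop itself; `identity_merge` is a hypothetical stand-in that mimics how records pass through unchanged while the reducer count bounds the number of output files):

```python
import hashlib
import os
import tempfile

def identity_merge(input_files, out_dir, num_reducers):
    """Local analogue of an identity MR job: every line passes through
    unchanged, but lines are hash-partitioned across `num_reducers`
    output files, just as the reducer count caps the output file count."""
    outs = [open(os.path.join(out_dir, f"part-r-{i:05d}"), "w")
            for i in range(num_reducers)]
    try:
        for path in input_files:
            with open(path) as f:
                for line in f:
                    # same idea as Hadoop's default partitioner: hash % reducers
                    h = int(hashlib.md5(line.encode()).hexdigest(), 16)
                    outs[h % num_reducers].write(line)
    finally:
        for o in outs:
            o.close()

# Usage sketch: 100 tiny input files collapse into 4 output files.
with tempfile.TemporaryDirectory() as d:
    inputs = []
    for i in range(100):
        p = os.path.join(d, f"small-{i}.txt")
        with open(p, "w") as f:
            f.write(f"record-{i}\n")
        inputs.append(p)
    out_dir = os.path.join(d, "merged")
    os.mkdir(out_dir)
    identity_merge(inputs, out_dir, num_reducers=4)
    print(len(os.listdir(out_dir)))  # 4
```

In a real cluster you would set `mapreduce.job.reduces` on the job configuration to achieve the same effect.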
3) Hadoop File Crush
Crush - Crush small files in dfs to fewer, larger files
hadoop jar filecrush-2.0-SNAPSHOT.jar crush.Crush -Ddfs.block.size=134217728 \
/path/to/my/input /path/to/my/output/merge 20100221175612
Where 20100221175612 is a random timestamp which you can pass at the time of executing the script.
Read this document for more details. I would recommend going with this tool for solving this problem.
Advantages of Filecrush are:
1) Automatically takes care of subfolders
2) Options to do in-folder or off-folder merges
3) Lets you decide the output format; supported formats are Text and Sequence files
4) Lets you compress the data
4) Hive concatenate utility
If your Hive tables use the ORC format, then you can merge the small files in each partition using Hive.
Find the list of partitions modified during the last 7 days by querying the metastore database. The query below assumes a MySQL metastore with the default schema named hive; note that the metastore records a partition's CREATE_TIME (Unix seconds), which is only an approximation of "modified":
SELECT D.NAME, T.TBL_NAME, P.PART_NAME
FROM hive.PARTITIONS P, hive.TBLS T, hive.DBS D
WHERE P.TBL_ID = T.TBL_ID
AND T.DB_ID = D.DB_ID
AND P.CREATE_TIME > UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 DAY));
Then run the Hive ALTER command to concatenate all the partitions that need it:
ALTER TABLE tbl_name PARTITION (part_spec) CONCATENATE;
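When many partitions need concatenation, you can script the ALTER statements from the partition list found in the previous step. A minimal sketch; the table name and partition specs below are hypothetical examples:

```python
def concatenate_statements(table, partition_specs):
    """Build one ALTER ... CONCATENATE statement per partition.
    `partition_specs` is a list of dicts mapping partition column
    names to values, e.g. [{"dt": "2020-01-01"}]."""
    stmts = []
    for spec in partition_specs:
        pairs = ", ".join(f"{col}='{val}'" for col, val in spec.items())
        stmts.append(f"ALTER TABLE {table} PARTITION ({pairs}) CONCATENATE;")
    return stmts

# Hypothetical ORC table with daily date partitions:
for s in concatenate_statements("sales_orc",
                                [{"dt": "2020-01-01"}, {"dt": "2020-01-02"}]):
    print(s)
# ALTER TABLE sales_orc PARTITION (dt='2020-01-01') CONCATENATE;
# ALTER TABLE sales_orc PARTITION (dt='2020-01-02') CONCATENATE;
```

The generated statements can then be fed to the hive CLI or beeline.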
Many people also recommend HAR and SequenceFiles; you might go that way as well if you are willing to change the format of your data.
Thanks for reading. Please also share your approach to solving this problem in the comments below.
More reading references