“The textbook launch”: any Indian would have heard this phrase, often used by ISRO after a successful launch of satellites into space.
Last weekend we did a similar exercise, the textbook Hadoop cluster migration. This blog post shares thoughts and experiences around it.
To keep the story general, I am masking customer-specific details.
Prologue
It will be two years this June (2014) since I landed in Sydney, a city totally new to me. I came to implement Hadoop for a large financial customer. This journey has been a great, in fact awesome, learning experience and an opportunity to work with some of the best brains in the world.
The Weekend Task
We had to migrate to a new cluster as we were almost out of space and also needed more computation power. While I use the term "we" throughout this post, this has been the combined effort of an awesome bunch of people.
Old cluster: 11 nodes with about 250 TB of space
New cluster: 25 nodes
The two clusters were running different versions of Hadoop, so the move was from CDH 4.3 to CDH 4.6, and luckily there was binary compatibility between these two releases.
The cluster also runs HBase in production and has a replica mirror cluster (a third cluster).
The Plan
This activity had lots of planning in the background around:
Testing of sample production jobs on the new cluster
Total automation and setup of the new cluster via the Cloudera Manager API
Code migration of over 300 production Oozie jobs which pump data into the cluster and produce scores
Trickle feed: resuming the jobs in batches
A roster of shifts so people could complete the activity within one weekend
Good, clear communication about what was happening was very important for everyone involved, given that our team was spread across Sydney and India.
Besides traditional email we also used WhatsApp for communication, so that everyone could stay aware of things and read messages when they were available (and not be woken from sleep right after completing a roster shift).
The downtime of the cluster started on Friday evening. The data copy via distcp had been started a few days earlier, running during the nights, so that over the weekend we could use distcp in update mode to transfer only the newly added data.
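To make the update-mode idea concrete, here is a small sketch of how such a distcp invocation could be composed. The namenode addresses and the path are placeholders, not our real ones:

```python
def distcp_update_cmd(src_nn, dst_nn, path):
    """Compose a `hadoop distcp` invocation in -update mode, which
    copies only files that differ between source and destination.
    Namenode host:port values here are illustrative placeholders."""
    return ["hadoop", "distcp", "-update",
            "hdfs://%s%s" % (src_nn, path),
            "hdfs://%s%s" % (dst_nn, path)]

cmd = distcp_update_cmd("old-nn:8020", "new-nn:8020", "/data")
print(" ".join(cmd))
```

The list form can be handed straight to `subprocess.call(cmd)` on a gateway node; re-running the same command each night only moves the newly added files.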
The Action
New Cluster setup
Given the experience of managing the old cluster, it was a priority for the team that any new cluster brought in would be set up via 100% automation. The new cluster configuration and setup are driven via Puppet and the Cloudera Manager API; the non-Hadoop components are installed via normal Puppet packages. After the setup of the new cluster via automation, the machines were ready to be loaded.
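As a flavour of what driving configuration through the Cloudera Manager API looks like, here is a minimal sketch that only builds the request; the host, cluster and service names are made up for illustration, and the actual call would be an HTTP PUT with basic auth (e.g. via requests):

```python
import json

# Hypothetical names for illustration only.
CM_HOST = "cm.example.com"
API_VERSION = "v5"

def service_config_request(cluster, service, props):
    """Build the URL and JSON body for a Cloudera Manager
    service-config update. CM expects config values as a list of
    {"name": ..., "value": ...} items."""
    url = "http://%s:7180/api/%s/clusters/%s/services/%s/config" % (
        CM_HOST, API_VERSION, cluster, service)
    body = {"items": [{"name": k, "value": v}
                      for k, v in sorted(props.items())]}
    return url, json.dumps(body)

url, body = service_config_request("new-cluster", "hdfs1",
                                   {"dfs_replication": "3"})
print(url)
print(body)
```

Because Puppet renders these property maps from source control, the same sketch covers both the first deployment and any later configuration change.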
Data Migration
During Friday night distcp was started in update mode to complete the data copy process, moving everything onto the new cluster. An overnight stay in the office was planned, and distcp with a World Cup soccer match was a good combination :) Since HBase is also part of production, we had to move it too. To move HBase we used a similar distcp copy process, after bringing down everything (except HDFS and MapReduce) on the source cluster. This approach is to be used if and only if HBase is down on both sides.
See the steps mentioned here on the Apache wiki.
The distcp was complete by 7:00 AM on Saturday morning, and the full data migration was verified by taking a full folder tree dump on the source and destination clusters and comparing the sizes.
The job of the overnight team was over, and with the dawn of Saturday morning the job of the code migration team was about to begin.
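The size comparison itself is simple once you have the two dumps. A sketch, assuming the dumps are in `hdfs dfs -du`-style "bytes path" lines (our real dump format may have differed):

```python
def parse_du(dump):
    """Parse '<bytes>  <path>' lines from a folder tree dump
    into a {path: size} map."""
    sizes = {}
    for line in dump.strip().splitlines():
        size, path = line.split(None, 1)
        sizes[path.strip()] = int(size)
    return sizes

def diff_trees(src_dump, dst_dump):
    """Return paths whose size differs or which are missing on the
    destination; an empty result means the copy verified clean."""
    src, dst = parse_du(src_dump), parse_du(dst_dump)
    return {p: (sz, dst.get(p)) for p, sz in src.items()
            if dst.get(p) != sz}

src = "1024 /data/scores\n2048 /data/raw"
dst = "1024 /data/scores\n1024 /data/raw"
print(diff_trees(src, dst))   # {'/data/raw': (2048, 1024)}
```

Anything the diff flags gets another pass of distcp in update mode before the cluster is handed over.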
Code Migration
The activity of sample-testing the existing production jobs had been carried out a few days in advance. This allowed us to find issues ranging from binary compatibility to network problems for Sqoop jobs, which needed a firewall to be opened for talking to the new cluster.
Given the large number of production jobs, testing and being confident that everything would work 100% on the new cluster was a challenging task.
One of the main focuses was to capture what was running at that moment in the cluster (thanks to our very strict development manager :) ). We ran extensive checks to capture the current state of the code in the system and moved it back to the code repository.
From over 300 jobs we found 3 Oozie workflows whose definitions in our code repo were out of date compared to what was running in production. With a large number of property files, Oozie workflow testing can be difficult.
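Finding those out-of-date workflows boils down to comparing each deployed definition against the repo copy. A minimal sketch of that check (the workflow names and XML here are invented examples):

```python
import hashlib

def file_digest(blob):
    """Hash a workflow definition's bytes so repo and production
    copies can be compared without a line-by-line diff."""
    return hashlib.md5(blob).hexdigest()

def stale_workflows(repo, prod):
    """Given {workflow_name: xml_bytes} maps for the code repo and
    for what is actually deployed, list the workflows whose deployed
    definition no longer matches the repo."""
    return sorted(name for name, blob in prod.items()
                  if name not in repo
                  or file_digest(repo[name]) != file_digest(blob))

repo = {"wf-ingest": b"<workflow-app name='ingest'/>"}
prod = {"wf-ingest": b"<workflow-app name='ingest' modified='yes'/>"}
print(stale_workflows(repo, prod))   # ['wf-ingest']
```

In practice the `prod` side would be pulled from HDFS (Oozie reads workflow XML from there), and any name the check prints goes back into source control before migration.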
I will write a new blog post on lessons learned and best practices in handling a large number of Oozie workflows, especially around regression testing and structuring the Oozie code.
Oozie is an awesome tool; we never had issues with it, and many of the jobs have been running silently for over 12 months now.
We created a new branch after dumping the current state of the production cluster code and started changing the old code base to the configuration specifications of the new cluster.
By late Saturday afternoon the code migration was complete and ready to be run on the new cluster. We ran our first job on the new cluster with success and shared the news with everyone over WhatsApp.
WhatsApp group messaging worked very well, I think, keeping all team members aware of the current happenings at base camp.
By late evening the work of the code migration shift was over, and a new team arrived to take over resuming the production jobs.
Trickle resuming the production jobs
The cluster is under SLA for downstream consumers, so we started by resuming the production jobs which had the highest priority. The team managing platform operations now had control in their hands. They started verifying the Oozie jobs after starting them incrementally on the new cluster. We have our own custom job tracking database, so one simple MySQL query gave a clear view of which job was having problems and needed attention.
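To show the idea (not our actual schema: the table layout and job names below are simplified assumptions, and sqlite3 stands in for our MySQL tracking database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE job_runs (
    job_name TEXT, priority INTEGER, status TEXT, started_at TEXT)""")
conn.executemany(
    "INSERT INTO job_runs VALUES (?, ?, ?, ?)",
    [("scores-daily",  1, "SUCCEEDED", "2014-06-21 19:02"),
     ("ingest-feed-7", 1, "FAILED",    "2014-06-21 19:10"),
     ("report-weekly", 3, "RUNNING",   "2014-06-21 19:15")])

# One simple query shows which resumed jobs need attention first.
rows = conn.execute("""
    SELECT job_name, priority FROM job_runs
    WHERE status = 'FAILED'
    ORDER BY priority, started_at""").fetchall()
print(rows)   # [('ingest-feed-7', 1)]
```

Running this after each resumed batch gives the operations shift a short, prioritised worklist instead of tailing hundreds of Oozie logs.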
Hive and Metadata
We took the dump of Hive MySQL metastore from the old cluster , created corresponding database on the new cluster. Since there is mirror cluster we also configured MySQL replication for the same.
The closing thoughts
We did everything without a single support call, ticket, or Apache mailing list email. This shows that our team is now capable and experienced enough to deal with a wide range of things in the Hadoop ecosystem.
However, there are a few lessons we need to learn from to make this a truly textbook Hadoop migration.
There is always something which is not in the code base. Instilling discipline in a large team so that everything lands in the code base is the most difficult thing. Last-minute fixes often lead to changes which are running in production but never find their way into source control. Hence we missed a few minor things. Although the changes did not affect our planned timelines, since we knew how to fix the issues (we had seen them earlier too), we fixed them, and the task of adding the fixes back to source control is the action point.
Automation is the key to managing a large number of clusters. The platform team did an awesome job in capturing the current state of the old cluster and tuning the required properties on the new cluster. All deployment of the new cluster was done via Puppet and the Cloudera Manager API, so any configuration changes were also driven by the same. The properties which the code migration team found missing during the actual resumption of production jobs were passed back into the loop to be added to Puppet.
Permissions: the jobs which failed on the new cluster failed due to permission issues. One action point we noted was to work out how to capture this information back into our Puppet setup so that all future deployments take care of folders, owners and permissions for us.
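Capturing that information could start from something as simple as parsing the directory listing of the old cluster. A sketch, assuming the CDH4-era `hdfs dfs -ls` line format (the sample owner, group and path are invented):

```python
def parse_hdfs_ls(line):
    """Parse one line of `hdfs dfs -ls` output into the fields needed
    to recreate ownership elsewhere: (permissions, owner, group, path).
    Field order assumed: perms, replication, owner, group, size,
    date, time, path."""
    parts = line.split(None, 7)
    perms, owner, group, path = parts[0], parts[2], parts[3], parts[7]
    return perms, owner, group, path

line = "drwxr-xr-x   - etl hadoop          0 2014-06-20 10:02 /data/scores"
print(parse_hdfs_ls(line))
# ('drwxr-xr-x', 'etl', 'hadoop', '/data/scores')
```

The tuples from a recursive listing could then be rendered into the Puppet manifests, so a future cluster build sets owners and modes before the first job runs.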
Oozie code structuring: managing Oozie code, given the large amount of XML and configuration, can be difficult and annoying at times. Understanding the concepts of Oozie globals, the share lib and bundles is very important when you are using it in production for large deployments. I will write an additional follow-up post on Oozie code.
-----
Goodbye for now; I will soon be back with another post on our next task, moving to YARN by the end of next quarter.