Upgrading a Large Hadoop Cluster

A long time back, I wrote a post about Migrating a Large Hadoop Cluster, in which I shared my experience of how we migrated between two Hadoop environments. Last weekend, we did another similar activity, which I thought I would document and share.

Prologue

We are a big Telco with many Hadoop environments, and this post is the upgrade story of one of those clusters.

Many weeks before The Weekend


For many, many days we had been working to upgrade one of our main Hadoop platforms. Since it was a major upgrade of the HDP stack, from version 2.6.4 to 3.1.5, it needed a lot of planning and testing. There were many things that helped us face the D-day with confidence, and I wanted to share them.

Practice upgrades

We did 3 practice upgrades in our development environment to make sure we knew exactly how each and every step would work and what kinds of issues we could face. A comprehensive knowledge base of all known errors and their solutions was built from this exercise. The document was shared with every team member involved in the upgrade, so that someone would remember the fix when we hit an issue during the real upgrade. It becomes extremely challenging when an issue appears and the clock is ticking to bring the cluster back up for the Monday workload. We also did one full-team practice run, so that everyone knew the steps and sequences involved before getting into the real one and got a feel for it.

Code changes

We made all the code changes required to ensure our existing applications could run comfortably on the new platform stack. Testing was done in a development environment that was stood up with the new stack versions. We opened it to all application teams and use cases so they could test the work they run on the platform.

Meeting the pre-requisites for the upgrade

One of the challenges we have is a massive amount of data. Being a Telco, our network feeds can fill the cluster very quickly, and a single wrong user query can fill it within hours. We had to keep strong control over what data came in and what queries users ran, to keep the total used cluster storage under 85%. Our cluster is a decent size, around 1.8 PB, so moving data to another environment whenever we are overusing HDFS is a normal flow for us.
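Just to illustrate the kind of watch we kept on storage, here is a minimal sketch of a usage check. It assumes the standard summary output of `hdfs dfsadmin -report`; the 85% ceiling is our own operational limit, not something Hadoop enforces.

```python
#!/usr/bin/env python3
"""Sketch of an HDFS usage check (assumes the `hdfs` CLI is on PATH).

The 85% threshold is our self-imposed ceiling, not anything Hadoop enforces.
"""
import re
import subprocess
import sys

THRESHOLD = 85.0  # our own ceiling for total HDFS usage

def cluster_dfs_used_percent() -> float:
    """Parse the cluster-wide 'DFS Used%' line from `hdfs dfsadmin -report`."""
    report = subprocess.run(
        ["hdfs", "dfsadmin", "-report"],
        capture_output=True, text=True, check=True,
    ).stdout
    # The first "DFS Used%" occurrence is the cluster summary;
    # later occurrences are per DataNode.
    match = re.search(r"DFS Used%:\s*([\d.]+)%", report)
    if not match:
        raise RuntimeError("Could not find 'DFS Used%' in the dfsadmin report")
    return float(match.group(1))

if __name__ == "__main__":
    used = cluster_dfs_used_percent()
    print(f"Cluster DFS used: {used:.1f}%")
    if used > THRESHOLD:
        print(f"WARNING: above the {THRESHOLD}% ceiling, time to move data off")
        sys.exit(1)
```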


One week before The Weekend

Imaginary upgrade

We did an exercise in which we brainstormed a fictitious upgrade and tried to get into the mindset of the steps and sequences we would follow. We listed every minor thing that came to mind, right from raising the change request to closing it after the completion of the upgrade. This imaginary exercise brought to our attention many things that had not been planned earlier and allowed us to get our ducks in a row: a precise order of steps to be executed.

Application and use case teams

In a large shared cluster environment, finding all the job dependencies and applications that are impacted is a challenge. We started sending bulk communications about the planned upgrade to all users of the platform a month in advance, so that we would eventually get the attention of every user and every application running on top of the platform, and remind them about the upcoming downtime.

Data feeds redirection

Many data feeds in the Telco space are very big, and we get only one opportunity to capture them; if we don't, that data is lost. To prepare for the downtime, we planned to redirect these feeds to an alternative platform, with a view to bringing them back to the main cluster post-upgrade. This exercise needs attention and proper impact analysis to work out which feeds would be lost permanently and which we could pull again from the source later.
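Purely to illustrate the shape of this exercise (our actual redirection was done in the ingestion flows themselves), here is a hypothetical sketch where feed destinations live in a simple JSON registry; the file name `feeds.json` and the cluster names are made up for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: flip every feed's destination before the downtime window.

Assumes a registry of the form {"feed_name": {"destination": "..."}, ...}.
"""
import json
from pathlib import Path

CONFIG = Path("feeds.json")  # hypothetical feed registry

def redirect_all(target: str) -> None:
    """Point every feed at `target` and persist the change."""
    feeds = json.loads(CONFIG.read_text())
    for name, feed in feeds.items():
        print(f"{name}: {feed['destination']} -> {target}")
        feed["destination"] = target
    CONFIG.write_text(json.dumps(feeds, indent=2))

if __name__ == "__main__":
    redirect_all("standby-cluster")    # before the upgrade window
    # redirect_all("primary-cluster")  # after the upgrade, to resume normal flow
```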

The time roster

A few days before the upgrade, we made a timeline view of the upgrade weekend. The goal was to be able to bring people in and out during the weekend, giving them rest as required. We divided ourselves into people who come into the picture before the upgrade to redirect and stop data feeds, people who do the upgrade, and people who come into the picture post-upgrade to resume jobs and stop the data feed redirection. Besides these groups, we also had a group of people acting as beta testers, checking all the user experience items over the weekend. This group structure gave everyone a clear idea of when they were entering the scene and what was expected of them.

The Weekend


Friday

We divided the upgrade into 8 different stages and decided to split the work, with the goal of doing the Ambari upgrade on Friday and getting through as much of the subsequent stages as possible on the same day. The Ambari upgrade was easy; we did not hit any blockers and were done within our planned time.

Saturday and Sunday

Our original estimate for the HDP and HDF upgrade, based on my past upgrade experiences, was around 20 hours. But due to 3 technical issues we faced, our timeline got pushed out by 15 hours. Cloudera's on-call engineers were very responsive in assisting us with those problems. Hadoop is a massive beast and no single person can know everything, so having access to SMEs from Cloudera when we needed them was a massive morale booster. We knew we had someone to call if we needed to, and they did jump in to resolve all the blockers we hit. So, a massive thank you to the Cloudera team.

Credits


Collaboration and COVID

This upgrade has been different for us. Due to COVID, like most companies worldwide, we have been working remotely for many weeks now. It would not be fair to finish without giving credit to Microsoft Teams. Teams has made it possible for us to work effectively since day 1 of the work-from-home environment. The core team of 4 people involved in the upgrade was hooked into a single Teams meeting session for 3 days, and we used its screen-sharing and document-sharing features to make it easier to get the job done.

Kids and families

Lastly, it is worth mentioning the patience of our families, who brought every meal to the computer so that we could keep working, and who took care of the kids during the long working hours. With the Teams meeting broadcasting for many hours, we could hear each other's kids (except for one of us, who is a bachelor :) ) shouting, trying to grab attention and wanting us to move away from the keyboard. With this upgrade over, we are now back to spending more and more time with them.

Weekend + 1 Monday


The upgrade has been successful, and project teams and users are slowly coming back live on the platform. Users are reporting the issues they face, and we are incrementally fixing them. Data has started to flow back into the platform, with the floodgates of the massive feeds to be opened later in the week, and things are slowly getting back to normal. Our users are excited about all the new functionality this upgrade brings, and I am proud of what we have achieved.

Massive planning and practice exercises have delivered a good outcome for us. We missed planning for a few things, but we will learn from them; that is what life is, isn't it?
Until the next upgrade, goodbye.

Thank you for reading. Please do leave a comment below.