Upgrading a Large Hadoop Cluster

A long time ago, I wrote a post about migrating a large Hadoop cluster, in which I shared our experience of migrating between two Hadoop environments. Last weekend we did a similar activity, which I thought was worth documenting and sharing.

Prologue

We are a big telco with many Hadoop environments. This post is the upgrade story of one of those clusters.

Many weeks before The Weekend


For many days we worked towards upgrading one of our main Hadoop platforms. Since it was a major upgrade of the HDP stack from version 2.6.4 to 3.1.5, it needed a lot of planning and testing. Several things helped us face D-day with confidence, and I want to share them here.

Practice upgrades

We did 3 practice upgrades in our development environment to make sure we knew exactly how each step would work and what kinds of issues we could face. Based on this exercise we built a comprehensive knowledge base of all known errors and their solutions. This document was shared with every team member involved in the upgrade, so that someone would recognise an issue when it showed up during the real upgrade. It becomes extremely challenging when you hit an issue and the clock is ticking to bring the cluster back up for the Monday workload. We also did one full-team practice run so that everyone knew the steps and their sequence and got a feel for the real thing.

Code changes

We made all the code changes required for our existing applications to run comfortably on the new platform stack. Testing was done in a development environment that was stood up with the new stack versions. We opened it to all application teams and use cases so they could test their work.

Meeting the prerequisites for the upgrade

One of our challenges is the massive amount of data. Being a telco, our network feeds can fill the cluster very quickly. We had to keep strong control over what data comes in and what queries users run in order to keep the total used cluster storage under 85%; a single wrong user query can fill the cluster within hours. Our cluster is of a decent size, around 1.8 PB. So moving data to another environment when we are overusing HDFS is a normal flow for us.
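
To make the 85% ceiling concrete, here is a minimal sketch of the headroom arithmetic. The function names, the used-storage figure and the ingest rate are illustrative assumptions, not values from our cluster; only the 1.8 PB capacity (taken as 1800 TB) and the 85% ceiling come from the text above.

```python
# Hypothetical helpers for reasoning about HDFS headroom.
# Sizes are whole terabytes so the arithmetic stays in integers.

def headroom_tb(capacity_tb, used_tb, ceiling_pct=85):
    """Terabytes that can still be written before usage crosses the ceiling."""
    ceiling_tb = capacity_tb * ceiling_pct // 100
    return max(0, ceiling_tb - used_tb)

def hours_until_ceiling(capacity_tb, used_tb, ingest_tb_per_hour, ceiling_pct=85):
    """Hours of ingest left at a constant rate before the ceiling is reached."""
    return headroom_tb(capacity_tb, used_tb, ceiling_pct) / ingest_tb_per_hour

# Illustrative numbers only: 1800 TB cluster, 1400 TB used, 10 TB/hour of feeds.
print(headroom_tb(1800, 1400))              # TB left under the 85% ceiling
print(hours_until_ceiling(1800, 1400, 10))  # hours before the ceiling is hit
```

With these made-up inputs the ceiling sits at 1530 TB, which shows how a heavy feed or a runaway query can use up the remaining space within hours.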


One week before The Weekend

Imaginary upgrade

We did an exercise in which we brainstormed a fictitious upgrade and tried to get into the mindset of which steps we would perform, and in what sequence. We listed every minor thing that came to mind, from raising the change request to closing it after the completion of the upgrade. This imaginary exercise brought to our attention many things that had not been planned earlier and let us line up our ducks in a precise order of steps to be executed.

Applications upgrade and use case teams

In a large shared cluster environment, finding all the job dependencies and impacted applications is a challenge. We started sending bulk communications about the planned upgrade to all platform users 1 month in advance, so that eventually every user and every application team running on the platform had noticed it and been reminded about the upcoming downtime.

Data feeds redirection

Many data feeds in the telco space are very big. We get only one opportunity to capture them, and if we miss it, we lose that data. To prepare for the downtime we planned to redirect these feeds to an alternative platform, with a view to bringing them back to the main cluster post-upgrade. This exercise needs attention and a proper impact analysis to determine whether a feed would be lost permanently or whether we can fetch it from the source later.

The time roster

A few days before the upgrade we made a timeline view of the upgrade weekend. The goal was to bring people in and out during the weekend, giving them rest as required. We divided ourselves into people who come in before the upgrade to redirect and stop data feeds, people who do the upgrade, and people who come in post-upgrade to resume jobs and stop the data feed redirection. Besides these groups, we also had a group of beta testers to test all user experience items over the weekend. This group structure gave a clear idea of when people enter the scene and what is expected from them.

The Weekend


Friday

We divided the upgrade into 8 stages and split the work across the weekend, with the goal of doing the Ambari upgrade on Friday plus as much as possible of the subsequent stages. The Ambari upgrade was easy: we did not hit any blocker and finished within our planned time.

Saturday and Sunday

Our original estimate for the HDP and HDF upgrade, based on my past experience of upgrades, was around 20 hours. But 3 technical issues pushed our timeline out by 15 hours. The Cloudera on-call engineers were very responsive in assisting us with those problems. Hadoop is a massive beast and no single person can know everything, so having access to SMEs from Cloudera when we needed them was a massive morale booster. We knew we had someone to call if we needed to, and they did jump in to resolve all the blockers we hit. So, a massive thank you to the Cloudera team.

Credits


Collaboration and COVID

This upgrade has been different for us. Due to COVID, like companies worldwide, we have been working remotely for the past many weeks. It would not be fair to leave out credit to Microsoft Teams here. Since day 1 of working from home, Microsoft Teams has made it possible for us to work effectively. Our core team of 4 people involved in the upgrade was hooked into one Teams meeting session for 3 days. We used the screen sharing and document sharing features of Teams to make it easier to get the job done.

Kids and families

Lastly, it's worth mentioning the patience of our families, who brought all our meals to the computer so that we could work and took care of the kids during the long working hours. With the Teams meeting running for many hours, we could hear each other's kids (except for 1 of us, who is a bachelor :) ) shouting, trying to grab attention and wanting us to move away from the keyboard. With this upgrade over, we are now back to spending more time with them.

Weekend + 1 Monday


The upgrade has been successful, and project teams and users are slowly coming back live on the platform. Users are reporting the issues they face, and we are fixing them incrementally. Data has started to flow back into the platform, with the flood gates of the massive feeds to be opened later in the week, and things are slowly getting back to normal. Our users are excited about the new functionality this upgrade brings, and I am proud of what we have achieved.

Massive planning and practice exercises have delivered a good outcome for us. We missed planning for a few things, but we will learn from them. That is what life is, isn’t it?
Until the next upgrade, goodbye.

Thank you for reading. Please do leave a comment below









Replace the SSH key of an AWS EC2 machine

You can follow the steps below to change the SSH key for an AWS EC2 machine.

Step 1)

Check that your existing SSH key works and that you can log in to the machine with it. You can also log in directly via the connect function in the AWS console.

Step 2)

Generate a new SSH key pair via the AWS console and download its .pem file.

Step 3)

Extract the public key from the downloaded private key using the command below:

ssh-keygen -y -f ~/Downloads/second.pem

If you are working on a Windows system, use this: https://www.puttygen.com/convert-pem-to-ppk


Step 4)

Log in to the machine with the existing key and edit the authorized keys file:

vi ~/.ssh/authorized_keys

Add the new public key on its own line, keeping the old key in place for now, and save the file.

Step 5)

On your local machine, change the permissions of the new private key file to 400, then try to log in with the new key.

Step 6)

If the login is successful, delete the old key from the authorized keys file.
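
Steps 4 and 6 above amount to two small edits to the text of the authorized keys file. The sketch below shows them in Python, working on the file content as a string. The function names and the sample keys are hypothetical, and it assumes one key per line, which is how authorized_keys is laid out.

```python
# Hypothetical helpers that mirror steps 4 and 6 on the file's text content.

def add_key(authorized_keys_text, new_public_key):
    """Return the content with new_public_key appended, unless already present."""
    lines = [ln.strip() for ln in authorized_keys_text.splitlines() if ln.strip()]
    new_key = new_public_key.strip()
    if new_key not in lines:
        lines.append(new_key)
    return "".join(ln + "\n" for ln in lines)

def remove_key(authorized_keys_text, old_public_key):
    """Return the content with every line equal to old_public_key removed."""
    old_key = old_public_key.strip()
    lines = [ln.strip() for ln in authorized_keys_text.splitlines() if ln.strip()]
    return "".join(ln + "\n" for ln in lines if ln != old_key)

# Made-up keys for illustration; real ones are much longer.
old = "ssh-rsa AAAAold first"
new = "ssh-rsa AAAAnew second"
content = add_key(old + "\n", new)   # step 4: both keys are now authorized
content = remove_key(content, old)   # step 6: only after the new key is verified
print(content)
```

Keeping both keys in the file until the new login is confirmed is the safety net of the procedure: if the new key does not work, the old one still lets you in.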


Tips for AWS Professional certification exams

I recently took an AWS Professional exam and found certain things that are worth documenting and sharing with all.
This post is mostly about non-technical things. For technical pointers please see the other related posts.
AWS Professional exams not only test your AWS knowledge but also test your mental and physical strength. You have to sit there for 3 hours staring at the screen and reading very long texts in the questions and answers.
Below are the things which can be useful for you.
Stretch at regular intervals
Every 30-45 mins, stand up and stretch your legs, or just stretch your hands while sitting. This keeps your blood circulating and helps you stay active.
Time management
AWS Professional exams are a fight against time, so you need to keep a constant eye on the clock that ticks on the screen. As a rough guideline, try to complete 30 questions in the first 60 mins, 30 in the next 60 mins and the remaining 15 questions in 30 mins. Try to spend just 2 mins per question, mark anything you are doubtful about for review, and move on.
Time left (mins)    Questions pending
170                 75
110                 45
50                  15
20                  0
In the last 20 mins, try to review the questions you marked during the first pass.
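
The checkpoints above can be derived from the pacing guideline with a few lines of Python. This is only a sketch: the function name and its defaults are my own, and the 170-minute starting point and the 30/30/15 split are taken from the guideline in this section.

```python
def pacing_checkpoints(total_minutes=170, blocks=((60, 30), (60, 30), (30, 15))):
    """Return (minutes_left, questions_pending) at the start and after each block.

    Each block is (minutes_spent, questions_answered) for the first pass.
    """
    minutes_left = total_minutes
    pending = sum(questions for _, questions in blocks)
    rows = [(minutes_left, pending)]
    for minutes, questions in blocks:
        minutes_left -= minutes
        pending -= questions
        rows.append((minutes_left, pending))
    return rows

for minutes_left, pending in pacing_checkpoints():
    print(minutes_left, pending)
```

If you find a different split more comfortable, pass your own blocks and note down the resulting checkpoints before the exam.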
Question reading
Read what is really asked first. Don’t read the question from top to bottom and only then move on to the answers.
To make this clear, see the example question below from the AWS Solutions Architect Professional exam.
Question.
Your company’s on-premises content management system has the following architecture:

  • Application Tier – Java code on a JBoss application server
  • Database Tier – Oracle database regularly backed up to Amazon Simple Storage Service (S3) using the Oracle RMAN backup utility
  • Static Content – stored on a 512GB gateway stored Storage Gateway volume attached to the application server via the iSCSI interface

Which AWS based disaster recovery strategy will give you the best RTO? 

A) Deploy the Oracle database and the JBoss app server on EC2. Restore the RMAN Oracle backups from Amazon S3. Generate an EBS volume of static content from the Storage Gateway and attach it to the JBoss EC2 server.

B) Deploy the Oracle database on RDS. Deploy the JBoss app server on EC2. Restore the RMAN Oracle backups from Amazon Glacier. Generate an EBS volume of static content from the Storage Gateway and attach it to the JBoss EC2 server.

C) Deploy the Oracle database and the JBoss app server on EC2. Restore the RMAN Oracle backups from Amazon S3. Restore the static content by attaching an AWS Storage Gateway running on Amazon EC2 as an iSCSI volume to the JBoss EC2 server.

D) Deploy the Oracle database and the JBoss app server on EC2. Restore the RMAN Oracle backups from Amazon S3. Restore the static content from an AWS Storage Gateway-VTL running on Amazon EC2
----
When you see this question, the first thing you should read is the actual ask, here the line "Which AWS based disaster recovery strategy will give you the best RTO?". This gives you an idea of what is really needed. Then quickly go up, read the full question and start reading the answer options. Keep eliminating answers until you are able to find the correct one for what that line asks.

Elimination
While reading, keep eliminating the wrong answers and iterate until you reach the most suitable answer for what the question asks. If you are stuck between 2 choices and cannot decide on the final one, mark the question for review, pick one of the two as your answer, and note on the provided notepad the question number and the choices between which the final battle is to be decided. You can come back to this question at the end of the exam if time permits. You will very rarely get time to review, so you have to mark something in the first pass and decide later only if there is time left.


Water bottle
A clear water bottle is allowed. Keep it next to you and take a sip when your brain starts steaming in the middle of the exam. Your brain will jam at regular intervals and you will need some fuel to keep it going :)

Thank you for reading. I am done with AWS DevOps Professional and am now preparing for the SA Pro. What are your tips for the AWS Professional exams?

How to pass the AWS DevOps Engineer Professional Exam

Hi,

I recently cleared the AWS DevOps Engineer Professional exam.

Below is what I did; hopefully it can be helpful to you for the exam as well.

Course:

  • Stephane Maarek's Udemy course. I did everything he suggested: the readings (he suggests lots of things to read), the videos (which I watched 2 times) and the labs. I also did all 3 Associate exams following Stephane, and this course does assume that you know the basics.
  • AWS official DevOps exam readiness course. This course explains how to approach the questions and the test. I highly suggest you take it, and it is a free course.

Practice tests

This exam tests your time management, so keep an eye on the clock. Try to finish 30 questions in the first hour, 30 in the second hour, and the last 15 plus the questions you marked for review in the remaining time. I was very slow and managed to finish the exam with only 5 mins to spare.

Good luck

How to pass the AWS Certified Developer exam

I cleared the AWS Certified Developer Associate exam with a score of 968. Below is my blueprint for success, which you can follow.

Study material
  • Udemy Stephane Maarek course
  • Linux Academy course
Practice tests
  • Whizlabs
  • Udemy Stephane Maarek practice tests 
If you are short of time, just do the Stephane Maarek course and practice with the Whizlabs and Udemy Stephane Maarek practice tests.

Good luck

Patterns

Architectural

https://en.wikipedia.org/wiki/Architectural_pattern


Software

https://en.wikipedia.org/wiki/Software_design_pattern

Convert webpage to pdf using Python

If you don't want to use Python, one easy way is to use the website https://www.web2pdfconvert.com/

If you want to use Python, see the library below:

https://pypi.org/project/pdfkit/
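
A minimal sketch of how pdfkit can be used for this. It assumes pdfkit is installed (pip install pdfkit) together with the wkhtmltopdf binary that pdfkit relies on. The helper names pdf_name_for and save_page_as_pdf are my own, not part of the library; the only pdfkit call used is pdfkit.from_url.

```python
from urllib.parse import urlparse

def pdf_name_for(url):
    """Derive an output file name such as 'example.com.pdf' from a URL."""
    host = urlparse(url).netloc or "page"
    return host.replace(":", "_") + ".pdf"

def save_page_as_pdf(url, output_path=None):
    """Render the page at url into a PDF file and return the file path."""
    # Imported here so the naming helper works even without pdfkit installed.
    import pdfkit
    output_path = output_path or pdf_name_for(url)
    pdfkit.from_url(url, output_path)
    return output_path

if __name__ == "__main__":
    # Needs network access and wkhtmltopdf on the PATH.
    print(save_page_as_pdf("https://en.wikipedia.org/wiki/Software_design_pattern"))
```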