Hadoop 3 install tutorial: single node

The purpose of this post is to explain how to install Hadoop 3 on your computer. It assumes you have a Linux-based system available; I am doing this on an Ubuntu system.

Before you begin, create a separate user named hadoop on the system and perform all of these operations as that user.
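A minimal sketch of creating and switching to such a user (adding it to the sudo group is my assumption; it simply lets the new user run the sudo commands in this post):

sudo adduser hadoop           # create the dedicated hadoop user
sudo usermod -aG sudo hadoop  # optional: allow it to run the sudo commands below
su - hadoop                   # switch to the new user before continuing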

This document covers the steps to
1) Configure SSH
2) Install the JDK
3) Install Hadoop

Update your repository

sudo apt-get update

You can copy the commands directly from this post and run them on your system.
Hadoop requires that the various machines in a cluster can talk to each other freely. Hadoop uses SSH to prove identity for these connections.

Let's download and configure SSH

sudo apt-get install openssh-server openssh-client
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
sudo chmod go-w $HOME $HOME/.ssh
sudo chmod 600 $HOME/.ssh/authorized_keys
sudo chown `whoami` $HOME/.ssh/authorized_keys

Testing your SSH

ssh localhost

Say yes when prompted to accept the host key.

It should open an SSH connection without asking for a password.

exit

This will close the SSH session.

Java 8 is required to run Hadoop 3.

Let's download and install the JDK

sudo apt-get update
sudo apt-get install openjdk-8-jdk

Now comes the Hadoop :)

Download the Hadoop 3.2.1 tarball to your computer and extract it to a directory; let's call that directory HADOOP_HOME.
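A minimal sketch of this step, assuming you fetch the tarball from the Apache archive and extract it to the same path used later in this post:

# download and extract Hadoop 3.2.1 (adjust the target directory to your environment)
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
mkdir -p /home/jj/dev/softwares
tar -xzf hadoop-3.2.1.tar.gz -C /home/jj/dev/softwares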

Export the following environment variables on your computer, changing the paths according to your environment.

export HADOOP_HOME="/home/jj/dev/softwares/hadoop-3.2.1"
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
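A quick sanity check that the PATH and variables are picked up (add the exports to your ~/.bashrc if you want them to persist across shells):

hadoop version         # should report Hadoop 3.2.1
echo $HADOOP_CONF_DIR  # should point to .../hadoop-3.2.1/etc/hadoop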

We need to modify/create the following property files in the $HADOOP_HOME/etc/hadoop directory.

Edit core-site.xml with the following contents

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost</value>
  </property>
</configuration>


Edit hdfs-site.xml with the following contents

<configuration>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/jj/dev/softwares/hadoop-3.2.1/hadoop_space/hdfs/dfs/name</value>
    <description>Determines where on the local filesystem the DFS name node
      should store the name table.  If this is a comma-delimited list
      of directories then the name table is replicated in all of the
      directories, for redundancy. </description>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/jj/dev/softwares/hadoop-3.2.1/hadoop_space/hdfs/dfs/data</value>
    <description>Determines where on the local filesystem a DFS data node
       should store its blocks.  If this is a comma-delimited
       list of directories, then data will be stored in all named
       directories, typically on different devices.
       Directories that do not exist are ignored.
    </description>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>



The paths
file:/home/jj/dev/softwares/hadoop-3.2.1/hadoop_space/hdfs/dfs/name and
file:/home/jj/dev/softwares/hadoop-3.2.1/hadoop_space/hdfs/dfs/data
are folders on your computer that provide the space to store the data blocks and the name/edit files.
The paths should be specified as URIs.
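It does not hurt to create these folders up front; a small sketch using the example paths above:

mkdir -p /home/jj/dev/softwares/hadoop-3.2.1/hadoop_space/hdfs/dfs/name
mkdir -p /home/jj/dev/softwares/hadoop-3.2.1/hadoop_space/hdfs/dfs/data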

Edit mapred-site.xml inside the etc/hadoop directory with the following contents

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
    <name>mapred.system.dir</name>
    <value>file:/home/jj/dev/softwares/hadoop-3.2.1/hadoop_space/mapred/system</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>file:/home/jj/dev/softwares/hadoop-3.2.1/hadoop_space/mapred/local</value>
    <final>true</final>
  </property>
      <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>


The paths
file:/home/jj/dev/softwares/hadoop-3.2.1/hadoop_space/mapred/system and
file:/home/jj/dev/softwares/hadoop-3.2.1/hadoop_space/mapred/local
are folders on your computer that provide the space to store MapReduce data.
The paths should be specified as URIs.
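Again, these folders can be created up front; a small sketch using the example paths above:

mkdir -p /home/jj/dev/softwares/hadoop-3.2.1/hadoop_space/mapred/system
mkdir -p /home/jj/dev/softwares/hadoop-3.2.1/hadoop_space/mapred/local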

Edit yarn-site.xml with the following contents

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>


Inside the etc/hadoop directory
Edit (or create) the file hadoop-env.sh and add the following to it

export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"

Change the path above to match where JAVA_HOME is located on your machine.
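If you are not sure where the JDK lives, one way to find it (assuming the OpenJDK 8 package installed earlier):

readlink -f "$(which javac)" | sed 's:/bin/javac::'   # prints the JDK directory, e.g. /usr/lib/jvm/java-8-openjdk-amd64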

Save it, and now we are ready to format.

Format the namenode
# hdfs namenode -format
Say yes and let it complete the format.

Time to start the HDFS daemons
# hdfs --daemon start namenode
# hdfs --daemon start datanode

Start the YARN daemons
# yarn --daemon start resourcemanager
# yarn --daemon start nodemanager

Time to check if Daemons have started
Enter the command
# jps

2539 NameNode
2744 NodeManager
3075 Jps
3030 DataNode
2691 ResourceManager
Time to launch the UIs

Open http://localhost:8088 to see the ResourceManager page
Open http://localhost:9870 to see the NameNode page
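As an optional smoke test, you can run a few basic HDFS commands to confirm everything works:

hdfs dfs -mkdir -p /user/$(whoami)                             # create a home directory in HDFS
hdfs dfs -put $HADOOP_CONF_DIR/core-site.xml /user/$(whoami)/  # upload a small file
hdfs dfs -ls /user/$(whoami)                                   # the file should be listed here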

Done :)
Happy Hadooping :)

Maven package org.apache.commons.httpclient.methods does not exist

The Apache HTTP client comes in two different artifacts:

https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient

https://mvnrepository.com/artifact/commons-httpclient/commons-httpclient

From

https://stackoverflow.com/questions/10986661/apache-httpclient-does-not-exist-error

Solution:

Use the older (3.x) version of the client, which still contains the org.apache.commons.httpclient.methods package:

<dependency>
  <groupId>commons-httpclient</groupId>
  <artifactId>commons-httpclient</artifactId>
  <version>3.1</version>
</dependency>
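To confirm the dependency is being resolved in your build, something like the following should show it in the tree (the -Dincludes filter is just a convenience):

mvn dependency:tree -Dincludes=commons-httpclient:commons-httpclient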



Ubuntu 18.04 customizations

Ubuntu comes with lots of good options to configure the system.

A few of the things I like are mentioned below.

Enable Gnome Shell extensions and Windows like themes

https://www.howtogeek.com/353819/how-to-make-ubuntu-look-more-like-windows/


sudo apt install gnome-shell-extensions gnome-shell-extension-dash-to-panel
sudo apt install gnome-tweaks adwaita-icon-theme-full
 
Install a few good extensions

https://itsfoss.com/things-to-do-after-installing-ubuntu-18-04/

To manage GNOME Shell extensions from the browser, enable the browser extension and also the native host connector.
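On Ubuntu 18.04 the host connector is available as a package (assuming you use Firefox or a Chromium-based browser):

sudo apt install chrome-gnome-shell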

Once you do that, you will see the toggle button for any GNOME extension right inside the browser window.

You can also configure that in the Tweaks application.
 
 

Gradle Could not create service of type ScriptPluginFactory

Error

Could not create service of type ScriptPluginFactory using BuildScopeServices.createScriptPluginFactory().


Detailed exception

[jj@184fc3b978cc bigtop]$ ./gradlew clean

FAILURE: Build failed with an exception.

* What went wrong:
Could not create service of type ScriptPluginFactory using BuildScopeServices.createScriptPluginFactory().
> Could not create service of type CrossBuildFileHashCache using BuildSessionScopeServices.createCrossBuildFileHashCache().

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 0s

Solution

The directory containing the code was owned by root:root.

Change the ownership back to your user and it should work.
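A minimal sketch of the fix, assuming the project is checked out at ~/bigtop (adjust the path to wherever your code lives):

# give the checkout, and Gradle's working files inside it, back to your user
sudo chown -R "$USER":"$USER" ~/bigtop
cd ~/bigtop && ./gradlew clean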

Clevo P570WM Ubuntu freeze problem

TL;DR

To get Ubuntu running on a Clevo laptop, use kernel version 5.6.15 or newer. A sketch of the kernel upgrade steps is below.
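A minimal sketch, assuming you install a mainline kernel build from kernel.ubuntu.com (the exact .deb file names vary per build, so check the index page):

# download the generic headers, modules and image .deb packages for v5.6.15 from
# https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.6.15/ into an empty directory
cd ~/Downloads/kernel-5.6.15
sudo dpkg -i *.deb
sudo reboot
# after the reboot, confirm the running kernel
uname -r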

Seven years back, in 2013, I bought a P570WM laptop from Metabox. I made one big mistake: I invested money in something that was brand new to the market. My goal was to get a good Ubuntu laptop, and I decided to go with a Clevo-based machine.

Below are the specs. They might not look impressive now, but considering they are 7 years old, they were really good back then.

  • Screen type: 17.3" FHD 1920x1080 LED/LCD 
  • Graphics: Nvidia GTX 780M 4GB GDDR5 video graphics
  • Processor: i7-3970X 6-Core 3.5GHz - 4.0GHz 15MB Cache
  • RAM memory: 32GB DDR3 1600Mhz RAM 
  • Primary drive: 1TB 7200 rpm primary hard drive

I was very happy when the laptop arrived, but my nightmare began when I installed Ubuntu on it and it froze immediately on boot.

Unfortunately, Metabox was of no help; they said they don't support Ubuntu, and I was left with a massive waste of money: a $5.5K machine of no use.

Fast forward to 2020: the laptop was collecting dust in my cupboard, I decided to give it another shot, and I am glad I did, because it worked. I am using Ubuntu 18.04 with kernel 5.6.15 and no custom drivers for the Nvidia card.

The posts that gave me hope to keep going are listed below:

https://edgcert.com/2019/06/03/ubuntu-on-clevo/
https://forum.manjaro.org/t/freezes-when-probing-system-on-clevo-n850hk1/60346
https://askubuntu.com/questions/1068161/clevo-n850el-crashes-freezes-ubuntu-18-04-1-frequently

Related Kernel bug
https://bugzilla.kernel.org/show_bug.cgi?id=109051








Upgrading Large Hadoop Cluster

A long time back, I wrote a post about migrating a large Hadoop cluster, in which I shared my experience of how we did a migration between two Hadoop environments. Last weekend, we did another similar activity, which I thought I would document and share.

Prologue

We are a big telco with many Hadoop environments, and this post is the upgrade story of one of our clusters.

Many weeks before The Weekend


For many days, we had been working on upgrading one of our main Hadoop platforms. Since it was a major upgrade of the HDP stack from version 2.6.4 to 3.1.5, it needed a lot of planning and testing. Many things helped us face the D-day with confidence, and I want to share them.

Practice upgrades

We did three practice upgrades in our development environment to make sure we knew exactly how each step would work and what kinds of issues we could face. A comprehensive knowledge base of all known errors and solutions was built from this exercise. This document was shared with all team members involved in the upgrade activity, so that someone would remember the fix when we hit an issue during the real upgrade. Things become extremely challenging when you hit an issue and the clock is ticking to bring the cluster back up for the Monday workload. We also did one full-team practice run, so that every team member knew the steps and sequences involved before getting into the real one and everyone got a feel for it.

Code changes

We made all the code changes required to ensure our existing applications could run comfortably on the new platform stack. Testing was done in a development environment that was stood up with the new stack versions. We opened it to all application teams and use cases so they could test their workloads.

Meeting the prerequisites for the upgrade

One of our challenges is the massive amount of data. Being a telco, our network feeds can fill the cluster very quickly. We had to keep strong control over what data comes in and what queries users run to keep the total used cluster storage under 85%; a single wrong user query can fill the cluster within hours. Our cluster is a decent size, around 1.8 PB, so moving data to another environment when we are over-using HDFS is also a normal flow for us.


One week before The weekend

Imaginary upgrade

We did an exercise in which we brainstormed a fictitious upgrade and tried to get into the mindset of the steps and sequences we would follow to do the upgrade. We listed every minor thing that came to mind, right from raising the change request to closing it after the completion of the upgrade. This imaginary exercise brought to our attention many things that had not been planned earlier and allowed us to get our ducks in a row: a precise order of steps to be executed.

Applications upgrade and use case teams

In a large shared cluster environment, finding all the job dependencies and applications that are impacted is a challenge. We started sending bulk communications about the planned upgrade to all users of the platform one month in advance, so that we would eventually get the attention of every user and application running on top of the platform and remind them about the upcoming downtime.

Data feeds redirection

Many data feeds in the telco space are very big. We get the opportunity to capture them only once, and if we don't, we lose that data. To prepare for the downtime, we planned to redirect these feeds to an alternative platform, with a view to bringing them back to the main cluster post-upgrade. This exercise needs attention and proper impact analysis to determine whether feeds could be lost permanently or could be pulled again from the source later.

The time roster

A few days before the upgrade, we made a timeline view of the upgrade weekend. The goal was to be able to bring people in and out during the weekend, giving them rest as required. We divided people into those who come in before the upgrade to redirect and stop data feeds, those who do the upgrade, and those who come in post-upgrade to resume jobs and stop the data feed redirection. Besides these groups, we also had a group of people acting as beta testers, checking all user-experience items over the weekend. This group structure gave everyone a clear idea of when they were entering the scene and what was expected of them.

The Weekend


Friday

We divided the upgrade into 8 different stages and decided to split the work, with the goal of doing the Ambari upgrade on Friday and getting as far as possible into the subsequent stages the same day. The Ambari upgrade was easy; we did not hit any blockers and were done within our planned time.

Saturday and Sunday

Our original estimate for the HDP and HDF upgrade, based on my past upgrade experiences, was around 20 hours. But due to three technical issues we faced, our timeline was pushed out by 15 hours. The Cloudera on-call engineers were very responsive in assisting us with those problems. Hadoop is a massive beast, and no single person can know everything, so having access to SMEs from Cloudera when we needed them was a massive morale booster. It felt like we had someone to call if we needed to, and they did jump in to resolve all the blockers we hit. So, a massive thank you to the Cloudera team.

Credits


Collaboration and COVID

This upgrade has been different for us. Due to COVID, like all companies worldwide, we have been working remotely for the past many weeks. It would not be fair to skip giving credit to Microsoft Teams for this. Teams has made it possible to work effectively since day one of working from home. Our core team of four people involved in the upgrade was hooked into one Teams meeting session for three days. We used the screen-sharing and document-sharing features of Teams to make it easier to get the job done.

Kids and families

Lastly, it is worth mentioning the patience of our families, who brought meals right next to the computer so that we could keep working, and who took care of the kids during the long working hours. With the Teams meeting broadcasting for many hours, we could hear each other's kids (except for one of us, who is a bachelor :) ) shouting, trying to grab attention, and wanting us to move away from the keyboard. With this upgrade over, we are back to spending more and more time with them.

Weekend + 1 Monday


The upgrade has been successful, and project teams and users are slowly coming back onto the platform. Users are reporting the issues they face, and we are fixing them incrementally. Data has started to flow back into the platform, with the floodgates of the massive feeds to be opened later in the week, and things are slowly getting back to normal. Our users are excited by the new functionality this upgrade brings, and I am proud of what we have achieved.

Massive planning and the practice exercises delivered a good outcome for us. We missed planning for a few things, but we will learn from them; that is what life is about, isn't it?
Until the next upgrade, goodbye.

Thank you for reading. Please do leave a comment below.









Replace ssh key of the AWS EC2 machine

You can follow the steps below to change the SSH key for an AWS EC2 machine.

Step 1)

Check that your existing SSH key works and that you can log in to the machine with it. You can also log in directly from the AWS console (for example, via EC2 Instance Connect).

Step 2)

Generate a new SSH key pair via the AWS web console and download the .pem file.

Step 3)

Extract the public key from it using the command below.

ssh-keygen -y -f ~/Downloads/second.pem

If you are working on a Windows system, use PuTTYgen: https://www.puttygen.com/convert-pem-to-ppk


Step 4)

Log in to the machine and edit the file.

vi ~/.ssh/authorized_keys

Add the new public key and check that you are able to log in with the new key.
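A one-liner sketch that appends the new public key from your local machine (it assumes the old key still works for the ssh hop; user@ec2-host is a placeholder for your login and instance address):

ssh-keygen -y -f ~/Downloads/second.pem | ssh user@ec2-host 'cat >> ~/.ssh/authorized_keys'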

Step 5)

Change the permissions of the new key file to 400 and try to log in.
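For example (the ubuntu login and the host below are placeholders; use your instance's user and address):

chmod 400 ~/Downloads/second.pem
ssh -i ~/Downloads/second.pem ubuntu@ec2-host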

Step 6)

If the login is successful, delete the old key from the authorized_keys file.