How to install Hadoop on Ubuntu 20.04?
Overview
Hadoop is an open-source framework used to store and process large data sets in a distributed computing environment. It runs on low-cost commodity hardware, making it affordable for businesses. Hadoop consists of two major core components: HDFS and MapReduce. The Hadoop Distributed File System (HDFS) stores data across multiple servers in a cluster, while MapReduce is a processing engine that distributes the processing of large data sets across the cluster.
In this tutorial, we'll walk you through the process of installing Hadoop on Ubuntu 20.04.
Prerequisites
There are certain prerequisites that need to be met before you begin:
Ubuntu 20.04 server configured on your system
A regular user with sudo privileges
Internet connection
Get Started
Step 1: Java Installation
First, update your system by opening the terminal and running the following commands:
sudo -i
sudo apt-get update && sudo apt-get upgrade
Install OpenJDK 11, or any other recent version of your choice, by running the following command:
sudo apt-get install openjdk-11-jdk

Once the installation is complete, let's verify it with the following command:
java -version

Step 2: Creating Hadoop User
We will now create a dedicated user using the 'adduser' command.
sudo adduser hadoopuser

You will be asked to set a password for the new user and to provide the full name and other optional details. To confirm that the details entered are correct, type Y.
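If you want to confirm that the account was created, you can check it with the id command:
id hadoopuser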
To switch to the Hadoop user that we have created, which in this case is 'hadoopuser', we need to run the following command:
su - hadoopuser
Step 3: Generating Public and Private key-pairs
Once that's done, we can generate the private and public key pairs using the command below:
ssh-keygen -t rsa

When prompted, press Enter to accept the default file location for the key pair and leave the passphrase empty, so that the Hadoop daemons can connect over SSH without a passphrase prompt.
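Alternatively, if you prefer to skip the prompts entirely, you can generate the key pair non-interactively with an empty passphrase:
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa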
Once the key pairs have been generated, we can add them to the SSH authorized_keys, using the command below:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Step 4: Permitting and Authorizing key-pairs
Since we have saved the key pair to the SSH authorized keys, we will now need to update the file permissions to 640. This will ensure that only the file owner (i.e., us) will have both read and write permissions, while the group will only have read permissions. No permissions will be granted to other users.
chmod 640 ~/.ssh/authorized_keys

Verify passwordless SSH access to localhost using the following command:
ssh localhost

Enter yes to continue.
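A successful login drops you into a new shell on the same machine; type exit to return to your previous session before continuing:
exit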
Step 5: Downloading and Installing Hadoop
To install the Hadoop framework on your system, download it with the following wget command. You may download any recent release; here we use the stable Hadoop version 3.2.4.
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz
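Optionally, you can verify the integrity of the download. Assuming the matching .sha512 checksum file is published alongside the tarball, compare its contents with the checksum you compute locally:
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz.sha512
cat hadoop-3.2.4.tar.gz.sha512
sha512sum hadoop-3.2.4.tar.gz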

Once the download is complete, extract the hadoop-3.2.4.tar.gz file using the tar command.
tar -xvzf hadoop-3.2.4.tar.gz

You can rename the extracted directory using the command provided below:
mv hadoop-3.2.4 hadoop
Step 6: Setting up Hadoop
Next, you will need to configure the Java environment variables for Hadoop. To do this, start by finding the Java installation path that will be used as JAVA_HOME:
dirname $(dirname $(readlink -f $(which java)))
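On a default OpenJDK 11 installation, this typically prints a path similar to the following (your path may differ):
/usr/lib/jvm/java-11-openjdk-amd64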

Open the ~/.bashrc file in the text editor of your choice. Here, we are using the nano text editor.
nano ~/.bashrc
Add the following lines at the end of the file:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/home/hadoopuser/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Once you've added them, save the changes by pressing CTRL+O and exit the editor with CTRL+X.
Use the following command to load the new environment variables into your current session:
source ~/.bashrc
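To confirm that the variables took effect, you can print HADOOP_HOME and ask Hadoop for its version (assuming Hadoop was extracted to /home/hadoopuser/hadoop as above):
echo $HADOOP_HOME
hadoop version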
Hadoop also reads JAVA_HOME from its own environment file. Open it with the following command and add (or uncomment) the JAVA_HOME line shown below:
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Step 7: Configuring Hadoop for Ubuntu

For configuring Hadoop properly, it is necessary to create two directories, "namenode" and "datanode", inside the Hadoop user's home directory, using the following commands:
mkdir -p ~/hadoopdata/hdfs/namenode
mkdir -p ~/hadoopdata/hdfs/datanode
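You can confirm the directory layout with:
ls -R ~/hadoopdata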

To update the Hadoop core-site.xml file, you need to set the address of the default file system (the NameNode). You can confirm your system hostname with the command below; for this single-node setup, the configuration uses localhost.
hostname

Then, open the core-site.xml file in the nano editor.
nano $HADOOP_HOME/etc/hadoop/core-site.xml
Configure the core-site.xml file by adding the following lines:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Next, open the hdfs-site.xml file in the nano editor.
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Configure the hdfs-site.xml file by adding the following lines:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoopuser/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoopuser/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
These values point to the namenode and datanode directories created earlier in the hadoopuser home directory.
Open the MapReduce configuration file with the following command:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Configure the mapred-site.xml file by adding the following lines:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Open the YARN configuration file with the following command:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Configure the yarn-site.xml file by adding the following lines:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

Step 8: Running Hadoop Cluster
Format the Hadoop file system by running the following command:
hdfs namenode -format

Start the HDFS and YARN daemons by running the following commands:
start-dfs.sh

start-yarn.sh

Verify that the Hadoop daemons are running with the following command:
jps
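If everything started correctly, the output should list the Hadoop daemons, for example (process IDs will differ on your system):
4321 NameNode
4452 DataNode
4675 SecondaryNameNode
4892 ResourceManager
5014 NodeManager
5333 Jps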

Hadoop serves its web interfaces on ports 9870 and 8088. To reach them from other machines, allow these ports through the firewall using the following commands:
sudo ufw allow 8088
sudo ufw allow 9870
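You can confirm that the rules were added with:
sudo ufw status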


Log out from the hadoopuser account using the following command:
logout
Switch back to the root user using the following command:
sudo -i
To access the Hadoop web interfaces, open your web browser and enter your server's IP address followed by port 9870 for the NameNode UI or port 8088 for the YARN ResourceManager UI.
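For example, assuming your server's IP address is 192.168.1.100, the URLs would be:
http://192.168.1.100:9870 (NameNode web UI)
http://192.168.1.100:8088 (YARN ResourceManager web UI)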

Conclusion
Hadoop is now successfully installed on your Ubuntu 20.04 system. You can start storing and processing your large datasets efficiently.