How to install Hadoop on Ubuntu 20.04?
Hadoop is an open-source framework used to store and process large data sets in a distributed computing environment. It runs on low-cost commodity hardware, making it affordable for businesses. Hadoop has two major core components: HDFS and MapReduce. The Hadoop Distributed File System (HDFS) stores data across multiple servers in a cluster, while MapReduce is a processing engine that enables distributed processing of large data sets across the cluster.
In this tutorial, we'll walk you through the process of installing Hadoop on Ubuntu 20.04.
There are certain prerequisites that need to be met before you begin:
Ubuntu 20.04 server configured on your system
A regular (non-root) user with sudo privileges
Internet connection
Step 1: Installing Java
First, update your system by opening the terminal and running the following commands:
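# Refresh the package lists and apply any available upgrades
sudo apt update
sudo apt upgrade -y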
Install OpenJDK 11 (or any later version of your choice) by running the following command:
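sudo apt install openjdk-11-jdk -y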
Once the installation is complete, let's verify it with the following command:
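java -version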
Step 2: Creating Hadoop User
We will now create a dedicated user using the 'adduser' command.
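# 'hadoopuser' is the username used throughout this tutorial
sudo adduser hadoopuser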
This will prompt you to enter the new user's password, full name, and other relevant information. To confirm that the details entered are correct, type "y" (or "Y").
To switch to the Hadoop user that we have created, which in this case is 'hadoopuser', we need to run the following command:
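su - hadoopuser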
Step 3: Generating Public and Private Key Pairs
Once that's done, we can generate the private and public key pairs using the command below:
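# Generate an RSA key pair for the Hadoop user
ssh-keygen -t rsa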
When prompted, press ENTER to accept the default file location for the key pair. You can leave the passphrase empty so that the Hadoop daemons can log in over SSH without a prompt.
Once the key pairs have been generated, we can add them to the SSH authorized_keys, using the command below:
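# Append the public key (default file name id_rsa.pub) to the authorized keys
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys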
Step 4: Permitting and Authorizing Key Pairs
Since we have saved the key pair to the SSH authorized keys, we will now need to update the file permissions to 640. This will ensure that only the file owner (i.e., us) will have both read and write permissions, while the group will only have read permissions. No permissions will be granted to other users.
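chmod 640 ~/.ssh/authorized_keys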
Authenticate the localhost, using the following command:
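ssh localhost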
Enter yes to continue.
Step 5: Downloading and Installing Hadoop
To download the Hadoop framework, use the wget command shown below. You may download any recent release; here, we use the stable Hadoop version 3.2.4.
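The URL below points to the Apache release archive, which keeps older releases such as 3.2.4; newer releases can be fetched from the mirrors linked on the Apache Hadoop downloads page.

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz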
Once the download is complete, extract the hadoop-3.2.4.tar.gz file using the tar command.
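tar -xvzf hadoop-3.2.4.tar.gz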
You can rename the extracted directory using the command provided below:
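# The rest of this tutorial assumes the extracted directory is renamed to 'hadoop'
mv hadoop-3.2.4 hadoop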
Step 6: Setting up Hadoop
Next, you will need to configure the Java environment variables to set up Hadoop. Start by finding the Java installation path that JAVA_HOME should point to.
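One way to locate it, assuming Java was installed through apt as in Step 1:

# Resolves the symlink behind the 'java' binary and strips the trailing /bin
# Typically prints /usr/lib/jvm/java-11-openjdk-amd64 for OpenJDK 11
dirname $(dirname $(readlink -f $(which java)))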
Open ~/.bashrc" file in the text editor of your choice. Here, we are using nano text editor.
Open the file "~/.bashrc" and add the specified paths.
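nano ~/.bashrc

Then add the following lines at the end of the file. The HADOOP_HOME path assumes Hadoop was extracted into the hadoopuser home directory and renamed to "hadoop" in Step 5:

export HADOOP_HOME=/home/hadoopuser/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"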
Once you've added them, save the changes by pressing CTRL+O, then exit the editor with CTRL+X.
Use the following command to load the new environment variables, including JAVA_HOME, into the current session:
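source ~/.bashrc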
Open the environment variable file for Hadoop and configure the JAVA_HOME variable for the Hadoop environment.
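nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Find the JAVA_HOME line, uncomment it, and set it to the path found earlier. The value below assumes OpenJDK 11 on 64-bit Ubuntu:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64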
Step 7: Configuring Hadoop for Ubuntu
To configure Hadoop properly, you need to create two directories, named "namenode" and "datanode", inside the Hadoop user's home directory, using the following commands:
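One common layout is shown below; the exact paths are a convention, but they must match the hdfs-site.xml settings later in this step:

mkdir -p ~/hadoopdata/hdfs/namenode
mkdir -p ~/hadoopdata/hdfs/datanode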
To update the Hadoop core-site.xml file, you will need to add your hostname. Begin by confirming your system hostname using the command provided below.
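hostname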
Then, open the core-site.xml file in the nano editor.
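nano $HADOOP_HOME/etc/hadoop/core-site.xml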
Configure the core-site.xml file by adding the following lines:
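A minimal single-node configuration is shown below; replace your-hostname with the name printed by the hostname command (port 9000 is the conventional choice for the HDFS endpoint):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://your-hostname:9000</value>
  </property>
</configuration>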
Configure the hdfs-site.xml file by adding the following lines:
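Open the file in the nano editor as before:

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

A replication factor of 1 suits a single-node cluster, and the two directory paths must match the directories created earlier in this step:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoopuser/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoopuser/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>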
Open the MapReduce configuration file with the following command:
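nano $HADOOP_HOME/etc/hadoop/mapred-site.xml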
Configure the mapred-site.xml file by adding the following lines:
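This minimal configuration tells MapReduce jobs to run on YARN:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>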
Open the YARN configuration file with the following command:
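nano $HADOOP_HOME/etc/hadoop/yarn-site.xml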
Configure the yarn-site.xml file by adding the following lines:
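This enables the shuffle service that MapReduce needs on each NodeManager:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>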
Step 8: Running the Hadoop Cluster
Format the Hadoop file system by running the following command:
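hdfs namenode -format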
Start the Hadoop daemons by running the following commands:
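# Start HDFS (NameNode, DataNode, SecondaryNameNode)
start-dfs.sh
# Start YARN (ResourceManager, NodeManager)
start-yarn.sh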
Verify that the Hadoop daemons are running by running the following command:
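# Lists running Java processes; you should see NameNode, DataNode,
# SecondaryNameNode, ResourceManager, and NodeManager
jps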
To enable Hadoop to listen at ports 8088 and 9870, you will need to allow these ports through the firewall, using the following commands:
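# Assuming UFW, Ubuntu's default firewall front end
sudo ufw allow 9870/tcp
sudo ufw allow 8088/tcp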
Note: If you face an error like "user is not in the sudoers file" while running the firewall commands, follow the steps below:
Log out from the current user using the following command:
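exit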
Switch to the root user using the following command:
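su -
# Once logged in as root, one common fix (an addition to the original steps)
# is to add the Hadoop user to the sudo group, then switch back and retry:
usermod -aG sudo hadoopuser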
To access your Hadoop "namenode" and resource manager web interfaces, open your web browser and enter your server's IP address followed by port 9870 or 8088, respectively:
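Replace your-server-ip below with your machine's actual IP address:

http://your-server-ip:9870 (HDFS NameNode web UI)
http://your-server-ip:8088 (YARN ResourceManager web UI)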
Hadoop is now successfully installed. You may start efficiently storing and managing your large datasets.