How to install Hadoop on Ubuntu 20.04?

Overview

Hadoop is an open-source framework used to store and process large data sets in a distributed computing environment. It runs on low-cost commodity hardware, making it affordable for businesses. Hadoop has two major core components: HDFS and MapReduce. The Hadoop Distributed File System (HDFS) stores data across multiple servers in a cluster, while MapReduce is a processing engine that enables distributed processing of large data sets across that cluster.

In this tutorial, we'll walk you through the process of installing Hadoop on Ubuntu 20.04.

Prerequisites

There are certain prerequisites that need to be met before you begin:

  • Ubuntu 20.04 server configured on your system

  • A regular (non-root) user with sudo privileges

  • Internet connection

Get Started

Step 1: Java Installation

  • First, update your system by opening the terminal and running the following commands:

sudo -i 
sudo apt-get update && sudo apt-get upgrade 
  • Install OpenJDK 11 (or a later version of your choice) by running the following command:

sudo apt-get install openjdk-11-jdk 
  • Once the installation is complete, let's verify it with the following command:

java -version

Step 2: Creating Hadoop User

  • We will now create a dedicated user using the 'adduser' command.

sudo adduser hadoopuser 

This will prompt you for the new user's password, full name, and other details. To confirm that the details entered are correct, type "Y".
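Optionally, you can also add the new user to the sudo group at this point so that hadoopuser can run administrative commands later (for example, the firewall rules in Step 8). This is a small optional step that assumes Ubuntu's default "sudo" group:

sudo usermod -aG sudo hadoopuser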

  • To switch to the Hadoop user that we have created, which in this case is 'hadoopuser', we need to run the following command:

su - hadoopuser 

Step 3: Generating Public and Private key-pairs

  • Once that's done, we can generate the private and public key pairs using the command below:

ssh-keygen -t rsa 

When prompted, press ENTER to accept the default file location and leave the passphrase empty, so the Hadoop user can log in over SSH without being asked for a passphrase.
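If you prefer to skip the prompts entirely, the same key can be generated non-interactively. This is an optional equivalent, assuming the default key path of ~/.ssh/id_rsa:

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa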

  • Once the key pairs have been generated, we can add them to the SSH authorized_keys, using the command below:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 

Step 4: Permitting and Authorizing key-pairs

  • Since we have saved the key pair to the SSH authorized keys, we will now need to update the file permissions to 640. This will ensure that only the file owner (i.e., us) will have both read and write permissions, while the group will only have read permissions. No permissions will be granted to other users.

chmod 640 ~/.ssh/authorized_keys 
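To confirm the new permissions took effect, you can list the file; with mode 640 the listing should begin with -rw-r-----:

ls -l ~/.ssh/authorized_keys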
  • Authenticate the localhost, using the following command:

ssh localhost 
  • Enter yes to continue.

Step 5: Downloading and Installing Hadoop

  • To download the Hadoop framework, use the wget command shown below. You may download any recent release; here, we download the stable Hadoop version 3.2.4.

wget https://downloads.apache.org/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz 

  • Once the download is complete, extract the hadoop-3.2.4.tar.gz file using the tar command.

tar -xvzf hadoop-3.2.4.tar.gz 

  • You can rename the extracted directory using the command provided below:

mv hadoop-3.2.4 hadoop 

Step 6: Setting up Hadoop

  • Next, you will need to configure the Java environment variables for Hadoop. Start by finding the path of your Java installation, which will be used as the JAVA_HOME value:

dirname $(dirname $(readlink -f $(which java))) 
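On a default OpenJDK 11 installation, this typically prints a path like the one below; your path may differ depending on the Java build installed:

/usr/lib/jvm/java-11-openjdk-amd64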

  • Open the "~/.bashrc" file in the text editor of your choice. Here, we are using the nano text editor.

nano ~/.bashrc 
  • Add the following environment variables at the end of the file:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/home/hadoopuser/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Once you've added them, press CTRL+O to save the changes and CTRL+X to exit the editor.

  • Use the following command to load the new environment variables into the current session:

source ~/.bashrc 
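As a quick optional check that the variables are in effect and the Hadoop binaries are on your PATH, you can print HADOOP_HOME and ask Hadoop for its version:

echo $HADOOP_HOME
hadoop version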
  • Open Hadoop's environment variable file and configure JAVA_HOME for the Hadoop environment by adding the export line shown below inside the file.

nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh 
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 
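To confirm the change was saved, you can optionally search the file for the line you just added:

grep "^export JAVA_HOME" $HADOOP_HOME/etc/hadoop/hadoop-env.sh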

Step 7: Configuring Hadoop for Ubuntu

  • To configure Hadoop properly, you need to create two directories, "namenode" and "datanode", inside the Hadoop user's home directory, using the following commands:

mkdir -p ~/hadoopdata/hdfs/namenode 
mkdir -p ~/hadoopdata/hdfs/datanode 

  • Next, you will update the Hadoop core-site.xml file with the address of the NameNode. If you prefer to use your system hostname instead of localhost in that address, first confirm it using the command below.

hostname 

  • Then, open the core-site.xml file in the nano editor.

nano $HADOOP_HOME/etc/hadoop/core-site.xml 
  • Configure the core-site.xml file by adding the following lines:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

  • Open the hdfs-site.xml file in the nano editor.

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
  • Configure the hdfs-site.xml file by adding the following lines, pointing the NameNode and DataNode directories at the folders created earlier:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoopuser/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoopuser/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>

  • Open the MapReduce configuration file with the following command:

nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
  • Configure the mapred-site.xml file by adding the following lines:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
  • Open the YARN configuration file with the following command:

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
  • Configure the yarn-site.xml file by adding the following lines:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
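As an optional sanity check that Hadoop is reading the configuration files, you can ask it for the value of fs.defaultFS; it should print the URI set in core-site.xml:

hdfs getconf -confKey fs.defaultFS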

Step 8: Running Hadoop Cluster

  • Format the Hadoop file system by running the following command:

hdfs namenode -format

  • Start the Hadoop daemons by running the following command:

start-dfs.sh 

start-yarn.sh 

  • Verify that the Hadoop daemons are running with the following command:

jps 
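If everything started correctly, the output should list the Hadoop daemons, similar to the example below (the process IDs on your system will differ):

4321 NameNode
4522 DataNode
4789 SecondaryNameNode
5012 ResourceManager
5230 NodeManager
5401 Jps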

  • The NameNode and ResourceManager web interfaces listen on ports 9870 and 8088 respectively. To reach them through the firewall, allow these ports using the following commands:

sudo ufw allow 8088 
sudo ufw allow 9870 
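You can optionally confirm that the firewall rules were added with the following command:

sudo ufw status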

Note: If you face an error like "user is not in the sudoers file", follow the steps below:

  • Log out of the Hadoop user with the following command:

logout  
  • Switch back to the root user (or any user with sudo privileges) with the following command, and run the firewall commands from there:

sudo -i 

To access the Hadoop NameNode web interface, open your web browser and enter your server's IP address followed by port 9870; the YARN ResourceManager interface is available in the same way on port 8088.
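For example, replacing <your-server-ip> with your server's actual IP address or hostname (it is a placeholder, not a literal value):

http://<your-server-ip>:9870
http://<your-server-ip>:8088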

Conclusion

Hadoop is now successfully installed, and you can start storing and processing your large datasets efficiently.
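As a quick, optional smoke test of the new cluster, you can create a home directory in HDFS, copy a small file into it, and list it back. The directory name below simply mirrors the hadoopuser account created earlier:

hdfs dfs -mkdir -p /user/hadoopuser
hdfs dfs -put ~/.bashrc /user/hadoopuser/
hdfs dfs -ls /user/hadoopuser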
