Big Data Framework Foundations: Initial Hadoop Environment Installation and Setup

Keywords: Hadoop, Java, XML, Linux

Hadoop is supported on the GNU/Linux platform (recommended). Therefore, you need a Linux operating system before setting up a Hadoop environment. If your operating system is not Linux, you can install Linux inside VirtualBox and work in that virtual machine.

Pre-installation settings

Before installing Hadoop, you need to enter the Linux environment and connect to Linux using SSH (Secure Shell). Follow the steps provided below to set up the Linux environment.

Create a user

At the beginning, it is recommended to create a separate user for Hadoop in order to isolate the Hadoop file system from the Unix file system. Follow the steps given below to create a user:

  • Use the "su" command to switch to root.
  • From the root account, create a user with the command "useradd username".
  • You can now open an existing user account with the command "su username".

Open the Linux terminal and enter the following commands to create a user.

  $ su
     password:
  # useradd hadoop
  # passwd hadoop
     New passwd:
     Retype new passwd

SSH Settings and Key Generation

SSH setup is required to perform different operations on the cluster, such as starting, stopping, and distributed daemon shell operations. To authenticate different users of Hadoop, a public/private key pair must be generated for the Hadoop user and shared with other users.

The following commands are used to generate a key pair with SSH. Copy the public key from id_rsa.pub into the authorized_keys file, and give the owner read and write permissions on the authorized_keys file.

  $ ssh-keygen -t rsa
  $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  $ chmod 0600 ~/.ssh/authorized_keys
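
This passwordless login is what the Hadoop start/stop scripts rely on. As a quick sanity check (not part of the original steps, and assuming the SSH service is running locally), you can confirm that connecting to the local machine no longer asks for a password:

  $ ssh localhost    # should log in without prompting for a password
  $ exit             # return to the original shell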

Installing Java

Java is the main prerequisite for Hadoop. First, verify that Java exists in the system using the command "java -version". The syntax of the java version command is as follows.

  $ java -version

If everything goes well, it will give the following output.

  java version "1.7.0_71"
  Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
  Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If Java is not installed in the system, follow the steps below to install Java.

Step 1

Download Java (JDK <latest version> - X64.tar.gz) by visiting the following link: http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html.

Then jdk-7u71-linux-x64.tar.gz will be downloaded to your system.

Step 2

Generally, the downloaded Java file will be in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.gz file using the following commands.

  $ cd Downloads/
  $ ls
  jdk-7u71-linux-x64.gz
  $ tar zxf jdk-7u71-linux-x64.gz
  $ ls
  jdk1.7.0_71   jdk-7u71-linux-x64.gz

Step 3

To make Java available to all users, move it to the directory "/usr/local/". Switch to root and type the following commands.

  $ su
  password:
  # mv jdk1.7.0_71 /usr/local/
  # exit

Step 4

To set PATH and JAVA_HOME variables, add the following commands to the ~/.bashrc file.

  export JAVA_HOME=/usr/local/jdk1.7.0_71
  export PATH=$PATH:$JAVA_HOME/bin
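
Changes to ~/.bashrc only take effect in a new shell. A minimal way to apply and check them in the current session (these verification commands are an addition, not part of the original steps) is:

  $ source ~/.bashrc
  $ echo $JAVA_HOME    # should print /usr/local/jdk1.7.0_71
  $ which java         # should resolve to $JAVA_HOME/bin/java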

Now verify the java -version command from the terminal as described above.

Download Hadoop

Download Hadoop 2.4.1 from the Apache Software Foundation and extract it using the following commands.

  $ su
  password:
  # cd /usr/local
  # wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
  # tar xzf hadoop-2.4.1.tar.gz
  # mkdir hadoop
  # mv hadoop-2.4.1/* hadoop/
  # exit
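
If you want to double-check the result before moving on (an extra check, not part of the original steps), listing the target directory should show the usual layout of a Hadoop 2.x binary distribution, typically something like:

  # ls /usr/local/hadoop
  LICENSE.txt  NOTICE.txt  README.txt  bin  etc  include  lib  libexec  sbin  share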

Hadoop operation mode

After downloading Hadoop, you can operate the Hadoop cluster in one of the three supported modes:

  • Local/stand-alone mode: after downloading Hadoop into the system, it is configured in stand-alone mode by default and runs as a single Java process.
  • Pseudo-distributed mode: this is a distributed simulation on a single machine. Each Hadoop daemon, such as hdfs, yarn, MapReduce, etc., runs as a separate Java process. This mode is very useful for development.
  • Fully distributed mode: this mode is a fully distributed cluster of two or more machines. We use this mode in later chapters.

Install Hadoop in stand-alone mode

Here, we will discuss the installation of Hadoop 2.4.1 in stand-alone mode.

In this mode there are no daemons running; everything runs in a single JVM. Stand-alone mode is suitable for running MapReduce programs during development, because it is easy to test and debug them.

Setting up Hadoop

Hadoop environment variables can be set in the ~/.bashrc file by appending the following command.

  export HADOOP_HOME=/usr/local/hadoop
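
If the hadoop command is not found in the next step, it usually means $HADOOP_HOME/bin is not yet on the PATH; one small fix (an extra line, not shown in the original) is to append it to ~/.bashrc as well and re-source the file:

  export PATH=$PATH:$HADOOP_HOME/bin
  $ source ~/.bashrc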

Before we go any further, we need to make sure that Hadoop works properly. Issue the following command:

  $ hadoop version

If everything is set up properly, you should see the following results:

  Hadoop 2.4.1
  Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
  Compiled by hortonmu on 2013-10-07T06:28Z
  Compiled with protoc 2.5.0
  From source with checksum 79e53ce7994d1628b240f09af91e1af4

This means that Hadoop works well in stand-alone mode. By default, Hadoop is configured to run on a single machine in a non-distributed mode.

Example

Let's look at a simple example of Hadoop in action. The Hadoop installation provides the following example MapReduce jar file, which implements basic MapReduce functionality and can be used for computations such as estimating the value of Pi, counting the words in a set of files, and so on.

  $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar

Let's have an input directory into which we push a few files; the requirement is to count the total number of words in those files. To calculate the total number of words, we do not need to write our own MapReduce program, because the provided .jar file already contains an implementation of word count. You can try other examples using the same .jar file; issue the following command to list the MapReduce programs offered by the hadoop-mapreduce-examples-2.4.1.jar file.

  $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar
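
For instance, one of the listed programs is pi, which estimates the value of Pi mentioned above. A sample invocation (not part of the original walk-through; the two arguments are the number of map tasks and the number of samples per map) looks like this:

  $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar pi 2 5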

Step 1

Create temporary content files in the input directory. You can create this input directory anywhere you would like to work.

  $ mkdir input
  $ cp $HADOOP_HOME/*.txt input
  $ ls -l input

It gives the following files in the input directory:

  total 24
  -rw-r--r-- 1 root root 15164 Feb 21 10:14 LICENSE.txt
  -rw-r--r-- 1 root root   101 Feb 21 10:14 NOTICE.txt
  -rw-r--r-- 1 root root  1366 Feb 21 10:14 README.txt

These files have been copied from the Hadoop installation home directory. For experimentation, you can have different large file sets.

Step 2

Let's start the Hadoop process to count the total number of words in all files available in the input directory, as follows:

  $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar wordcount input output

Step 3

Step 2 will do the necessary processing and save the output in the output/part-r-00000 file, which can be checked using:

  $ cat output/*

It lists all the words along with their total counts across all the files in the input directory.

  "AS             4
  "Contribution"  1
  "Contributor"   1
  "Derivative     1
  "Legal          1
  "License"       1
  "License");     1
  "Licensor"      1
  "NOTICE"        1
  "Not            1
  "Object"        1
  "Source"        1
  "Work"          1
  "You"           1
  "Your")         1
  "[]"            1
  "control"       1
  "printed        1
  "submitted"     1
  (50%)           1
  (BIS),          1
  (C)             1
  (Don't)         1
  (ECCN)          1
  (INCLUDING      2
  (INCLUDING,     2
  .............

Installing Hadoop in pseudo-distributed mode

Follow the steps below to install Hadoop 2.4.1 in pseudo-distributed mode.

Step 1: Set up Hadoop

You can set the Hadoop environment variables in the ~/.bashrc file by appending the following commands.

  export HADOOP_HOME=/usr/local/hadoop
  export HADOOP_MAPRED_HOME=$HADOOP_HOME
  export HADOOP_COMMON_HOME=$HADOOP_HOME
  export HADOOP_HDFS_HOME=$HADOOP_HOME
  export YARN_HOME=$HADOOP_HOME
  export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
  export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
  export HADOOP_INSTALL=$HADOOP_HOME

Now apply all the changes to the currently running system.

  $ source ~/.bashrc

Step 2: Hadoop configuration

You can find all the Hadoop configuration files under "$HADOOP_HOME/etc/hadoop". These configuration files need to be changed according to your Hadoop infrastructure.

  $ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs in Java, you have to reset the Java environment variables in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.

  export JAVA_HOME=/usr/local/jdk1.7.0_71
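
One way to make this change without opening an editor (just a convenience sketch, not part of the original steps, and assuming the JDK path used above) is a sed one-liner run from the configuration directory:

  $ cd $HADOOP_HOME/etc/hadoop
  $ sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/local/jdk1.7.0_71|' hadoop-env.sh
  $ grep JAVA_HOME hadoop-env.sh    # confirm the new value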

Following is a list of files that must be edited to configure Hadoop.

core-site.xml

The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing data, and the size of the read/write buffers.

Open core-site.xml and add the following properties between the <configuration> and </configuration> tags.

  <configuration>
     <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
     </property>
  </configuration>

hdfs-site.xml

The hdfs-site.xml file contains information such as the replication value of the data, the namenode path, and the datanode path on the local file system. It defines where you want to store the Hadoop infrastructure.

Let's assume the following data.

  dfs.replication (data replication value) = 1

  (In the path given below, /hadoop/ is the user name.
  hadoopinfra/hdfs/namenode is the directory created by the hdfs file system.)
  namenode path = //home/hadoop/hadoopinfra/hdfs/namenode

  (hadoopinfra/hdfs/datanode is the directory created by the hdfs file system.)
  datanode path = //home/hadoop/hadoopinfra/hdfs/datanode

Open the file and add the following attributes between the <configuration></configuration> tags in the file.

  <configuration>
     <property>
        <name>dfs.replication</name>
        <value>1</value>
     </property>
     <property>
        <name>dfs.name.dir</name>
        <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
     </property>
     <property>
        <name>dfs.data.dir</name>
        <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
     </property>
  </configuration>

Note: In the above file, all attribute values are user-defined and can be changed according to your Hadoop infrastructure.
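
The namenode directory is created later when the namenode is formatted, and the datanode directory when the datanode starts. If you prefer to create them explicitly up front, a quick sketch (not part of the original steps, and assuming you are logged in as the hadoop user so that ~ is /home/hadoop) is:

  $ mkdir -p ~/hadoopinfra/hdfs/namenode
  $ mkdir -p ~/hadoopinfra/hdfs/datanode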

yarn-site.xml

This file is used to configure yarn in Hadoop. Open the yarn-site.xml file and add the following attributes between the <configuration></configuration> tags in the file.

  <configuration>
     <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
     </property>
  </configuration>

mapred-site.xml

This file is used to specify which MapReduce framework we are using. By default, Hadoop only ships a template of this file (mapred-site.xml.template). First, copy it to mapred-site.xml using the following command.

  $ cp mapred-site.xml.template mapred-site.xml

Open the mapred-site.xml file and add the following attributes between the <configuration></configuration> tags in the file.

  <configuration>
     <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
     </property>
  </configuration>

Verify Hadoop installation

The following steps are used to verify the Hadoop installation.

Step 1: Name node settings

Set up the namenode using the command "hdfs namenode -format" as follows.

  $ cd ~
  $ hdfs namenode -format

The expected result is as follows:

  10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
  /************************************************************
  STARTUP_MSG: Starting NameNode
  STARTUP_MSG:   host = localhost/192.168.1.11
  STARTUP_MSG:   args = [-format]
  STARTUP_MSG:   version = 2.4.1
  ...
  ...
  10/24/14 21:30:56 INFO common.Storage: Storage directory
  /home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
  10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to
  retain 1 images with txid >= 0
  10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
  10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
  /************************************************************
  SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
  ************************************************************/

Step 2: Verify Hadoop's DFS

The following command is used to start DFS. Executing this command will start the Hadoop file system.

  $ start-dfs.sh

The expected output is as follows:

  10/24/14 21:37:56
  Starting namenodes on [localhost]
  localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-namenode-localhost.out
  localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-localhost.out
  Starting secondary namenodes [0.0.0.0]

Step 3: Verify the Yarn script

The following command is used to start the yarn script. Executing this command will start the yarn daemon.

  $ start-yarn.sh

The expected output is as follows:

  starting yarn daemons
  starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
  localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-localhost.out
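
At this point all the daemons should be up. A quick way to confirm this (an extra check, not part of the original steps) is the JDK's jps tool, which lists running Java processes; the process IDs below are only illustrative, but you should see entries similar to:

  $ jps
  4856 NameNode
  4997 DataNode
  5210 SecondaryNameNode
  5400 ResourceManager
  5532 NodeManager
  5720 Jps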

Step 4: Access Hadoop in the browser

The default port number to access Hadoop is 50070. Use the following URL to get the Hadoop services in your browser.

  http://localhost:50070/

Step 5: Verify the cluster of all applications

The default port number for accessing all applications in the cluster is 8088. Use the following URL to access the service.

  http://localhost:8088/


Posted by welshmike on Fri, 26 Jul 2019 21:10:39 -0700