Start/Stop Hadoop Services (CDH4)

Introduction

Starting and stopping services is a common part of Hadoop administration.

HBase Region Server

Start a Region Server

sudo service hbase-regionserver start

Stop a Region Server

sudo service hbase-regionserver stop

Check the Status of a Region Server

sudo service hbase-regionserver status

Install Zookeeper for Cloudera Hadoop (CDH4) in Pseudo mode on Ubuntu 12.04 LTS

Introduction

Zookeeper provides distributed coordination and cluster management services for Hadoop.

In this post, I’m going to install Zookeeper in Pseudo mode, so please use these instructions for setting up a developer’s workstation, not for a production cluster.

Installation

The zookeeper package should already be installed, but we’ll double check.

sudo apt-get install zookeeper

Next, we’ll install the Zookeeper Server.

sudo apt-get install zookeeper-server

The following files are now installed:
/etc/zookeeper/conf/zoo.cfg: the Zookeeper configuration file

Restart the Zookeeper server to confirm that the init script works:

sudo service zookeeper-server stop
sudo service zookeeper-server start

If you have installed Zookeeper before installing HBase, you will see the following error message:

Using config: /etc/zookeeper/conf/zoo.cfg
ZooKeeper data directory is missing at /var/lib/zookeeper fix the path or run initialize
invoke-rc.d: initscript zookeeper-server, action "start" failed.

You need to initialize Zookeeper yourself when it is installed before HBase.

sudo service zookeeper-server init

Now you can start Zookeeper.

sudo service zookeeper-server start
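
To confirm that Zookeeper is up, you can send it the four-letter ruok command (this assumes nc is installed and Zookeeper is listening on the default client port, 2181):

echo ruok | nc localhost 2181

A healthy server replies with imok.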


HBase 0.92.1 Files and Directories (CDH4)

Introduction

You will need to know the location of binaries, configuration files, and libraries when working with HBase.

Directories

Configuration

/etc/hbase/conf is the location for all of HBase’s configuration files.

HBase uses Debian Alternatives, so there are a number of symlinks to the configuration files.

/etc/hbase/conf is a symlink to /etc/alternatives/hbase-conf.
/etc/alternatives/hbase-conf is a symlink to /etc/hbase/conf.dist
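
You can inspect this alternatives chain directly with update-alternatives (the link group name hbase-conf matches the symlink above):

update-alternatives --display hbase-conf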

Logs

/var/log/hbase contains all of the HBase log files.

Files

Configuration Files

The following configuration files are located in /etc/hbase/conf

hadoop-metrics.properties

Configures how HBase emits metrics (for example, to files or to Ganglia).

hbase-env.sh

Sets environment variables for the HBase daemons, such as JAVA_HOME and the JVM heap size.

hbase-policy.xml

Defines service-level authorization policies for HBase when security is enabled.

hbase-site.xml

The main HBase configuration file, containing site-specific property overrides.

log4j.properties

Controls logging levels and log output for the HBase daemons.

regionservers

Lists the hosts that run a Region Server, one hostname per line.

Zookeeper 3.4.3 Files and Directories (CDH4)

Introduction

You will need to know the location of binaries, configuration files, and libraries when working with Zookeeper.

Zookeeper 3.4.3 is a part of Cloudera Distribution Hadoop (CDH4).

Directories

/etc/zookeeper/conf

/etc/zookeeper/conf is the location for all of Zookeeper’s configuration files.

Zookeeper uses Debian Alternatives, so there are a number of symlinks to the configuration files.

/etc/zookeeper/conf is a symlink to /etc/alternatives/zookeeper-conf.
/etc/alternatives/zookeeper-conf is a symlink to /etc/zookeeper/conf.dist

Files

Configuration Files

The following configuration files are located in /etc/zookeeper/conf

configuration.xsl

log4j.properties

zoo.cfg

zoo.cfg is the main Zookeeper configuration file.

dataDir

dataDir specifies the directory where znode snapshot files and transaction logs are stored. These files are important as you will need them to recover data.

The files located in dataDir should be backed up regularly.
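
A rough sketch of the relevant zoo.cfg entries on a default CDH4 install (the dataDir path is the CDH4 default; the other values are the stock Zookeeper defaults, so check your own file):

# /etc/zookeeper/conf/zoo.cfg (excerpt)
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181

A backup can then be as simple as archiving that directory (the destination path below is only an example):

sudo tar czf /var/backups/zookeeper-data-$(date +%F).tar.gz /var/lib/zookeeper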

zoo_sample.cfg

A sample configuration file. One of the more interesting notes is about the autopurge.snapRetainCount configuration variable (http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance).
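
As a sketch, automatic purging can be enabled by adding the following two settings to zoo.cfg (retain the three most recent snapshots and purge every 24 hours; both values are illustrative):

autopurge.snapRetainCount=3
autopurge.purgeInterval=24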

Init Files

zookeeper-server

Use the init script to start, stop, restart, check the status of, and initialize Zookeeper.

Binaries and Scripts

/usr/lib/zookeeper/bin/zkCleanup.sh

Script that cleans up the files created in dataDir. This script should be modified per installation and should be added to cron for periodic cleanup.
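
A sketch of how this could be scheduled (the schedule and the -n 5 retention argument are assumptions; check the script's usage on your installation before relying on it):

# /etc/cron.d/zookeeper-cleanup (hypothetical file)
0 3 * * * root /usr/lib/zookeeper/bin/zkCleanup.sh -n 5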

What is sed?

Introduction

sed is short for Stream EDitor, a utility that allows you to parse and transform text one line at a time. Along with grep and awk, sed is a useful tool when manipulating text files. It is also often overlooked when working with Hadoop, although sed, awk, and grep can help speed up processing times by preprocessing text before sending it to a MapReduce job.
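
As a simple illustration of that kind of preprocessing (the file names below are hypothetical), the following removes blank lines and converts a pipe delimiter to tabs before the data is put into HDFS:

sed -e '/^$/d' -e 's/|/\t/g' raw_events.log > clean_events.log
hadoop fs -put clean_events.log input/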

Change the Hadoop MapReduce v2 (YARN) ShuffleHandler Port

Introduction

If you are running Hadoop on a development machine, then it’s likely that you’ll run into a situation where multiple services require port 8080. I recently ran into this issue where both the Pentaho User Console and the Hadoop MapReduce ShuffleHandler were trying to use port 8080.

One solution is to change the port used by the Hadoop MapReduce ShuffleHandler, which is what I’m going to configure below.

Configuration

sudo vi /etc/hadoop/conf/mapred-site.xml

Add the following property just before the closing </configuration> element, using a free port; 8081 is used here as an example.

  <property>
    <name>mapreduce.shuffle.port</name>
    <value>8081</value>
    <description>Port that the ShuffleHandler will run on, changed from the default of 8080 to avoid the conflict. ShuffleHandler is a service run at the NodeManager to facilitate transfers of intermediate Map outputs to requesting Reducers.</description>
  </property>

Then restart the YARN daemons.
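
On the pseudo-distributed setup used in these posts, that means restarting the NodeManager (which hosts the ShuffleHandler) and, to be safe, the ResourceManager:

sudo service hadoop-yarn-nodemanager restart
sudo service hadoop-yarn-resourcemanager restart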

Install Sqoop 1.4.1 for Cloudera Hadoop (CDH4) on Ubuntu 12.04 LTS

Introduction

Sqoop is a tool to import data from an SQL database into Hadoop and/or export data from Hadoop into an SQL database.

Sqoop can import/export from HDFS, HBase and Hive.

It’s extremely common to use SQL databases alongside Hadoop. Often, a SQL database will serve as an upstream data source, such as the persistence layer for an MQ server, and as a downstream repository, such as a data mart in a BI reporting layer.
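
To make that concrete, a typical Sqoop import of a single table into HDFS looks roughly like the following (the JDBC URL, credentials, and table name are hypothetical):

sqoop import \
  --connect jdbc:mysql://dbserver/salesdb \
  --username report_user -P \
  --table orders \
  --target-dir /user/akbar/orders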

Installation

First, we’re going to install MapReduce 1 (MRv1) and the Hadoop client, as these are dependencies for Sqoop.

After these two packages are installed, we will need to verify that MRv2 (YARN), and not MRv1, is the framework actually being used.

sudo apt-get install hadoop-client hadoop-0.20-mapreduce
sudo apt-get install sqoop

The sqoop configuration files are installed into the following directory:
/etc/sqoop/conf which is a symlink to /etc/sqoop/conf.dist

To use Sqoop with YARN (MRv2), we need to verify that the HADOOP_MAPRED_HOME environment variable is set to the correct path.

There are three places where we should verify this variable.

grep HADOOP_MAPRED_HOME /etc/default/hadoop

The output should be:
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

grep HADOOP_MAPRED_HOME /etc/default/hadoop-mapreduce-historyserver

The output should be:
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

grep HADOOP_MAPRED_HOME /etc/hadoop/conf/hadoop-env.sh

The output should be:
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

Lastly, I recommend setting HADOOP_MAPRED_HOME as a system-wide environment variable to make development easier for the software engineers. This assumes you’re only going to use YARN; if you’re using MRv1, don’t set this variable.

sudo bash -c 'echo HADOOP_MAPRED_HOME=\"/usr/lib/hadoop-mapreduce\" >> /etc/environment'
source /etc/environment
echo $HADOOP_MAPRED_HOME

Finally, we’ll verify that the environment variable is correctly set for the sudo user.

sudo env | grep HADOOP_MAPRED_HOME

Verify the sqoop installation

sqoop version

The output should include:
Sqoop 1.4.1-cdh4.0.0

Additional Reading

http://archive.cloudera.com/cdh4/cdh/4/sqoop/SqoopUserGuide.html

Install Cloudera Hadoop (CDH4) with YARN (MRv2) in Pseudo mode on Ubuntu 12.04 LTS

Introduction

These instructions cover a manual installation of the Cloudera CDH4 packages on Ubuntu 12.04 LTS and are based on my following the Cloudera CDH4 Quick Start Guide (CDH4_Quick_Start_Guide_4.0.0.pdf).

Installation prerequisites

sudo apt-get install curl

Verify that Java is installed correctly

First, check that Java is setup correctly for your account.

echo $JAVA_HOME

The output should be:
"/usr/lib/jvm/jdk1.6.0_31"

Next, check that the JAVA_HOME environment variable is setup correctly for the sudo user.

sudo env | grep JAVA_HOME

The output should be:
JAVA_HOME="/usr/lib/jvm/jdk1.6.0_31"

Download the CDH4 package

cd ~/Downloads
mkdir cloudera
cd cloudera

The download URL below is long and may wrap, so make sure you copy the entire link when pasting it into the terminal.

wget http://archive.cloudera.com/cdh4/one-click-install/precise/amd64/cdh4-repository_1.0_all.deb

Install the CDH4 package

sudo dpkg -i cdh4-repository_1.0_all.deb

Install the Cloudera Public GPG Key

curl -s \
http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key \
| sudo apt-key add -
sudo apt-get update

Install CDH4 with YARN in Pseudo mode

sudo apt-get install hadoop-conf-pseudo

This one command will install a large number of packages:

  • bigtop-jsvc
  • bigtop-utils
  • hadoop
  • hadoop-conf-pseudo
  • hadoop-hdfs
  • hadoop-hdfs-datanode
  • hadoop-hdfs-namenode
  • hadoop-hdfs-secondarynamenode
  • hadoop-mapreduce
  • hadoop-mapreduce-historyserver
  • hadoop-yarn
  • hadoop-yarn-nodemanager
  • hadoop-yarn-resourcemanager
  • zookeeper

View the installed files

It’s good practice to view the list of files installed by each package. Specifically, this is a good method to learn about all of the available configuration files.

dpkg -L hadoop-conf-pseudo

Included in the list of files displayed by dpkg are the configuration files (and some other files):

/etc/hadoop/conf.pseudo/yarn-site.xml
/etc/hadoop/conf.pseudo/log4j.properties
/etc/hadoop/conf.pseudo/hdfs-site.xml
/etc/hadoop/conf.pseudo/hadoop-metrics.properties
/etc/hadoop/conf.pseudo/mapred-site.xml
/etc/hadoop/conf.pseudo/README
/etc/hadoop/conf.pseudo/hadoop-env.sh
/etc/hadoop/conf.pseudo/core-site.xml

Format the HDFS filesystem

sudo -u hdfs hdfs namenode -format

I received one warning message when I formatted the HDFS filesystem:
WARN common.Util: Path /var/lib/hadoop-hdfs/cache/hdfs/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
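
The warning is harmless, but if you want to silence it you can specify the name directory as a file:// URI in /etc/hadoop/conf/hdfs-site.xml (a sketch that assumes the default pseudo-distributed path from the warning above):

  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///var/lib/hadoop-hdfs/cache/hdfs/dfs/name</value>
  </property>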

Start HDFS

cd
mkdir bin
cd bin
vi hadoop-hdfs-start

Paste the following code in hadoop-hdfs-start:

#!/bin/bash
# Start every installed hadoop-hdfs-* service (namenode, datanode, secondarynamenode)
for service in /etc/init.d/hadoop-hdfs-*
do
  sudo $service start
done

chmod +x hadoop-hdfs-start
./hadoop-hdfs-start

Open the NameNode web console at http://localhost:50070.

About the HDFS filesystem

The commands in the sections below are for creating directories in the HDFS filesystem. Importantly, the HDFS directory structure is not the same as the directory structure in ext4 (i.e. your main Linux directory structure).

To work with the HDFS directory structure, you basically prefix standard Linux filesystem commands with sudo -u hdfs hadoop fs (for example, sudo -u hdfs hadoop fs -ls). Therefore, you will likely find it useful to create a .bash_aliases file that provides an easier way to type these commands.

I have created a sample .bash_aliases file in the following post: Create a .bash_aliases file

I have used my aliases in the setup steps below as they’re easier to type. For example, I use shls instead of sudo -u hdfs hadoop fs -ls.

Create the HDFS /tmp directory

You don’t need to remove an old /tmp directory if this is the first time you’re installing Hadoop, but I’ll include the command here for completeness.

sudo -u hdfs hadoop fs -rm -r /tmp

Let’s create a new /tmp directory in HDFS.

shmkdir /tmp

Next, we’ll update the permissions on /tmp in HDFS.

shchmod -R 1777 /tmp

Create a user directory

Since this is a setup for development, we will only create one user directory. However, for a cluster or multi-user environment, you should create one user directory per MapReduce user.

Change akbar below to your username.

shmkdir /user/akbar
shchown akbar:akbar /user/akbar

Create the /var/log/hadoop-yarn directory

shmkdir /var/log/hadoop-yarn
shchown yarn:mapred /var/log/hadoop-yarn

Create the staging directory

The Hadoop -mkdir command defaults to -p (create parent directory).

shmkdir /tmp/hadoop-yarn/staging
shchmod -R 1777 /tmp/hadoop-yarn/staging

Create the done_intermediate directory

shmkdir /tmp/hadoop-yarn/staging/history/done_intermediate
shchmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate
shchown -R mapred:mapred /tmp/hadoop-yarn/staging

Verify that the directory structure is setup correctly

shls -R /

The output should look like:

drwxrwxrwt - hdfs supergroup 0 2012-06-25 15:11 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-06-25 15:11 /tmp/hadoop-yarn
drwxrwxrwt - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging
drwxr-xr-x - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging/history
drwxrwxrwt - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x - hdfs supergroup 0 2012-06-25 15:09 /user
drwxr-xr-x - akbar akbar 0 2012-06-25 15:09 /user/akbar
drwxr-xr-x - hdfs supergroup 0 2012-06-25 13:42 /var
drwxr-xr-x - hdfs supergroup 0 2012-06-25 13:42 /var/log
drwxr-xr-x - yarn mapred 0 2012-06-25 13:42 /var/log/hadoop-yarn

Start YARN

sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-nodemanager start
sudo service hadoop-mapreduce-historyserver start

I get the following error message when I start the MR History Server:
chown: changing ownership of `/var/log/hadoop-mapreduce': Operation not permitted

However, this error is not significant and can be ignored. It’ll likely be fixed in an update to CDH4/Hadoop.

Run an example application with YARN

Important
We are going to run the sample YARN app as our regular user, so we’ll use the Hadoop aliases, such as hls, and not the sudo Hadoop aliases, such as shls.

In the first command, we’ll enter just the directory name input. However, you’ll notice that the directory is automatically created under our user directory, as /user/akbar/input.

hmkdir input

Let’s view our new directory in HDFS.

hls

Or, you can optionally specify the complete path.

hls /user/akbar

Next, we’ll put some files into the HDFS /user/akbar/input directory.

hadoop fs -put /etc/hadoop/conf/*.xml input

Set the HADOOP_MAPRED_HOME environment variable for the current user in our current session.

export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

Next, we’ll run the sample job, which is a simple grep of the files in the input directory and writes the results to the output directory. It’s worth noting that the .jar file is located on the local (physical) filesystem, while the input and output directories are in the HDFS filesystem.

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
grep input output 'dfs[a-z.]+'
hls
hls output
hcat output/part-r-00000

If you have a longer file to view, piping into less will prove helpful, such as:

hcat output/part-r-00000 | less

How to start the various Hadoop services

The following are some additional notes I took on starting the various Hadoop services (in the correct order).

Start the Hadoop namenode

sudo service hadoop-hdfs-namenode start

Start the Hadoop datanode service

sudo service hadoop-hdfs-datanode start

Start the Hadoop secondarynamenode service

sudo service hadoop-hdfs-secondarynamenode start

Start the Hadoop resourcemanager service

sudo service hadoop-yarn-resourcemanager start

Start the Hadoop nodemanager service

sudo service hadoop-yarn-nodemanager start

Start the Hadoop historyserver service

sudo service hadoop-mapreduce-historyserver start
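
If you start these services often, the same commands can be wrapped in a small script in ~/bin, in the same spirit as hadoop-hdfs-start earlier in this post (the file name and layout are up to you):

#!/bin/bash
# Start the Hadoop services in dependency order: HDFS first, then YARN, then the history server
for service in hadoop-hdfs-namenode hadoop-hdfs-datanode hadoop-hdfs-secondarynamenode \
               hadoop-yarn-resourcemanager hadoop-yarn-nodemanager hadoop-mapreduce-historyserver
do
  sudo service $service start
done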

Create a .bash_aliases file

Introduction

This is my personal .bash_aliases file that is mainly used for Cloudera CDH4 (Hadoop) and Pentaho. As a result, many of my aliases are specific to these software packages.

I plan to update this post as my .bash_aliases file expands. I will also push my .bash_aliases file into Git to make it easier to keep up with changes to the file.

How to create a .bash_aliases file

vi ~/.bash_aliases

Paste the following into the file.


# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Personal: ~/.bash_aliases
# Akbar S. Ahmed
#
# Last modified: 2012.06.25
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# -----------------------------------------------
# General
# -----------------------------------------------

alias c='clear' # Clear the screen
alias df='df -Th' # Disk free space
alias du='du -h' # Disk usage
alias h='history' # Bash history
alias j='jobs -l' # Current running jobs

# -----------------------------------------------
# ls
# -----------------------------------------------

alias lx='ls -lXB' # Sort by extension
alias lk='ls -lSr' # Sort by size (small to big)
alias lc='ls -ltcr' # Sort by change time (old to new)
alias lu='ls -ltur' # Sort by access time (old to new)
alias lt='ls -ltr' # Sort by date (old to new)

# -----------------------------------------------
# Hadoop Admin (sudo)
# -----------------------------------------------

alias shcat='sudo -u hdfs hadoop fs -cat' # Output a file to standard out
alias shchown='sudo -u hdfs hadoop fs -chown' # Change ownership
alias shchmod='sudo -u hdfs hadoop fs -chmod' # Change permissions
alias shls='sudo -u hdfs hadoop fs -ls' # List files
alias shmkdir='sudo -u hdfs hadoop fs -mkdir' # Make a directory

# -----------------------------------------------
# Hadoop (regular user)
# -----------------------------------------------

alias hcat='hadoop fs -cat' # Output a file to standard out
alias hchown='hadoop fs -chown' # Change ownership
alias hchmod='hadoop fs -chmod' # Change permissions
alias hls='hadoop fs -ls' # List files
alias hmkdir='hadoop fs -mkdir' # Make a directory

Save the file, then load the aliases into your current shell:

source ~/.bash_aliases