HBase 0.92.1 Files and Directories (CDH4)


You will need to know the location of binaries, configuration files, and libraries when working with HBase.



/etc/hbase/conf is the location for all of HBase’s configuration files.

HBase uses Debian Alternatives, so there are a number of symlinks to the configuration files.

/etc/hbase/conf is a symlink to /etc/alternatives/hbase-conf.
/etc/alternatives/hbase-conf is a symlink to /etc/hbase/conf.dist.
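To see a symlink chain like the one above on your own machine, a small helper can print every hop. This is a minimal sketch: show_link_chain is a hypothetical name of mine, not something shipped with HBase or CDH4, and it relies only on readlink and dirname.

```shell
# Print every hop of a symlink chain, then the final non-link path.
# The hop limit guards against accidental symlink loops.
show_link_chain() {
    path=$1
    hops=0
    while [ -L "$path" ] && [ "$hops" -lt 16 ]; do
        target=$(readlink "$path")
        printf '%s -> %s\n' "$path" "$target"
        case $target in
            /*) path=$target ;;                      # absolute target
            *)  path=$(dirname "$path")/$target ;;   # relative to the link's directory
        esac
        hops=$((hops + 1))
    done
    printf '%s\n' "$path"
}

# Example (on a CDH4 host): show_link_chain /etc/hbase/conf
```

If you only care about the final target, readlink -f /etc/hbase/conf (GNU coreutils) prints it directly.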


/var/log/hbase contains all of the HBase log files.


Configuration Files

The following configuration files are located in /etc/hbase/conf














Zookeeper 3.4.3 Files and Directories (CDH4)


You will need to know the location of binaries, configuration files, and libraries when working with Zookeeper.

Zookeeper 3.4.3 is a part of Cloudera Distribution Hadoop (CDH4).



/etc/zookeeper/conf is the location for all of Zookeeper’s configuration files.

Zookeeper uses Debian Alternatives, so there are a number of symlinks to the configuration files.

/etc/zookeeper/conf is a symlink to /etc/alternatives/zookeeper-conf.
/etc/alternatives/zookeeper-conf is a symlink to /etc/zookeeper/conf.dist.


Configuration Files

The following configuration files are located in /etc/zookeeper/conf




zoo.cfg is the main Zookeeper configuration file.


dataDir specifies the directory where znode snapshot files and transaction logs are stored. These files are important as you will need them to recover data.

The files located in dataDir should be backed up regularly.
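One way to back up dataDir is to read its location from zoo.cfg and archive it with tar. The sketch below is only an illustration: zk_data_dir and backup_dir are hypothetical helper names, the archive naming is my own choice, and it assumes the transaction logs live under dataDir (if dataLogDir is set in zoo.cfg, the logs are stored there and need their own backup).

```shell
# Print the dataDir value from a zoo.cfg-style file (last definition wins).
zk_data_dir() {
    sed -n 's/^[[:space:]]*dataDir[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1
}

# Archive directory SRC into DEST_DIR as a timestamped .tar.gz
# and print the path of the archive that was written.
backup_dir() {
    src=$1
    dest_dir=$2
    archive="$dest_dir/zookeeper-data-$(date +%Y%m%d-%H%M%S).tar.gz"
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" &&
        printf '%s\n' "$archive"
}

# Example (the backup directory is a placeholder):
#   backup_dir "$(zk_data_dir /etc/zookeeper/conf/zoo.cfg)" /path/to/backups
```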


A sample configuration file. One of the more interesting notes is about the autopurge.snapRetainCount configuration variable (http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance).
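To check how autopurge is configured in a given zoo.cfg, something like the following sketch may help. The function names are hypothetical. The fallback values reflect the defaults described in the Zookeeper administrator's guide (snapRetainCount of 3, and a purgeInterval of 0, which means automatic purging is off).

```shell
# Print the value of KEY from a zoo.cfg-style file, or DEFAULT if it is absent.
# Note: dots in KEY are treated as regex wildcards, which is fine for a sketch.
zk_setting() {
    value=$(sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1)
    printf '%s\n' "${value:-$3}"
}

# Summarize the autopurge settings of the given configuration file.
autopurge_report() {
    retain=$(zk_setting "$1" autopurge.snapRetainCount 3)
    interval=$(zk_setting "$1" autopurge.purgeInterval 0)
    if [ "$interval" -gt 0 ]; then
        printf 'autopurge every %s hour(s), keeping %s snapshots\n' "$interval" "$retain"
    else
        printf 'autopurge disabled (purgeInterval=%s)\n' "$interval"
    fi
}

# Example: autopurge_report /etc/zookeeper/conf/zoo.cfg
```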

Init Files


Use the init script to start, stop, or restart Zookeeper, to check its status, and to initialize it.

Binaries and Scripts


Script that cleans up the files created in dataDir. This script should be modified per installation and should be added to cron for periodic cleanup.

Install Cloudera Hadoop (CDH4) with YARN (MRv2) in Pseudo mode on Ubuntu 12.04 LTS


These instructions cover a manual installation of the Cloudera CDH4 packages on Ubuntu 12.04 LTS and are based on my following the Cloudera CDH4 Quick Start Guide (CDH4_Quick_Start_Guide_4.0.0.pdf).

Installation prerequisites

sudo apt-get install curl

Verify that Java is installed correctly

First, check that Java is set up correctly for your account.


The output should be:

Next, check that the JAVA_HOME environment variable is set up correctly for the sudo user.

sudo env | grep JAVA_HOME

The output should be:
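The exact path depends on where Java is installed on your machine. As a rough sanity check, the sketch below tests whether a given JAVA_HOME value actually contains an executable bin/java. The function name check_java_home is my own and is not part of CDH4.

```shell
# Succeed if the given directory looks like a usable JAVA_HOME.
check_java_home() {
    home=$1
    if [ -z "$home" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$home/bin/java" ]; then
        echo "no executable found at $home/bin/java" >&2
        return 1
    fi
    echo "JAVA_HOME looks usable: $home"
}

# Example for your own account:
#   check_java_home "$JAVA_HOME"
# Example for the sudo environment:
#   check_java_home "$(sudo env | sed -n 's/^JAVA_HOME=//p')"
```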

Download the CDH4 package

cd ~/Downloads
mkdir cloudera
cd cloudera

There was no good way of wrapping the link below, so I added an HTML link to the post. This way you can right-click the link and click Copy Link Location, which you can then use to paste into the terminal.

wget http://archive.cloudera.com/cdh4/one-click-install/precise/amd64/cdh4-repository_1.0_all.deb

Install the CDH4 package

sudo dpkg -i cdh4-repository_1.0_all.deb

Install the Cloudera Public GPG Key

curl -s \
http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key \
| sudo apt-key add -
sudo apt-get update

Install CDH4 with YARN in Pseudo mode

sudo apt-get install hadoop-conf-pseudo

This one command will install a large number of packages:

  • bigtop-jsvc
  • bigtop-utils
  • hadoop
  • hadoop-conf-pseudo
  • hadoop-hdfs
  • hadoop-hdfs-datanode
  • hadoop-hdfs-namenode
  • hadoop-hdfs-secondarynamenode
  • hadoop-mapreduce
  • hadoop-mapreduce-historyserver
  • hadoop-yarn
  • hadoop-yarn-nodemanager
  • hadoop-yarn-resourcemanager
  • zookeeper

View the installed files

It’s good practice to view the list of files installed by each package. Specifically, this is a good method to learn about all of the available configuration files.

dpkg -L hadoop-conf-pseudo
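Since the full dpkg listing can be long, it may help to filter it down to the files under /etc. The following is a small sketch with hypothetical helper names (only_etc, conf_files_for). It assumes the packages you pass in are installed.

```shell
# Keep only paths under /etc from a file list read on stdin.
only_etc() {
    grep '^/etc/' || true   # no match is not an error here
}

# Print the /etc paths installed by each of the given packages.
conf_files_for() {
    for pkg in "$@"; do
        dpkg -L "$pkg"
    done | only_etc | sort -u
}

# Example: conf_files_for hadoop-conf-pseudo hadoop-hdfs hadoop-yarn zookeeper
```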

Included in the list of files displayed by dpkg are the configuration files (and some other files):


Format the HDFS filesystem

sudo -u hdfs hdfs namenode -format

I received one warning message when I formatted the HDFS filesystem:
WARN common.Util: Path /var/lib/hadoop-hdfs/cache/hdfs/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.

Start HDFS

mkdir bin
cd bin
vi hadoop-hdfs-start

Paste the following code in hadoop-hdfs-start:

for service in /etc/init.d/hadoop-hdfs-*; do
  sudo "$service" start
done

chmod +x hadoop-hdfs-start
./hadoop-hdfs-start
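The same loop can be generalized so that one helper handles start, stop, and status. This is a sketch rather than anything shipped with CDH4: the function name hdfs_services and the SUDO override variable are my own inventions. Setting SUDO to an empty string runs the scripts without sudo, which is handy for dry runs against a test directory.

```shell
# Run ACTION (start, stop, status, ...) on every HDFS init script.
# INIT_DIR defaults to /etc/init.d.
hdfs_services() {
    action=$1
    init_dir=${2:-/etc/init.d}
    if [ -z "$action" ]; then
        echo "usage: hdfs_services ACTION [INIT_DIR]" >&2
        return 2
    fi
    for service in "$init_dir"/hadoop-hdfs-*; do
        [ -x "$service" ] || continue       # skip if the glob matched nothing
        ${SUDO-sudo} "$service" "$action"   # SUDO="" drops the sudo prefix
    done
}

# Example: hdfs_services status
```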

Open the NameNode web console at http://localhost:50070.

About the HDFS filesystem

The commands in the sections below are for creating directories in the HDFS filesystem. Importantly, the HDFS directory structure is not the same as the directory structure in ext4 (i.e. your main Linux directory structure).

To work with the HDFS directory structure, you prefix the Hadoop equivalents of standard Linux commands with sudo -u hdfs hadoop fs - (for example, ls becomes sudo -u hdfs hadoop fs -ls). Therefore, you will likely find it useful to create a .bash_aliases file that provides an easier way to type these commands.

I have created a sample .bash_aliases file in the following post: Create a .bash_aliases file

I have used my aliases to set up Hadoop as they are easier to type. For example, I use shls instead of sudo -u hdfs hadoop fs -ls, and hls instead of hadoop fs -ls.
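The linked post is not reproduced here, so the following is only a guess at what such a .bash_aliases file could contain. The alias names match the ones used in the rest of this post. The definitions are my assumption, inferred from those names: the h aliases run as the current user and the sh aliases run as the hdfs user.

```shell
# Regular-user aliases: run hadoop fs as the current user.
alias hls='hadoop fs -ls'
alias hmkdir='hadoop fs -mkdir'
alias hcat='hadoop fs -cat'

# "sudo" aliases: run hadoop fs as the hdfs superuser.
alias shls='sudo -u hdfs hadoop fs -ls'
alias shmkdir='sudo -u hdfs hadoop fs -mkdir'
alias shchmod='sudo -u hdfs hadoop fs -chmod'
alias shchown='sudo -u hdfs hadoop fs -chown'
```

After saving these lines in ~/.bash_aliases, open a new terminal (or run source ~/.bash_aliases) to load them.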

Create the HDFS /tmp directory

You don’t need to remove an old /tmp directory if this is the first time you’re installing Hadoop, but I’ll include the command here for completeness.

sudo -u hdfs hadoop fs -rm -r /tmp

Let’s create a new /tmp directory in HDFS.

shmkdir /tmp

Next, we’ll update the permissions on /tmp in HDFS.

shchmod -R 1777 /tmp

Create a user directory

Since this is a setup for development, we will only create one user directory. However, for a cluster or multi-user environment, you should create one user directory per MapReduce user.

Change akbar below to your username.

shmkdir /user/akbar
shchown akbar:akbar /user/akbar

Create the /var/log/hadoop-yarn directory

shmkdir /var/log/hadoop-yarn
shchown yarn:mapred /var/log/hadoop-yarn

Create the staging directory

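In the Hadoop version shipped with this CDH4 release, hadoop fs -mkdir creates missing parent directories by default, like mkdir -p on Linux. Later Hadoop releases require an explicit -p.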

shmkdir /tmp/hadoop-yarn/staging
shchmod -R 1777 /tmp/hadoop-yarn/staging

Create the done_intermediate directory

shmkdir /tmp/hadoop-yarn/staging/history/done_intermediate
shchmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate
shchown -R mapred:mapred /tmp/hadoop-yarn/staging

Verify that the directory structure is set up correctly

shls -R /

The output should look like:

drwxrwxrwt - hdfs supergroup 0 2012-06-25 15:11 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-06-25 15:11 /tmp/hadoop-yarn
drwxrwxrwt - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging
drwxr-xr-x - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging/history
drwxrwxrwt - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x - hdfs supergroup 0 2012-06-25 15:09 /user
drwxr-xr-x - akbar akbar 0 2012-06-25 15:09 /user/akbar
drwxr-xr-x - hdfs supergroup 0 2012-06-25 13:42 /var
drwxr-xr-x - hdfs supergroup 0 2012-06-25 13:42 /var/log
drwxr-xr-x - yarn mapred 0 2012-06-25 13:42 /var/log/hadoop-yarn
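Rather than comparing the listing by eye, you could check individual entries with a short awk filter. This is a sketch: check_entry is a hypothetical helper, and it assumes the eight-column listing format shown above (mode, replication, owner, group, size, date, time, path).

```shell
# Read "hadoop fs -ls -R" style lines on stdin and succeed only if
# PATH is listed with the expected MODE and OWNER.
check_entry() {
    awk -v path="$1" -v mode="$2" -v owner="$3" '
        $NF == path { found = 1; ok = ($1 == mode && $3 == owner) }
        END { exit !(found && ok) }'
}

# Example:
#   sudo -u hdfs hadoop fs -ls -R / | check_entry /tmp drwxrwxrwt hdfs && echo "/tmp OK"
```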

Start YARN

sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-nodemanager start
sudo service hadoop-mapreduce-historyserver start

I get the following error message when I start the MR History Server:
chown: changing ownership of `/var/log/hadoop-mapreduce': Operation not permitted

However, this error is not significant and can be ignored. It’ll likely be fixed in an update to CDH4/Hadoop.

Run an example application with YARN

We are going to run the sample YARN app as our regular user, so we’ll use the Hadoop aliases, such as hls, and not the sudo Hadoop aliases, such as shls.

In the first command, we’ll enter just a directory name of input. However, you’ll notice that the input directory is automatically created in our user directory, as /user/akbar/input.

hmkdir input

Let’s view our new directory in HDFS.

hls


Or, you can optionally specify the complete path.

hls /user/akbar

Next, we’ll put some files into the HDFS /user/akbar/input directory.

hadoop fs -put /etc/hadoop/conf/*.xml input

Set the HADOOP_MAPRED_HOME environment variable for the current user in our current session.

export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

Next, we’ll run the sample job, which is a simple grep of the files in the input directory and writes its results to the output directory. It’s worth noting that the .jar file is located on the local filesystem, while the input and output directories are in the HDFS filesystem.

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
grep input output 'dfs[a-z.]+'
hls output
hcat output/part-r-00000

If you have a longer file to view, piping into less will prove helpful, such as:

hcat output/part-r-00000 | less
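For intuition about what the grep example computes, here is a rough local equivalent that uses only standard tools and no Hadoop at all. It extracts every match of the pattern from the given files and counts how often each distinct match occurs, most frequent first. The function name local_grep_count is my own, and the output format (count, tab, match) is my approximation of what the job writes.

```shell
# Count occurrences of each distinct match of PATTERN in the given files,
# printing "count<TAB>match" with the most frequent match first.
local_grep_count() {
    pattern=$1
    shift
    grep -ohE -- "$pattern" "$@" | sort | uniq -c | sort -rn |
        awk '{ printf "%s\t%s\n", $1, $2 }'
}

# Example: local_grep_count 'dfs[a-z.]+' /etc/hadoop/conf/*.xml
```

Comparing this with hcat output/part-r-00000 is a quick way to convince yourself that the job ran over the files you expected.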

How to start the various Hadoop services

The following are some additional notes I took on starting the various Hadoop services (in the correct order).

Start the Hadoop namenode

sudo service hadoop-hdfs-namenode start

Start the Hadoop datanode service

sudo service hadoop-hdfs-datanode start

Start the Hadoop secondarynamenode service

sudo service hadoop-hdfs-secondarynamenode start

Start the Hadoop resourcemanager service

sudo service hadoop-yarn-resourcemanager start

Start the Hadoop nodemanager service

sudo service hadoop-yarn-nodemanager start

Start the Hadoop historyserver service

sudo service hadoop-mapreduce-historyserver start
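The six commands above can be wrapped in one helper that walks the services in the same order. This is a sketch under my own naming: hadoop_services, HADOOP_SERVICES, and the SERVICE_CMD override are hypothetical and not part of CDH4. Setting SERVICE_CMD=echo gives a dry run that only prints what would be executed.

```shell
# Services in the start order used in this post.
HADOOP_SERVICES="hadoop-hdfs-namenode hadoop-hdfs-datanode hadoop-hdfs-secondarynamenode hadoop-yarn-resourcemanager hadoop-yarn-nodemanager hadoop-mapreduce-historyserver"

# Run ACTION (for example start or status) on each service in order,
# stopping at the first failure.
hadoop_services() {
    action=$1
    for svc in $HADOOP_SERVICES; do
        # Unquoted on purpose so "sudo service" splits into two words.
        ${SERVICE_CMD:-sudo service} "$svc" "$action" || return 1
    done
}

# Example: hadoop_services start
# Dry run: SERVICE_CMD=echo; hadoop_services start
```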