Hadoop Distributions

The following is a repost of my answer to a question on LinkedIn, but I thought it may prove useful to people evaluating Hadoop distributions.

The following is a substantially over simplified set of choices (in alphabetical order):

Amazon: Apache Hadoop provided as a web service. Good solution if your data is collected on Amazon…saves you the trouble of uploading gigs and gigs of data.

Apache: Apache Hadoop is the core code based upon which the various distributions are based.

Cloudera: CHD3 is based on Hadoop 1 (the current stable version) and CDH4 is based on Hadoop 2. CDH is based on Apache Hadoop. The only piece that’s not open source (AFAIK) is Cloudera Manager, which allows you to install up to 50 nodes for free before you go to the paid version. Cloudera is an extremely popular solution that runs on a wide variety of operating systems.

Hortonworks: HDP1 is 100% open source and is based on Hadoop 1. HDP is designed to run on RedHat/CentOS/Oracle Linux.

IBM: IBM BigInsights adds the GPFS filesystem to Hadoop, and is a good choice if your company already is an IBM shop…and you need to integrate with other IBM solutions. Free version is available as InfoSphere BigInsights Basic Edition. Basic Edition does not include all of the value add features found in Enterprise Edition (such as GPFS-SNC).

MapR: MapR uses a proprietary file system plus additional changes to Hadoop that addresses issues with the platform. They have a shared nothing architeture for the NameNode and JobTracker. MapR M3 is available for free, while M5 is a paid version with more features (such as the shared nothing NameNode). People who have used MapR tend to like it.

Understing the Hadoop High Availability (HA) Options

Once you start to use Hadoop in your day-to-day business operations, you’ll quickly find that uptime is an important consideration. No one wants to explain to the CEO why a report is not delivered. While most of Hadoop’s architecture is designed to work in the face of node failure (such as the DataNodes), other components such as the NameNode must be configured with an HA option.

The following is a quick and dirty list of Hadoop HA options:

  • Cloudera CDH4 (free)
    • Uses shared storage
  • Hortonworks (free)
    • Option 1: Use Linux HA (Uses shared storage)
    • Option 2: Use VMWare
  • IBM BigInsights ($$$)
    • GPFS-SNC: Provides a shared nothing HA option
  • MapR M5 ($$$)
    • Shared nothing HA for both NameNode and JobTracker


If you’re brave, you can also apply Facebook’s patches to Apache Hadoop to get an “Avatar” based HA option. This is what FB uses in production.

HBase Command Line Tutorial


Start the HBase Shell

All subsequent commands in this post assume that you are in the HBase shell, which is started via the command listed below.

hbase shell

You should see output similar to:

12/08/12 12:30:52 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.92.1-cdh4.0.1, rUnknown, Thu Jun 28 18:13:01 PDT 2012

Create a Table

We will initially create a table named test with one column family named columnfamily1.

Using a long column family name, such as columnfamily1 is a horrible idea in production. Every cell (i.e. every value) in HBase is stored fully qualified. This basically means that long column family names will balloon the amount of disk space required to store your data. In summary, keep your column family names as terse as possible.

create 'table1', 'columnfamily1'

List all Tables


You’ll see output similar to:

table1 1 row(s) in 0.0370 seconds

Let’s now create a second table so that we can see some of the features of the list command.

create 'test', 'cf1'

You will see output similar to:

2 row(s) in 0.0320 seconds

If we only want to see the test table, or all tables that start with “te”, we can use the following command.

list 'te'


list 'te.*'

Manually Insert Data into HBase

If you’re using HBase, then you likely have data sets that are TBs in size. As a result, you’ll never actually insert data manually. However, knowing how to insert data manually could prove useful at times.

To start, I’m going to create a new table named cars. My column family is vi, which is an abbreviation of vehicle information.

The schema that follows below is only for illustration purposes, and should not be used to create a production schema. In production, you should create a Row ID that helps to uniquely identify the row, and that is likely to be used in your queries. Therefore, one possibility would be to shift the Make, Model and Year left and use these items in the Row ID.

create 'cars', 'vi'

Let’s insert 3 column qualifies (make, model, year) and the associated values into the first row (row1).

put 'cars', 'row1', 'vi:make', 'bmw'
put 'cars', 'row1', 'vi:model', '5 series'
put 'cars', 'row1', 'vi:year', '2012'

Now let’s add a second row.

put 'cars', 'row2', 'vi:make', 'mercedes'
put 'cars', 'row2', 'vi:model', 'e class'
put 'cars', 'row2', 'vi:year', '2012'

Scan a Table (i.e. Query a Table)

We’ll start with a basic scan that returns all columns in the cars table.

scan 'cars'

You should see output similar to:

 row1          column=vi:make, timestamp=1344817012999, value=bmw
 row1          column=vi:model, timestamp=1344817020843, value=5 series
 row1          column=vi:year, timestamp=1344817033611, value=2012
 row2          column=vi:make, timestamp=1344817104923, value=mercedes
 row2          column=vi:model, timestamp=1344817115463, value=e class
 row2          column=vi:year, timestamp=1344817124547, value=2012
2 row(s) in 0.6900 seconds

Reading the output above you’ll notice that the Row ID is listed under ROW. The COLUMN+CELL field shows the column family after column=, then the column qualifier, a timestamp that is automatically created by HBase, and the value.

Importantly, each row in our results shows an individual row id + column family + column qualifier combination. Therefore, you’ll notice that multiple columns in a row are displayed in multiple rows in our results.

The next scan we’ll run will limit our results to the make column qualifier.

scan 'cars', {COLUMNS => ['vi:make']}

If you have a particularly large result set, you can limit the number of rows returned with the LIMIT option. In this example I arbitrarily limit the results to 1 row to demonstrate how LIMIT works.

scan 'cars', {COLUMNS => ['vi:make'], LIMIT => 1}

To learn more about the scan command enter the following:

help 'scan'

Get One Row

The get command allows you to get one row of data at a time. You can optionally limit the number of columns returned.

We’ll start by getting all columns in row1.

get 'cars', 'row1'

You should see output similar to:

COLUMN                   CELL
 vi:make                 timestamp=1344817012999, value=bmw
 vi:model                timestamp=1344817020843, value=5 series
 vi:year                 timestamp=1344817033611, value=2012
3 row(s) in 0.0150 seconds

When looking at the output above, you should notice how the results under COLUMN show the fully qualified column family:column qualifier, such as vi:make.

To get one specific column include the COLUMN option.

get 'cars', 'row1', {COLUMN => 'vi:model'}

You can also get two or more columns by passing an array of columns.

get 'cars', 'row1', {COLUMN => ['vi:model', 'vi:year']}

To learn more about the get command enter:

help 'get'

Delete a Cell (Value)

delete 'cars', 'row2', 'vi:year'

Let’s check that our delete worked.

get 'cars', 'row2'

You should see output that shows 2 columns.

vi:make   timestamp=1344817104923, value=mercedes
vi:model   timestamp=1344817115463, value=e class
2 row(s) in 0.0080 seconds

Disable and Delete a Table

disable 'cars'
drop 'cars'
disable 'table1'
drop 'table1'
disable 'test'
drop 'test'

View HBase Command Help


Exit the HBase Shell


Debugging HBase: org.apache.hadoop.hbase.master.AssignmentManager: Unable to find a viable location to assign region -ROOT


I ran into an annoying error in HBase due to the localhost loopback. The solution was simple, but took some trial and error.


I was following the HBase logs with the following command:

tail -1000f /var/log/hbase/hbase-hbase-master-freshstart.log

The following error kept poping up in the log file.

org.apache.hadoop.hbase.master.AssignmentManager: Unable to find a viable location to assign region -ROOT


sudo vi /etc/hosts

I changed:       localhost       freshstart


#      localhost
#      freshstart   freshstart      localhost is my internal IP address, and freshstart is my hostname.

At this point I rebooted as a quick and dirty way to restart all Hadoop / HBase services. Alternatively, you can start/stop all services.

Additional Thoughts

Updating the hosts file is an option for me currently since I have everything installed on a single machine. However, it seems that this error is a name resolution related issue, so a properly configured DNS server is likely necessary when deploying Hadoop / HBase in a production cluster.

How to add numbers with Pig


We’re going to start with a very simple Pig script that reads a file that contains 2 numbers per line separated by a comma. The Pig script will first read the line, store each of the 2 numbers in separate variables, and will then add the numbers together.

Create the Sample Input File

vi pig-practice01.txt

Paste the following into pig-practice01.txt.

5	1
6	4
3	2
1	1
9	2
3	8

Create the Input and Output Directories in HDFS

We’re going to create 2 directories to store the input to and output from our first pig script.

hadoop fs -mkdir pig01-input
hadoop fs -mkdir pig01-output

Put Data File into HDFS

hadoop fs -put pig-practice01.txt pig01-input

Now, let’s check that our file was put from our local file system to HDFS correctly.

hadoop fs -ls pig01-input
hadoop fs -cat pig01-input/pig-practice01.txt

Write the Pig Latin Script

vi practice01.pig

Paste the following code into practice01.pig.

Add 2 numbers together

-- Load the practice file from HDFS
A = LOAD 'pig01-input/pig-practice01.txt' USING PigStorage() AS (x:int, y:int);

-- Add x and y 

-- Show the output
STORE B INTO 'pig01-output/results' USING PigStorage();

Run the Pig Script

pig practice01.pig

View the Results

hadoop fs -ls pig01-output/results

The results are stored in the part* file.

hadoop fs -cat pig01-output/results/part-m-0000

Additional Reading

Install Pig 0.9.2 for CDH4 on Ubuntu 12.04 LTS x64


Installing Pig is drop dead simple.


sudo apt-get install pig

Check the Pig version.

pig --version

Setup the Environment

We’re going to set the environment variables system-wide for Pig programming.

sudo vi /etc/environment

Paste the following environment variables into the environment file.

source  /etc/environment

That’s it. You can now start to write and run pig jobs.

How to view files in HDFS (hadoop fs -ls)


The hadoop fs -ls command allows you to view the files and directories in your HDFS filesystem, much as the ls command works on Linux / OS X / *nix.

Default Home Directory in HDFS
A user’s home directory in HDFS is located at /user/userName. For example, my home directory is /user/akbar.

List the Files in Your Home Directory

hadoop fs -ls defaults to /user/userName, so you can leave the path blank to view the contents of your home directory.

hadoop fs -ls

Recursively List Files

The following command will recursively list all files in the /tmp/hadoop-yarn directory.

hadoop fs -ls -R /tmp/hadoop-yarn

Show List Output in Human Readable Format

Human readable format will show each file’s size, such as 1461, as 1.4k.

hadoop fs -ls -h /user/akbar/input

You will see output similar to:
-rw-r–r–   1 akbar akbar       1.4k 2012-06-25 16:45 /user/akbar/input/core-site.xml
-rw-r–r–   1 akbar akbar       1.8k 2012-06-25 16:45 /user/akbar/input/hdfs-site.xml
-rw-r–r–   1 akbar akbar       1.3k 2012-06-25 16:45 /user/akbar/input/mapred-site.xml
-rw-r–r–   1 akbar akbar       2.2k 2012-06-25 16:45 /user/akbar/input/yarn-site.xml

List Information About a Directory

By default, hadoop fs -ls shows the contents of a directory. But what if you want to view information about the directory, not the directory’s contents?

To show information about a directory, use the -d option.

hadoop fs -ls -d /user/akbar

drwxr-xr-x   – akbar akbar          0 2012-07-07 02:28 /user/akbar/

Compare the output above to the output without the -d option:
drwxr-xr-x   – akbar akbar          0 2012-06-25 16:45 /user/akbar/input
drwxr-xr-x   – akbar akbar          0 2012-06-25 17:09 /user/akbar/output
-rw-r–r–   1 akbar akbar          3 2012-07-07 02:28 /user/akbar/text.hdfs

Show the Usage Statement

hadoop fs -usage ls

The output will be:

Usage: hadoop fs [generic options] -ls [-d] [-h] [-R] [ ...]

hdfs dfsadmin -metasave


hdfs dfsadmin -metasave provides additional information compared to hdfs dfsadmin -report. With hdfs dfsadmin -metasave provides information about blocks, including>

  • blocks waiting for replication
  • blocks currently being replication
  • total number of blocks

hdfs dfsadmin -metasave filename.txt

Run the command with sudo -u hdfs prefixed to ensure you don’t get a permission denied error. CDH4 runs the namenode as the hdfs user by default. However if you have changed the

ssudo -u hdfs hdfs dfsadmin -metasave metasave-report.txt

You will see output similar to:

Created file metasave-report.txt on server hdfs://localhost:8020

The output above initially confused me as I thought the metasave report was saved to the HDFS filesystem. However, it’s stating the the metasave report is saved into the /var/log/hadoop-hdfs directory on localhost.

cd /var/log/hadoop-hdfs
cat metasave-report.txt

You will see output similar to:

58 files and directories, 17 blocks = 75 total
Live Datanodes: 1
Dead Datanodes: 0
Metasave: Blocks waiting for replication: 0
Mis-replicated blocks that have been postponed:
Metasave: Blocks being replicated: 0
Metasave: Blocks 0 waiting deletion from 0 datanodes.
Metasave: Number of datanodes: 1 IN 247241674752(230.26 GB) 323584(316 KB) 0% 220983930880(205.81 GB) Sat Jul 14 18:52:49 PDT 2012

hdfs dfsadmin -report


hdfs dfsadmin -report outputs a brief report on the overall HDFS filesystem. It’s a userful command to quickly view how much disk is available, how many datanodes are running, and so on.


Run the command with sudo -u hdfs prefixed to ensure you don’t get a permission denied error. CDH4 runs the namenode as the hdfs user by default. However if you have changed the

sudo -u hdfs hdfs dfsadmin -report

You will see output similar to:

Configured Capacity: 247241674752 (230.26 GB)
Present Capacity: 221027041280 (205.85 GB)
DFS Remaining: 221026717696 (205.85 GB)
DFS Used: 323584 (316 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

Datanodes available: 1 (1 total, 0 dead)

Live datanodes:
Name: (localhost)
Hostname: freshstart
Decommission Status : Normal
Configured Capacity: 247241674752 (230.26 GB)
DFS Used: 323584 (316 KB)
Non DFS Used: 26214633472 (24.41 GB)
DFS Remaining: 221026717696 (205.85 GB)
DFS Used%: 0%
DFS Remaining%: 89.4%
Last contact: Sat Jul 14 18:07:18 PDT 2012

Depricated Command

hadoop dfsadmin -report is a deprecated command. If you enter hadoop dfsadmin -report, you will see the report with the following note at the top of the output.

DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Install HBase 0.92.1 for Cloudera Hadoop (CHD4) in Pseudo mode on Ubuntu 12.04 LTS


HBase is a tabular-oriented database that runs on top of HDFS. It is modeled on Google’s BigTable.

In this post, I’m going to install HBase in Pseudo mode, so please use these instructions for setting up a developer’s workstation, not for a production cluster.

When should you use HBase

HBase should be used when you need random read/write access to the data in Hadoop. While HBase gives you random seeks, it does so at the expense of performance vs. HDFS. Therefore, it is important to look at your workload and pick the correct solution for your specific requirements.

Install Zookeeper

Install Zookeeper before installing HBase.

Install Prerequisites

sudo apt-get install ntp libopts25


sudo apt-get install hbase

Let’s see what files were installed. I have written an HBase Files and Directories post that contains more information about what’s installed with the hbase package.

dpkg -L hbase | less
sudo apt-get install hbase-master

Next, we’ll stop the HBase Master.

sudo service hbase-master stop

Configure HBase to run in pseudo mode

Let’s check the hostname and port used by the HDFS Name Node.

grep -A 1 fs.default.name /etc/hadoop/conf.pseudo/core-site.xml | grep value

You should see output of:

cd /etc/hbase/conf; ls -l
sudo vi hbase-site.xml

Paste the following into hbase-site.xml, between <configuration> and </configuration>.


Add the /hbase directory to HDFS

The following commands assume that you’ve followed the instructions in my post on how to Create a .bash_aliases file.

shmkdir /hbase
shchown hbase /hbase

Let’s check that the /hbase directory was created correctly in HDFS.

hls /

You should see output that includes a line for the /hbase directory.

Start the HBase Master

sudo service hbase-master start

Install an HBase Region Server

The HBase Region Server is started automatically when you install it in Ubuntu.

sudo apt-get install hbase-regionserver

Check that HBase is Setup Correctly

sudo /usr/lib/jvm/jdk1.6.0_31/bin/jps

You should see output similar to the following (look for QuorumPeerMain, NameNode, DataNode, HRegionServer, and HMaster):

1942   SecondaryNameNode
12783  QuorumPeerMain
1747   NameNode
1171   DataNode
15034  HRegionServer
14755  HMaster
2396   NodeManager
2497   ResourceManager
2152   JobHistoryServer
15441  Jps

Open http://localhost:60010 in a web browser to verify that the HBase Master was installed correctly.

If everything installed correctly then you should see the following:

  • In the Region Servers section, there should be one line for localhost.
  • In the Attributes section, you should see HBase Version = 0.92.1-cdh4.0.0.

Add the JDK 1.6.0 u31 Path to BigTop

This update is required as BigTop uses a fixed array approach to finding JAVA_HOME.

sudo vi /usr/lib/bigtop-utils/bigtop-detect-javahome

Add the following line just below the for candidate in \ line:

/usr/lib/jvm/jdk1.6.0_31 \

Update the hosts file

It’s likely that you’ll get an error due to the localhost loopback.

Update the /etc/hosts file (note: The page that contains these instructions was originally written during HBase debugging).

That’s it. You now have HBase installed and ready for use on a developer’s workstation/laptop.

Additional Reading

There are some additional configuration options for HBase, including: