Convert CDH4 from YARN (MRv2) to MRv1


I had configured only YARN in my original post on how to Install Cloudera Hadoop (CDH4) with YARN (MRv2) in Pseudo mode on Ubuntu 12.04 LTS.

Importantly, YARN is not yet considered production-ready, so we’ll go ahead and install MRv1 to get some production development done.

Stop the YARN Daemons

We first have to stop all daemons associated with the YARN-only packages.

sudo service hadoop-yarn-resourcemanager stop
sudo service hadoop-yarn-nodemanager stop
sudo service hadoop-mapreduce-historyserver stop

Install the Missing MRv1 Packages

Next, we’ll install the two packages that are required for MapReduce v1 but were not part of the MRv2/YARN installation.

sudo apt-get install hadoop-0.20-mapreduce-jobtracker
sudo apt-get install hadoop-0.20-mapreduce-tasktracker

Start the MapReduce v1 Daemons

sudo service hadoop-0.20-mapreduce-jobtracker start
sudo service hadoop-0.20-mapreduce-tasktracker start

HBase Command Line Tutorial


Start the HBase Shell

All subsequent commands in this post assume that you are in the HBase shell, which is started via the command listed below.

hbase shell

You should see output similar to:

12/08/12 12:30:52 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.92.1-cdh4.0.1, rUnknown, Thu Jun 28 18:13:01 PDT 2012

Create a Table

We will initially create a table named table1 with one column family named columnfamily1.

Using a long column family name, such as columnfamily1, is a bad idea in production. Every cell (i.e. every value) in HBase is stored fully qualified, which means long column family names balloon the amount of disk space required to store your data. In short, keep your column family names as terse as possible.
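To get a feel for the overhead, here is a rough back-of-the-envelope sketch in Python. It is illustrative only: it counts just the row key, column family, and qualifier bytes of each cell key, ignoring HBase's fixed-size KeyValue fields.

```python
# Every HBase cell stores the row key, the full column family name,
# and the qualifier alongside the value.
def cell_key_bytes(row_key: str, family: str, qualifier: str) -> int:
    # row + family + qualifier, ignoring fixed-size fields (timestamp, type)
    return len(row_key) + len(family) + len(qualifier)

long_cf = cell_key_bytes('row1', 'columnfamily1', 'make')   # 21 bytes
short_cf = cell_key_bytes('row1', 'cf1', 'make')            # 11 bytes

# With a billion cells, a 10-byte difference is roughly 10 GB of extra key data.
print(long_cf - short_cf)  # 10
```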

create 'table1', 'columnfamily1'

List all Tables

We can see all tables in HBase with the list command.

list

You’ll see output similar to:

table1 1 row(s) in 0.0370 seconds

Let’s now create a second table so that we can see some of the features of the list command.

create 'test', 'cf1'

If you run list again, you will see output similar to:

table1
test
2 row(s) in 0.0320 seconds

If we only want to see the test table, or all tables whose names start with “te”, we can pass a regular expression to the list command.

list 'te.*'

Manually Insert Data into HBase

If you’re using HBase, then you likely have data sets that are TBs in size. As a result, you’ll never actually insert data manually. However, knowing how to insert data manually could prove useful at times.

To start, I’m going to create a new table named cars. My column family is vi, which is an abbreviation of vehicle information.

The schema that follows below is only for illustration purposes, and should not be used to create a production schema. In production, you should create a Row ID that helps to uniquely identify the row, and that is likely to be used in your queries. Therefore, one possibility would be to shift the Make, Model and Year left and use these items in the Row ID.
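As a sketch of that idea, here is a hypothetical helper (plain Python, not the HBase API; the `|` delimiter and the lowercasing are arbitrary choices) that shifts Make, Model and Year into a composite Row ID:

```python
def make_row_id(make: str, model: str, year: str) -> str:
    # Hypothetical composite row key: rows for the same make/model sort
    # together, so range scans on a make prefix become cheap.
    return '|'.join(part.lower().replace(' ', '')
                    for part in (make, model, year))

print(make_row_id('bmw', '5 series', '2012'))  # bmw|5series|2012
```

With a key like this, a put such as put 'cars', 'bmw|5series|2012', 'vi:color', 'black' keeps the identifying attributes in the Row ID itself.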

create 'cars', 'vi'

Let’s insert 3 column qualifiers (make, model, year) and the associated values into the first row (row1).

put 'cars', 'row1', 'vi:make', 'bmw'
put 'cars', 'row1', 'vi:model', '5 series'
put 'cars', 'row1', 'vi:year', '2012'

Now let’s add a second row.

put 'cars', 'row2', 'vi:make', 'mercedes'
put 'cars', 'row2', 'vi:model', 'e class'
put 'cars', 'row2', 'vi:year', '2012'

Scan a Table (i.e. Query a Table)

We’ll start with a basic scan that returns all columns in the cars table.

scan 'cars'

You should see output similar to:

 row1          column=vi:make, timestamp=1344817012999, value=bmw
 row1          column=vi:model, timestamp=1344817020843, value=5 series
 row1          column=vi:year, timestamp=1344817033611, value=2012
 row2          column=vi:make, timestamp=1344817104923, value=mercedes
 row2          column=vi:model, timestamp=1344817115463, value=e class
 row2          column=vi:year, timestamp=1344817124547, value=2012
2 row(s) in 0.6900 seconds

Reading the output above you’ll notice that the Row ID is listed under ROW. The COLUMN+CELL field shows the column family after column=, then the column qualifier, a timestamp that is automatically created by HBase, and the value.

Importantly, each row in our results shows an individual row id + column family + column qualifier combination. Therefore, you’ll notice that multiple columns in a row are displayed in multiple rows in our results.
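To make the cell-per-line layout concrete, here is a small Python sketch (the sample data mirrors the scan output above) that groups individual result cells back into logical rows:

```python
# Each scan result line is one (row id, column, value) cell; grouping
# cells by row id reconstructs the logical rows.
cells = [
    ('row1', 'vi:make', 'bmw'),
    ('row1', 'vi:model', '5 series'),
    ('row1', 'vi:year', '2012'),
    ('row2', 'vi:make', 'mercedes'),
]

rows = {}
for row_id, column, value in cells:
    rows.setdefault(row_id, {})[column] = value

print(rows['row1']['vi:model'])  # 5 series
print(len(rows))                 # 2  (two logical rows from four result lines)
```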

The next scan we’ll run will limit our results to the make column qualifier.

scan 'cars', {COLUMNS => ['vi:make']}

If you have a particularly large result set, you can limit the number of rows returned with the LIMIT option. In this example I arbitrarily limit the results to 1 row to demonstrate how LIMIT works.

scan 'cars', {COLUMNS => ['vi:make'], LIMIT => 1}

To learn more about the scan command enter the following:

help 'scan'

Get One Row

The get command allows you to get one row of data at a time. You can optionally limit the number of columns returned.

We’ll start by getting all columns in row1.

get 'cars', 'row1'

You should see output similar to:

COLUMN                   CELL
 vi:make                 timestamp=1344817012999, value=bmw
 vi:model                timestamp=1344817020843, value=5 series
 vi:year                 timestamp=1344817033611, value=2012
3 row(s) in 0.0150 seconds

When looking at the output above, you should notice how the results under COLUMN show the fully qualified column family:column qualifier, such as vi:make.

To get one specific column include the COLUMN option.

get 'cars', 'row1', {COLUMN => 'vi:model'}

You can also get two or more columns by passing an array of columns.

get 'cars', 'row1', {COLUMN => ['vi:model', 'vi:year']}

To learn more about the get command enter:

help 'get'

Delete a Cell (Value)

delete 'cars', 'row2', 'vi:year'

Let’s check that our delete worked.

get 'cars', 'row2'

You should see output that shows 2 columns.

vi:make   timestamp=1344817104923, value=mercedes
vi:model   timestamp=1344817115463, value=e class
2 row(s) in 0.0080 seconds

Disable and Delete a Table

disable 'cars'
drop 'cars'
disable 'table1'
drop 'table1'
disable 'test'
drop 'test'

View HBase Command Help

help

Exit the HBase Shell

exit

Debugging HBase: org.apache.hadoop.hbase.master.AssignmentManager: Unable to find a viable location to assign region -ROOT


I ran into an annoying error in HBase caused by localhost loopback resolution. The solution was simple, but took some trial and error to find.


I was following the HBase logs with the following command:

tail -1000f /var/log/hbase/hbase-hbase-master-freshstart.log

The following error kept popping up in the log file.

org.apache.hadoop.hbase.master.AssignmentManager: Unable to find a viable location to assign region -ROOT


To fix it, I edited the hosts file.

sudo vi /etc/hosts

I commented out the default loopback entries and mapped my hostname to the machine’s internal IP address (freshstart is my hostname; substitute your own hostname and internal IP):

# 127.0.0.1      localhost
# 127.0.1.1      freshstart
<internal-ip>    freshstart      localhost

At this point I rebooted as a quick and dirty way to restart all Hadoop / HBase services. Alternatively, you can start/stop all services.

Additional Thoughts

Updating the hosts file is an option for me currently because everything is installed on a single machine. However, this error is fundamentally a name-resolution issue, so a properly configured DNS server (or consistent hosts files across nodes) is likely necessary when deploying Hadoop / HBase in a production cluster.
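As a quick sanity check for this class of problem, a short Python snippet (standard library only; resolves_to_loopback is a hypothetical helper name) can tell you whether a given hostname resolves to the loopback address:

```python
import socket

def resolves_to_loopback(hostname: str) -> bool:
    """Return True if the name resolves to a loopback (127.x.x.x) address --
    the situation that triggered the region assignment error above."""
    return socket.gethostbyname(hostname).startswith('127.')

# 'localhost' should always hit the loopback; your cluster hostname should not.
print(resolves_to_loopback('localhost'))  # True
```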

Install Pyramid on Ubuntu 12.04 LTS in the Rackspace Cloud

Check the Installed Python Version

python --version

You should see the following output:

Python 2.7.3

Install Prerequisites

apt-get install python-setuptools python-pip python-virtualenv virtualenvwrapper

Install Prerequisites for Pyramid Speedups

apt-get install gcc cpp libc6-dev python2.7-dev

Install nginx

apt-get install nginx nginx-full nginx-common

Create a wwwuser that waitress (the web server) will run as

useradd wwwuser -d /home/wwwuser -k /etc/skel -m -s /bin/bash -U

Setup the Virtual Environment

mkdir -p /var/www/
mkdir /var/www/environments
cd /var
chown -R wwwuser:wwwuser www

We are now going to switch to the wwwuser user.

su - wwwuser
cd /var/www/environments
virtualenv env_delixus

Install Pyramid

You must perform the following steps as the wwwuser user.

cd /var/www/environments/env_delixus
source bin/activate

You should see the environment name as the prefix in the command prompt, such as:

(env_delixus)wwwuser@yourserver:/var/www/environments/env_delixus$
easy_install Pyramid
pip install waitress

Checkout the Pyramid Project

cd /var/www/

Change the SVN checkout command below to match your own repository. If you use git, substitute the appropriate git clone command.

svn checkout .

Install the Pyramid project

cd /var/www/
vi production.ini

Below the existing [app:main] section, add a [server:main] configuration as follows:

[server:main]
use = egg:waitress#main
host = 0.0.0.0
port = %(http_port)s
# default number of threads = 4
threads = 8
url_scheme = http

I don’t think you should need to install the development version of the site, but it seemed to be the only way I could get everything to work while debugging…go figure.

python setup.py develop
pserve development.ini

Then open the site in a text-based web browser.

links http://localhost:6543

You should be able to view your site at this point.

Now, let’s install the production version of the site.

python setup.py install

Start Waitress

First we’re going to start and test waitress, then we’ll start it as a daemon.

pserve production.ini http_port=5000
links http://localhost:5000

Again, you should be able to view your site.

pserve production.ini start --daemon --pid-file=/var/www/ \
--log-file=/var/www/5000.log --monitor-restart http_port=5000
pserve production.ini start --daemon --pid-file=/var/www/ \
--log-file=/var/www/5001.log --monitor-restart http_port=5001

Check the waitress process.

ps -ef | grep pserve

You should see the pserve process running.

Configure nginx as a Proxy for Waitress

The following steps must be performed as root.

cd /etc/nginx/sites-available
vi delixus

Paste the following into the delixus file.

upstream delixus-site {
    server 127.0.0.1:5000;
    server 127.0.0.1:5001;
}

server {
    listen 80;
    server_name  localhost;

    access_log  /var/log/nginx/;

    location / {
        proxy_set_header        Host $host;
        proxy_set_header        X-Real-IP $remote_addr;
        proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header        X-Forwarded-Proto $scheme;

        client_max_body_size    10m;
        client_body_buffer_size 128k;
        proxy_connect_timeout   60s;
        proxy_send_timeout      90s;
        proxy_read_timeout      90s;
        proxy_buffering         off;
        proxy_temp_file_write_size 64k;
        proxy_pass              http://delixus-site;
        proxy_redirect          off;
    }

    location /static {
        root            /var/www/;
        expires         30d;
        add_header      Cache-Control public;
        access_log      off;
    }
}

rm /etc/nginx/sites-enabled/default
ln -s /etc/nginx/sites-available/delixus /etc/nginx/sites-enabled/delixus
service nginx stop
service nginx start

A good next step at this point is to set up Supervisor to control pserve/waitress.
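As a starting point, a minimal Supervisor program block might look like the following. This is a sketch only: the file location, program name, and the virtualenv’s pserve path are assumptions based on the paths used above.

```ini
; /etc/supervisor/conf.d/delixus.conf (assumed location)
[program:delixus-5000]
command = /var/www/environments/env_delixus/bin/pserve /var/www/production.ini http_port=5000
directory = /var/www
user = wwwuser
autostart = true
autorestart = true
```

You would add a second block for port 5001, matching the two waitress instances behind the nginx upstream.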