AkbarAhmed.com

Engineering Leadership

Introduction

In my original post, Install Cloudera Hadoop (CDH4) with YARN (MRv2) in Pseudo mode on Ubuntu 12.04 LTS, I configured only YARN.

Importantly, YARN is not yet ready for production, so we'll go ahead and install MRv1 to get on with production work.

Stop the YARN Daemons

We first have to stop all daemons associated with the YARN-only packages.

sudo service hadoop-yarn-resourcemanager stop
sudo service hadoop-yarn-nodemanager stop
sudo service hadoop-mapreduce-historyserver stop

Install the Missing MRv1 Packages

Next, we'll install 2 packages that are required for MapReduce v1 but were not part of the MRv2/YARN installation.

sudo apt-get install hadoop-0.20-mapreduce-jobtracker
sudo apt-get install hadoop-0.20-mapreduce-tasktracker

Start the MapReduce v1 Daemons

sudo service hadoop-0.20-mapreduce-jobtracker start
sudo service hadoop-0.20-mapreduce-tasktracker start
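To confirm that both daemons came up cleanly, you can check their status (a quick sanity check; the exact output varies by CDH version):

sudo service hadoop-0.20-mapreduce-jobtracker status
sudo service hadoop-0.20-mapreduce-tasktracker status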

Introduction

Start the HBase Shell

All subsequent commands in this post assume that you are in the HBase shell, which is started via the command listed below.

hbase shell

You should see output similar to:


12/08/12 12:30:52 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.92.1-cdh4.0.1, rUnknown, Thu Jun 28 18:13:01 PDT 2012

Create a Table

We will initially create a table named table1 with one column family named columnfamily1.

Using a long column family name, such as columnfamily1, is a horrible idea in production. Every cell (i.e. every value) in HBase is stored fully qualified. This means that long column family names will balloon the amount of disk space required to store your data. In summary, keep your column family names as terse as possible.

create 'table1', 'columnfamily1'
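If you want to verify the table's settings, the describe command shows its column families and their options (standard HBase shell, not specific to this example):

describe 'table1'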

List all Tables

list

You’ll see output similar to:


TABLE
table1
1 row(s) in 0.0370 seconds

Let’s now create a second table so that we can see some of the features of the list command.

create 'test', 'cf1'
list

You will see output similar to:

TABLE
table1
test
2 row(s) in 0.0320 seconds

If we only want to see the test table, or all tables that start with “te”, we can use the following command.

list 'te'

or

list 'te.*'

Manually Insert Data into HBase

If you’re using HBase, then you likely have data sets that are TBs in size. As a result, you’ll never actually insert data manually. However, knowing how to insert data manually could prove useful at times.

To start, I’m going to create a new table named cars. My column family is vi, which is an abbreviation of vehicle information.

The schema that follows below is only for illustration purposes, and should not be used to create a production schema. In production, you should create a Row ID that helps to uniquely identify the row, and that is likely to be used in your queries. Therefore, one possibility would be to shift the Make, Model and Year left and use these items in the Row ID.

create 'cars', 'vi'

Let's insert 3 column qualifiers (make, model, year) and their associated values into the first row (row1).

put 'cars', 'row1', 'vi:make', 'bmw'
put 'cars', 'row1', 'vi:model', '5 series'
put 'cars', 'row1', 'vi:year', '2012'

Now let’s add a second row.

put 'cars', 'row2', 'vi:make', 'mercedes'
put 'cars', 'row2', 'vi:model', 'e class'
put 'cars', 'row2', 'vi:year', '2012'

Scan a Table (i.e. Query a Table)

We’ll start with a basic scan that returns all columns in the cars table.

scan 'cars'

You should see output similar to:

ROW           COLUMN+CELL
 row1          column=vi:make, timestamp=1344817012999, value=bmw
 row1          column=vi:model, timestamp=1344817020843, value=5 series
 row1          column=vi:year, timestamp=1344817033611, value=2012
 row2          column=vi:make, timestamp=1344817104923, value=mercedes
 row2          column=vi:model, timestamp=1344817115463, value=e class
 row2          column=vi:year, timestamp=1344817124547, value=2012
2 row(s) in 0.6900 seconds

Reading the output above, you'll notice that the Row ID is listed under ROW. The COLUMN+CELL field shows the column family after column=, then the column qualifier, a timestamp that is automatically created by HBase, and the value.

Importantly, each row in our results shows an individual row id + column family + column qualifier combination. Therefore, you’ll notice that multiple columns in a row are displayed in multiple rows in our results.

The next scan we’ll run will limit our results to the make column qualifier.

scan 'cars', {COLUMNS => ['vi:make']}

If you have a particularly large result set, you can limit the number of rows returned with the LIMIT option. In this example I arbitrarily limit the results to 1 row to demonstrate how LIMIT works.

scan 'cars', {COLUMNS => ['vi:make'], LIMIT => 1}
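Another scan option worth knowing is STARTROW, which begins the scan at a given row id. For example, to start at row2 (a small sketch using the same cars table):

scan 'cars', {STARTROW => 'row2'}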

To learn more about the scan command enter the following:

help 'scan'

Get One Row

The get command allows you to get one row of data at a time. You can optionally limit the number of columns returned.

We’ll start by getting all columns in row1.

get 'cars', 'row1'

You should see output similar to:


COLUMN                   CELL
 vi:make                 timestamp=1344817012999, value=bmw
 vi:model                timestamp=1344817020843, value=5 series
 vi:year                 timestamp=1344817033611, value=2012
3 row(s) in 0.0150 seconds

When looking at the output above, you should notice how the results under COLUMN show the fully qualified column family:column qualifier, such as vi:make.

To get one specific column include the COLUMN option.

get 'cars', 'row1', {COLUMN => 'vi:model'}

You can also get two or more columns by passing an array of columns.

get 'cars', 'row1', {COLUMN => ['vi:model', 'vi:year']}

To learn more about the get command enter:

help 'get'

Delete a Cell (Value)

delete 'cars', 'row2', 'vi:year'

Let’s check that our delete worked.

get 'cars', 'row2'

You should see output that shows 2 columns.


COLUMN                   CELL
 vi:make                 timestamp=1344817104923, value=mercedes
 vi:model                timestamp=1344817115463, value=e class
2 row(s) in 0.0080 seconds
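As an aside, delete removes a single cell. If you want to remove an entire row, the HBase shell also provides deleteall. Running it here would wipe out the rest of row2, so only use it if you're done with that data:

deleteall 'cars', 'row2'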

Disable and Delete a Table

disable 'cars'
drop 'cars'
disable 'table1'
drop 'table1'
disable 'test'
drop 'test'
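If you want to double check that a table is really gone, the exists command reports whether HBase still knows about it (shown here for cars as an example):

exists 'cars'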

View HBase Command Help

help

Exit the HBase Shell

exit

Introduction

I ran into an annoying error in HBase due to the localhost loopback. The solution was simple, but took some trial and error.

Error

I was following the HBase logs with the following command:

tail -1000f /var/log/hbase/hbase-hbase-master-freshstart.log

The following error kept popping up in the log file.

org.apache.hadoop.hbase.master.AssignmentManager: Unable to find a viable location to assign region -ROOT

Solution

sudo vi /etc/hosts

I changed:


127.0.0.1       localhost
127.0.1.1       freshstart

to:


#127.0.0.1      localhost
#127.0.1.1      freshstart
192.168.2.15   freshstart
127.0.0.1      localhost

192.168.2.15 is my internal IP address, and freshstart is my hostname.
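Before restarting anything, a quick way to confirm that the hostname now resolves to the internal IP rather than the loopback address is the following (substitute your own hostname for freshstart):

getent hosts freshstart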

At this point I rebooted as a quick and dirty way to restart all Hadoop / HBase services. Alternatively, you can start/stop all services.

Additional Thoughts

Updating the hosts file is an option for me currently since I have everything installed on a single machine. However, this error appears to be a name resolution issue, so a properly configured DNS server is likely necessary when deploying Hadoop / HBase in a production cluster.

Introduction

RazorSQL is a GUI tool for working with Postgresql.

Install RazorSQL

First, create the razorsql directory.

mkdir ~/Downloads/razorsql

Download RazorSQL into the ~/Downloads/razorsql directory.

  1. Open a web browser to http://www.razorsql.com/download_linux.html.
  2. Click Download next to Linux (64-bit).

After the download completes, open a command prompt and enter the following commands. I have assumed that you downloaded the .zip file to the Downloads/razorsql directory.

cd ~/Downloads/razorsql
unzip razorsql5_6_4_linux_x64.zip
mkdir -p ~/bin
mv razorsql ~/bin/
cd ~/bin/razorsql
chmod 755 razorsql.sh
./razorsql.sh

Connect to a Database

The following steps are performed in the RazorSQL GUI.

  • From the menu, select Connections, then click Add Connection Profile.
  • In the Connection Wizard, select Postgresql, then click Continue.
    (screenshot: Add Connection Profile 01)
  • On the 2nd screen, enter the connection information for your database.
    (screenshot: Add Connection Profile 02)
  • Click Connect.

Introduction

The default installation of Postgresql does not allow TCP/IP connections, which also means JDBC connections won't work.

Enable TCP/IP Connections

cd /etc/postgresql/9.1/main

Let’s make a backup of the original config file.

sudo cp postgresql.conf postgresql.conf.org
sudo vi postgresql.conf

Find the line that contains

#listen_addresses = 'localhost'

and change it to:

listen_addresses = 'localhost'

If you want to allow remote connections, then you need to set listen_addresses to something like the following (assuming your machine's IP address is 192.168.1.2). Note that the list is comma-separated:

listen_addresses = '127.0.0.1,192.168.1.2'

You will also need to update the firewall to allow connections to this machine.
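For example, if you're using ufw on Ubuntu (not covered in this post), opening the default Postgresql port would look something like:

sudo ufw allow 5432/tcp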

Restart the Postgresql Server

sudo service postgresql restart

Test the TCP/IP Connection

psql -h 127.0.0.1 -p 5432 -U postgres -W

Note that 5432 is the default Postgresql port; if your cluster is configured to listen on a different port, check the port setting in postgresql.conf and use that instead.

Introduction

After installing Postgresql on Ubuntu, you will need to complete a few basic setup steps.

Set the Password for the postgres User

The first step is to update the password for the postgres user.

sudo -u postgres psql postgres

Then enter the following command at the postgres=# prompt.

\password postgres

Create a Database

If you exited the postgres=# prompt, then enter the following command to connect to Postgresql. However, if your prompt still looks like postgres=#, then do not enter the psql command below.

sudo -u postgres psql postgres

Enter the following commands at the postgres=# prompt. The first \l lists all of the databases, which should be: postgres, template0, and template1. We then create a database named practicedb and list the databases again.

\l
CREATE DATABASE practicedb;
\l

Now you should see the practicedb database.
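If you want to try out the new database right away, you can connect to it and list its (currently empty) set of tables; practicedb is just the example name from above:

\c practicedb
\dt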

Press Ctrl + D to exit the Postgres prompt.

Introduction

Postgresql is an open source object relational database. It is often thought of as an alternative to MySQL.

In this post I’ll provide the steps required to install Postgresql on a developer laptop, assuming that work will be done in both SQL and Python.

You can learn more at:

Installation

Open a command prompt and enter the following commands.

sudo apt-get install postgresql postgresql-common postgresql-contrib \
postgresql-client postgresql-client-common \
pgsnap pgadmin3 pgpool2 ptop pgtune pgloader pgagent \
python-pygresql postgresql-plpython-9.1 python-psycopg2

You’ll see a list of additional packages that will be installed by default.

The following extra packages will be installed:
libossp-uuid16 libpgpool0 libpq5 libwxbase2.8-0 libwxgtk2.8-0 pgadmin3-data php5-cli php5-common php5-pgsql postgresql-9.1 postgresql-client-9.1 postgresql-contrib-9.1 python-support

Installed Packages

Each of the installed packages is described below.

postgresql
This is the core Postgresql object relational database server.

postgresql-common
Adds the ability to set up a cluster of Postgresql database servers.

postgresql-client
Command line client for interacting with a Postgresql database, including the psql command.

postgresql-client-common
Allows multiple clients to be installed as part of a Postgresql cluster.

pgsnap
Generates a Postgresql performance report in HTML format.

pgadmin3
GUI tool to work with Postgresql. Personally, I prefer some of the commercial tools, such as Navicat on Linux or EMS SQL Manager on Windows.

pgpool2
Middleware between a Postgresql client and a Postgresql server that provides connection pooling, replication and load balancing, among other things.

ptop
CLI-based performance monitoring tool that's analogous to the Linux top command.

pgtune
Automatically tunes the Postgresql configuration file postgresql.conf based on the system’s hardware.

pgloader
Utility for loading flat files and CSV files into a Postgresql table.

pgagent
Job scheduler for Postgresql.

python-pygresql
Python module that allows you to query a Postgresql database from a python script. Basically, this is for python developers who need to query Postgresql.

postgresql-plpython-9.1
Allows SQL developers to extend their SQL script by writing procedural functions in python.

python-psycopg2
Similar to python-pygresql, except that python-psycopg2 is designed for heavily threaded python scripts that create and destroy a large number of cursors, and execute a high volume of INSERTs and UPDATEs.
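A quick way to check that the Python modules installed correctly is to import them from the command line (assuming Python 2, which is the default on Ubuntu 12.04):

python -c "import psycopg2, pgdb; print psycopg2.__version__"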

I was going through my old notes as I'm cleaning out my closet. I came across the following notes, which I like a lot, on how to write a professional email.

  1. Use email macros both to improve your efficiency in replying to email and to improve quality.
  2. End each sentence with the correct punctuation.
  3. Write short sentences.
  4. Each sentence should discuss only one topic.
  5. Use bulleted lists to simplify the information you are trying to convey.
    1. Bulleted lists are easy to read.
    2. Each point is separated from other points.
    3. Lists help to organize thoughts.
  6. Use the following correctly:
    1. Capitalization
    2. Spacing
    3. Formatting
  7. Do not use SMS fake words (ur, dk, lol, how ru, etc.).
  8. Make the Subject descriptive.
  9. Limit the email to the subject matter described in the Subject.
  10. Attach files before writing the email. Nobody likes a forgotten attachment.

There are more, but I would put these as my top 10 list.

Introduction

We're going to start with a very simple Pig script that reads a file containing 2 numbers per line, separated by a tab. The Pig script will read each line, store the 2 numbers in separate variables, and then add the numbers together.

Create the Sample Input File

cd
vi pig-practice01.txt

Paste the following into pig-practice01.txt.

5	1
6	4
3	2
1	1
9	2
3	8

Create the Input and Output Directories in HDFS

We’re going to create 2 directories to store the input to and output from our first pig script.

hadoop fs -mkdir pig01-input
hadoop fs -mkdir pig01-output

Put Data File into HDFS

hadoop fs -put pig-practice01.txt pig01-input

Now, let’s check that our file was put from our local file system to HDFS correctly.

hadoop fs -ls pig01-input
hadoop fs -cat pig01-input/pig-practice01.txt

Write the Pig Latin Script

vi practice01.pig

Paste the following code into practice01.pig.

/*
Add 2 numbers together
*/

-- Load the practice file from HDFS
A = LOAD 'pig01-input/pig-practice01.txt' USING PigStorage() AS (x:int, y:int);

-- Add x and y 
B = FOREACH A GENERATE x + y;

-- Store the output in HDFS
STORE B INTO 'pig01-output/results' USING PigStorage();
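If you just want to see the results on screen while you're developing, you can replace the STORE line with a DUMP, which is standard Pig Latin:

DUMP B;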

Run the Pig Script

pig practice01.pig

View the Results

hadoop fs -ls pig01-output/results

The results are stored in the part* file.

hadoop fs -cat pig01-output/results/part*
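If you'd rather pull the results down to your local file system, getmerge will concatenate the part files into a single local file (results-local.txt is just an example name):

hadoop fs -getmerge pig01-output/results results-local.txt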


Introduction

Installing Pig is drop dead simple.

Installation

sudo apt-get install pig

Check the Pig version.

pig --version

Setup the Environment

We’re going to set the environment variables system-wide for Pig programming.

sudo vi /etc/environment

Paste the following environment variables into the environment file.

HADOOP_MAPRED_HOME="/usr/lib/hadoop-mapreduce"
PIG_CONF_DIR="/etc/pig/conf"

Save the file, then load the new variables into your current shell.

source /etc/environment
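You can quickly confirm that the variables are set in your current shell:

echo $HADOOP_MAPRED_HOME $PIG_CONF_DIR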

That’s it. You can now start to write and run pig jobs.
