AkbarAhmed.com

Engineering Leadership

Introduction

Installing the MySQL package on Ubuntu is extremely simple.

Installation

Open a terminal and enter the following commands.

sudo apt-get install mysql-client mysql-navigator mysql-server

Type Y to accept the additional packages. Press Enter.

During installation, after the packages have downloaded, the MySQL configuration dialogs will appear in the terminal.

In the first dialog, press Enter.

Enter a password for the MySQL root user. Press Enter.

Reenter the root password. Press Enter.

That’s it: MySQL is now installed and ready for use.
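
As a quick sanity check, you can confirm that the server is running and that the root login works. This is a minimal verification sketch, not part of the original steps; it assumes Ubuntu’s default mysql service name.

sudo service mysql status
mysql -u root -p -e 'SELECT VERSION();'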

Introduction

If you are running Hadoop on a development machine, then it’s likely that you’ll run into a situation where multiple services require port 8080. I recently ran into this issue where both the Pentaho User Console and the Hadoop MapReduce ShuffleHandler were trying to use port 8080.

One solution is to change the port used by the Hadoop MapReduce ShuffleHandler, which is what I’m going to configure below.

Configuration

sudo vi /etc/hadoop/conf/mapred-site.xml

Add the following property just before the closing </configuration> element. Note that the stock value is 8080, which is exactly the conflict we’re trying to resolve, so set the value to a port that is free on your machine (I’m using 8081 in this example).

  <property>
    <name>mapreduce.shuffle.port</name>
    <value>8081</value>
    <description>Port that the ShuffleHandler will run on. ShuffleHandler is a service run at the NodeManager to facilitate transfers of intermediate Map outputs to requesting Reducers.</description>
  </property>

Then restart the YARN daemons.
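
In a CDH4 pseudo-distributed installation like the one described later in this guide, that means restarting the ResourceManager and NodeManager services:

sudo service hadoop-yarn-resourcemanager restart
sudo service hadoop-yarn-nodemanager restart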

Introduction

Sqoop is a tool to import data from a SQL database into Hadoop and to export data from Hadoop into a SQL database.

Sqoop can import to and export from HDFS, HBase, and Hive.

It’s extremely common to use SQL databases as part of the setup for Hadoop. Often, a SQL database will serve both as an upstream datasource, such as a persistence layer for an MQ server, and as a downstream repository, such as a datamart in a BI reporting layer.

Installation

First, we’re going to install MapReduce 1 (MRv1) and the Hadoop client, as these are dependencies for Sqoop.

After these two packages are installed, we will need to verify that MRv2 is running, and not MRv1.

sudo apt-get install hadoop-client hadoop-0.20-mapreduce
sudo apt-get install sqoop

The Sqoop configuration files are installed into /etc/sqoop/conf, which is a symlink to /etc/sqoop/conf.dist.

To use Sqoop with YARN (MRv2) we need to verify that the HADOOP_MAPRED_HOME environment variable is set to the correct path.

There are 3 places where we should verify this variable.

grep HADOOP_MAPRED_HOME /etc/default/hadoop

The output should be:
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

grep HADOOP_MAPRED_HOME /etc/default/hadoop-mapreduce-historyserver

The output should be:
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

grep HADOOP_MAPRED_HOME /etc/hadoop/conf/hadoop-env.sh

The output should be:
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

Lastly, I recommend setting HADOOP_MAPRED_HOME as a system-wide environment variable to ease development for the software engineers (assuming you’re only going to use YARN; if you’re using MRv1, don’t set this variable).

sudo bash -c 'echo HADOOP_MAPRED_HOME="/usr/lib/hadoop-mapreduce" >> /etc/environment'
source /etc/environment
echo $HADOOP_MAPRED_HOME

Finally, we’ll verify that the environment variable is correctly set for the sudo user.

sudo env | grep HADOOP_MAPRED_HOME

Verify the Sqoop installation

sqoop version

The output should include:
Sqoop 1.4.1-cdh4.0.0
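
With the installation verified, a typical first job is an import from MySQL into HDFS. The following is an illustrative sketch only: the database (testdb) and table (customers) are hypothetical placeholders, and it assumes the MySQL JDBC driver jar has been copied into Sqoop’s lib directory.

sqoop import \
--connect jdbc:mysql://localhost/testdb \
--username root -P \
--table customers \
--target-dir /user/akbar/customers

The -P flag prompts for the database password instead of placing it on the command line.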

Additional Reading

http://archive.cloudera.com/cdh4/cdh/4/sqoop/SqoopUserGuide.html

Introduction

These instructions cover a manual installation of the Cloudera CDH4 packages on Ubuntu 12.04 LTS and are based on my following the Cloudera CDH4 Quick Start Guide (CDH4_Quick_Start_Guide_4.0.0.pdf).

Installation prerequisites

sudo apt-get install curl

Verify that Java is installed correctly

First, check that Java is setup correctly for your account.

echo $JAVA_HOME

The output should be:
/usr/lib/jvm/jdk1.6.0_31

Next, check that the JAVA_HOME environment variable is setup correctly for the sudo user.

sudo env | grep JAVA_HOME

The output should be:
JAVA_HOME=/usr/lib/jvm/jdk1.6.0_31

Download the CDH4 package

cd ~/Downloads
mkdir cloudera
cd cloudera


wget http://archive.cloudera.com/cdh4/one-click-install/precise/amd64/cdh4-repository_1.0_all.deb

Install the CDH4 package

sudo dpkg -i cdh4-repository_1.0_all.deb

Install the Cloudera Public GPG Key

curl -s \
http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key \
| sudo apt-key add -
sudo apt-get update

Install CDH4 with YARN in Pseudo mode

sudo apt-get install hadoop-conf-pseudo

This one command will install a large number of packages:

  • bigtop-jsvc
  • bigtop-utils
  • hadoop
  • hadoop-conf-pseudo
  • hadoop-hdfs
  • hadoop-hdfs-datanode
  • hadoop-hdfs-namenode
  • hadoop-hdfs-secondarynamenode
  • hadoop-mapreduce
  • hadoop-mapreduce-historyserver
  • hadoop-yarn
  • hadoop-yarn-nodemanager
  • hadoop-yarn-resourcemanager
  • zookeeper

View the installed files

It’s good practice to view the list of files installed by each package. Specifically, this is a good method to learn about all of the available configuration files.

dpkg -L hadoop-conf-pseudo

Included in the list of files displayed by dpkg are the configuration files (and some other files):

/etc/hadoop/conf.pseudo/yarn-site.xml
/etc/hadoop/conf.pseudo/log4j.properties
/etc/hadoop/conf.pseudo/hdfs-site.xml
/etc/hadoop/conf.pseudo/hadoop-metrics.properties
/etc/hadoop/conf.pseudo/mapred-site.xml
/etc/hadoop/conf.pseudo/README
/etc/hadoop/conf.pseudo/hadoop-env.sh
/etc/hadoop/conf.pseudo/core-site.xml

Format the HDFS filesystem

sudo -u hdfs hdfs namenode -format

I received one warning message when I formatted the HDFS filesystem:
WARN common.Util: Path /var/lib/hadoop-hdfs/cache/hdfs/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.

Start HDFS

cd
mkdir -p bin
cd bin
vi hadoop-hdfs-start

Paste the following code into hadoop-hdfs-start:

#!/bin/bash
for service in /etc/init.d/hadoop-hdfs-*
do
  sudo $service start
done

Make the script executable, then run it:

chmod +x hadoop-hdfs-start
./hadoop-hdfs-start
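
A matching stop script is useful when you need to shut HDFS down cleanly. It simply mirrors the start script above (it is not something shipped with CDH4); save it as hadoop-hdfs-stop in ~/bin and chmod +x it as before.

#!/bin/bash
for service in /etc/init.d/hadoop-hdfs-*
do
  sudo $service stop
done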

Open the NameNode web console at http://localhost:50070.

About the HDFS filesystem

The commands in the sections below are for creating directories in the HDFS filesystem. Importantly, the HDFS directory structure is not the same as the directory structure in ext4 (i.e. your main Linux directory structure).

To work with the HDFS directory structure, you basically prefix standard Linux file commands with sudo -u hdfs hadoop fs (e.g. sudo -u hdfs hadoop fs -ls). Therefore, you will likely find it useful to create a .bash_aliases file that provides an easier way to type these commands.

I have created a sample .bash_aliases file in the following post: Create a .bash_aliases file

I have used my aliases to set up Hadoop as they’re easier to type. For example, I use shls instead of sudo -u hdfs hadoop fs -ls.
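
For instance, once the aliases are loaded, the following two commands are equivalent:

sudo -u hdfs hadoop fs -ls /
shls /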

Create the HDFS /tmp directory

You don’t need to remove an old /tmp directory if this is the first time you’re installing Hadoop, but I’ll include the command here for completeness.

sudo -u hdfs hadoop fs -rm -r /tmp

Let’s create a new /tmp directory in HDFS.

shmkdir /tmp

Next, we’ll update the permissions on /tmp in HDFS. Mode 1777 makes the directory world-writable with the sticky bit set, so any user can create files but only the owner can delete them.

shchmod -R 1777 /tmp

Create a user directory

Since this is a setup for development, we will only create one user directory. However, for a cluster or multi-user environment, you should create one user directory per MapReduce user.

Change akbar below to your username.

shmkdir /user/akbar
shchown akbar:akbar /user/akbar

Create the /var/log/hadoop-yarn directory

shmkdir /var/log/hadoop-yarn
shchown yarn:mapred /var/log/hadoop-yarn

Create the staging directory

The Hadoop fs -mkdir command creates missing parent directories by default (the equivalent of mkdir -p).

shmkdir /tmp/hadoop-yarn/staging
shchmod -R 1777 /tmp/hadoop-yarn/staging

Create the done_intermediate directory

shmkdir /tmp/hadoop-yarn/staging/history/done_intermediate
shchmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate
shchown -R mapred:mapred /tmp/hadoop-yarn/staging

Verify that the directory structure is setup correctly

shls -R /

The output should look like:

drwxrwxrwt - hdfs supergroup 0 2012-06-25 15:11 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-06-25 15:11 /tmp/hadoop-yarn
drwxrwxrwt - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging
drwxr-xr-x - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging/history
drwxrwxrwt - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x - hdfs supergroup 0 2012-06-25 15:09 /user
drwxr-xr-x - akbar akbar 0 2012-06-25 15:09 /user/akbar
drwxr-xr-x - hdfs supergroup 0 2012-06-25 13:42 /var
drwxr-xr-x - hdfs supergroup 0 2012-06-25 13:42 /var/log
drwxr-xr-x - yarn mapred 0 2012-06-25 13:42 /var/log/hadoop-yarn

Start YARN

sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-nodemanager start
sudo service hadoop-mapreduce-historyserver start

I get the following error message when I start the MR History Server:
chown: changing ownership of `/var/log/hadoop-mapreduce': Operation not permitted

However, this error is not significant and can be ignored. It’ll likely be fixed in an update to CDH4/Hadoop.

Run an example application with YARN

Important
We are going to run the sample YARN app as our regular user, so we’ll use the Hadoop aliases, such as hls, and not the sudo Hadoop aliases, such as shls.

In the first command, we’ll enter just a directory name of input. However, you’ll notice that the directory is automatically created under our HDFS user directory, as /user/akbar/input, because relative paths are resolved against /user/<username>.

hmkdir input

Let’s view our new directory in HDFS.

hls

Or, you can optionally specify the complete path.

hls /user/akbar

Next, we’ll put some files into the HDFS /user/akbar/input directory.

hadoop fs -put /etc/hadoop/conf/*.xml input

Set the HADOOP_MAPRED_HOME environment variable for the current user in our current session.

export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

Next, we’ll run the sample job, which is a simple grep of the files in the input directory, writing its results to the output directory. It’s worth noting that the .jar file is located on the physical filesystem, while the input and output directories are in the HDFS filesystem.

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
grep input output 'dfs[a-z.]+'
hls
hls output
hcat output/part-r-00000

If you have a longer file to view, piping into less will prove helpful, such as:

hcat output/part-r-00000 | less
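
One gotcha worth knowing (standard MapReduce behavior, not specific to this setup): the job will fail if the output directory already exists, so remove it before re-running the example.

hadoop fs -rm -r output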

How to start the various Hadoop services

The following are some additional notes I took on starting the various Hadoop services (in the correct order).

Start the Hadoop namenode

sudo service hadoop-hdfs-namenode start

Start the Hadoop datanode service

sudo service hadoop-hdfs-datanode start

Start the Hadoop secondarynamenode service

sudo service hadoop-hdfs-secondarynamenode start

Start the Hadoop resourcemanager service

sudo service hadoop-yarn-resourcemanager start

Start the Hadoop nodemanager service

sudo service hadoop-yarn-nodemanager start

Start the Hadoop historyserver service

sudo service hadoop-mapreduce-historyserver start

Introduction

This is my personal .bash_aliases file that is mainly used for Cloudera CDH4 (Hadoop) and Pentaho. As a result, many of my aliases are specific to these software packages.

I plan to update this post as my .bash_aliases file expands. I will also push my .bash_aliases file into Git to make it easier to keep up with changes to the file.

How to create a .bash_aliases file

vi ~/.bash_aliases

Paste the following into the file.


# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Personal: ~/.bash_aliases
# Akbar S. Ahmed
#
# Last modified: 2012.06.25
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# -----------------------------------------------
# General
# -----------------------------------------------

alias c='clear' # Clear the screen
alias df='df -Th' # Disk free space
alias du='du -h' # Disk usage
alias h='history' # Bash history
alias j='jobs -l' # Current running jobs

# -----------------------------------------------
# ls
# -----------------------------------------------

alias lx='ls -lXB' # Sort by extension
alias lk='ls -lSr' # Sort by size (small to big)
alias lc='ls -ltcr' # Sort by change time (old to new)
alias lu='ls -ltur' # Sort by access time (old to new)
alias lt='ls -ltr' # Sort by date (old to new)

# -----------------------------------------------
# Hadoop Admin (sudo)
# -----------------------------------------------

alias shcat='sudo -u hdfs hadoop fs -cat' # Output a file to standard out
alias shchown='sudo -u hdfs hadoop fs -chown' # Change ownership
alias shchmod='sudo -u hdfs hadoop fs -chmod' # Change permissions
alias shls='sudo -u hdfs hadoop fs -ls' # List files
alias shmkdir='sudo -u hdfs hadoop fs -mkdir' # Make a directory

# -----------------------------------------------
# Hadoop (regular user)
# -----------------------------------------------

alias hcat='hadoop fs -cat' # Output a file to standard out
alias hchown='hadoop fs -chown' # Change ownership
alias hchmod='hadoop fs -chmod' # Change permissions
alias hls='hadoop fs -ls' # List files
alias hmkdir='hadoop fs -mkdir' # Make a directory

Save the file, then load the aliases into your current shell:

source ~/.bash_aliases

Introduction

Hopefully you won’t need these instructions due to a botched install, but there may come a time where you need to uninstall a version of the JDK/JVM.

These instructions are for the Oracle JDK 1.7.0 Update 4 on Ubuntu 12.04 LTS. If you are using a different version of the JDK, then change the version numbers listed below.

I have also included instructions for removing the OpenJDK at the bottom of this post.

Uninstall Java

Let’s check the current setup before we uninstall Java.

sudo update-alternatives --display java

The output from the command will be something like:

java - manual mode
link currently points to /usr/lib/jvm/jdk1.7.0_04/bin/java
/usr/lib/jvm/jdk1.7.0_04/bin/java - priority 1
/usr/lib/jvm/j2sdk1.6-oracle/jre/bin/java - priority 315
slave java.1.gz: /usr/lib/jvm/j2sdk1.6-oracle/man/man1/java.1.gz
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java - priority 1051
slave java.1.gz: /usr/lib/jvm/java-7-openjdk-amd64/jre/man/man1/java.1.gz
Current 'best' version is '/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java'.

Next, we’ll remove each symlink to a Java binary from the Debian alternatives system. I have split the following commands onto multiple lines to ensure that they display correctly on this page. However, you can remove the \ and then type each command on one line in the terminal.

sudo update-alternatives --remove "java" \
"/usr/lib/jvm/jdk1.7.0_04/bin/java"
sudo update-alternatives --remove "javac" \
"/usr/lib/jvm/jdk1.7.0_04/bin/javac"
sudo update-alternatives --remove "javaws" \
"/usr/lib/jvm/jdk1.7.0_04/bin/javaws"

Let’s quickly verify that the commands above remove the symlinks.

java -version
javac -version
which javaws

You should no longer see 1.7.0 u04 for the version of any of the above commands.

IMPORTANT WARNING
You must type the next 2 commands perfectly to avoid permanently destroying your system. If you do this wrong, you could delete important system files, including those that are required by Ubuntu.

cd /usr/lib/jvm
sudo rm -rf jdk1.7.0_04
sudo update-alternatives --config java

Output:
update-alternatives: error: no alternatives for java.

sudo update-alternatives --config javac

Output:
update-alternatives: error: no alternatives for javac.

sudo update-alternatives --config javaws

Output:
update-alternatives: error: no alternatives for javaws.

sudo vi /etc/environment

Delete the line that sets JAVA_HOME.
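
If you’d rather not open an editor, a one-line alternative (assuming JAVA_HOME appears on only one line of the file) is:

sudo sed -i '/JAVA_HOME/d' /etc/environment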

Uninstall OpenJDK (if installed)

First we’ll check which OpenJDK packages are installed.

sudo dpkg --list | grep -i jdk

Next, let’s uninstall OpenJDK related packages. Edit the list of packages below if you have additional or different OpenJDK packages installed.

sudo apt-get purge icedtea-* openjdk-*

Let’s check that all OpenJDK packages have been removed.

sudo dpkg --list | grep -i jdk

Introduction

The first question is: why install an old JDK? The answer is that Oracle JDK 6.0 update 31 is the JDK recommended by Cloudera when installing CDH4 (Cloudera Distribution Hadoop v4).

This is an update to an older version of this post. Mainly I have changed the JDK from 1.6.0_26 to 1.6.0_31 as this is the recommended JDK for CDH4.

Install Java

I have a 64 bit version of Ubuntu 12.04 LTS installed, so the instructions below only apply to this OS.

  1. Download the Java JDK from http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html#jdk-6u31-oth-JPR.
  2. Click Accept License Agreement
  3. Click jdk-6u31-linux-x64.bin
  4. Login to Oracle.com with your Oracle account
  5. Download the JDK to your ~/Downloads directory
  6. After downloading, open a terminal, then enter the following commands.
cd ~/Downloads
chmod +x jdk-6u31-linux-x64.bin
./jdk-6u31-linux-x64.bin

Note:
The jvm directory is used to organize all JDK/JVM versions in a single parent directory.

sudo mkdir /usr/lib/jvm
sudo mv jdk1.6.0_31 /usr/lib/jvm

The next 3 commands are split across 2 lines per command due to width limits in the blog’s theme.

sudo update-alternatives --install "/usr/bin/java" "java" \
"/usr/lib/jvm/jdk1.6.0_31/bin/java" 1
sudo update-alternatives --install "/usr/bin/javac" "javac" \
"/usr/lib/jvm/jdk1.6.0_31/bin/javac" 1
sudo update-alternatives --install "/usr/bin/javaws" "javaws" \
"/usr/lib/jvm/jdk1.6.0_31/bin/javaws" 1
sudo update-alternatives --config java

You will see output similar to the following (although it’ll differ on your system). Read through the list and find the number for the Oracle JDK installation (/usr/lib/jvm/jdk1.6.0_31/bin/java).

There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                            Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java   1051      auto mode
  1            /usr/lib/jvm/jdk1.6.0_31/bin/java                1         manual mode
  2            /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java   1051      manual mode

On my system I did the following (change the number that is appropriate for your system):
Press 1 on your keyboard, then press Enter.

sudo update-alternatives --config javac

Follow steps similar to those listed above if you are presented with a list of options. In my case, I had not previously installed the OpenJDK javac binary, so my output looked like the following:

There is only one alternative in link group javac: /usr/lib/jvm/jdk1.6.0_31/bin/javac
Nothing to configure.

sudo update-alternatives --config javaws

As with javac, I did not have the OpenJDK version of javaws installed, so my output was simple. However, if you get a list of options, just type in the number of the path to the Oracle javaws command, and press Enter.

There is only one alternative in link group javaws: /usr/lib/jvm/jdk1.6.0_31/bin/javaws
Nothing to configure.

As a final step, let’s test each of the commands to ensure everything is setup correctly.

java -version

The output should be:
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)

javac -version

The output should be:
javac 1.6.0_31

javaws -version

The output should be:
Java(TM) Web Start 1.6.0_31
which is followed by a long usage message.

Create the JAVA_HOME environment variable

Open a terminal, then enter the following commands:

sudo vi /etc/environment

WARNING
WordPress displays the quotes around the JAVA_HOME value below as curly (smart) quotes. This will cause problems when you try to use your JVM in certain applications.

Do not copy/paste the JAVA_HOME value below. Or if you do, ensure that you change the curly quotes to straight quotes in your editor.

Enter the following at the bottom of the file:
JAVA_HOME="/usr/lib/jvm/jdk1.6.0_31"

Type the following commands to finish the setup and verify that everything is setup correctly.

source /etc/environment
echo $JAVA_HOME

You should see the following output:

/usr/lib/jvm/jdk1.6.0_31

Lastly, verify that JAVA_HOME is set correctly for the sudo user:

sudo env | grep JAVA_HOME

That’s it: JDK 6.0 update 31 is installed.

Introduction

Kettle is Pentaho’s ETL tool, which is also called Pentaho Data Integration (PDI).

Installing Kettle is extremely simple.

Install Java

Follow the JDK installation instructions that are listed in the following post: Install Java JDK 6.0 update 31 on Ubuntu 12.04 LTS

Download

To download Kettle, run the following command:

wget http://downloads.sourceforge.net/project/pentaho/Data%20Integration/4.3.0-stable/pdi-ce-4.3.0-stable.tar.gz

Installation

Next, open a terminal and enter the following commands:

cd ~/Downloads
tar -xzf pdi-ce-4.3.0-stable.tar.gz
mv data-integration ~/bin/pdi-ce-4.3.0
cd ~/bin
ln -s pdi-ce-4.3.0 data-integration
cd ~/bin/data-integration

To run Spoon:

./spoon.sh
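
Spoon is the graphical designer; PDI also ships two command-line tools in the same directory: pan.sh runs transformations and kitchen.sh runs jobs. Running either with no arguments prints its usage, which is a quick sanity check of my own rather than an official verification step:

./pan.sh
./kitchen.sh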

Additional Reading

There is a lot of good documentation installed with PDI.

cd ~/bin/data-integration/docs/English

Open your favorite PDF viewer, or type:

evince getting_started_with_pdi.pdf

Introduction

The instructions below assume that you’ve followed my prior blog posts on how to install Pentaho on Ubuntu Linux 12.04 LTS x64.

Download

To download the Pentaho Metadata Editor (PME), run the following command:

wget http://downloads.sourceforge.net/project/pentaho/Pentaho%20Metadata/4.5.0-stable/pme-ce-4.5.0-stable.tar.gz

Installation

Enter the following commands:

cd ~/Downloads
tar -xzf pme-ce-4.5.0-stable.tar.gz
mv metadata-editor ~/bin/pme-ce-4.5.0
cd ~/bin
ln -s pme-ce-4.5.0 metadata-editor
vi ~/.profile

Append the following to the end of your PATH:

:$HOME/bin/metadata-editor

For example, my PATH now looks like:
PATH="$HOME/bin:$PATH:$HOME/bin/design-studio:$HOME/bin/metadata-editor"

Save the file, then enter the following commands:

source ~/.profile
cd ~/bin/metadata-editor
ln -s metadata-editor.sh pme

Start the Pentaho Metadata Editor with the following command:

pme

Note:
You may get an error message such as:

ERROR 24-05 21:45:04,618 - Pentaho Metadata Editor - Unable to load query : java.io.FileNotFoundException: /home/akbar/.pentaho-meta/.query (No such file or directory)

You can safely ignore this error message for now per wgorman in this post:
http://forums.pentaho.com/showthread.php?72313-Error-Messgae-Pentaho-Metadata-Editor-Unable-to-load-query

“Regarding the .query file not being found, you can safely ignore that message. The MQL Query Editor looks for that file for a saved state.” (Will)