Install Cloudera Hadoop (CDH4) with YARN (MRv2) in Pseudo mode on Ubuntu 12.04 LTS

Introduction

These instructions cover a manual installation of the Cloudera CDH4 packages on Ubuntu 12.04 LTS. They are based on my walkthrough of the Cloudera CDH4 Quick Start Guide (CDH4_Quick_Start_Guide_4.0.0.pdf).

Installation prerequisites

sudo apt-get install curl

Verify that Java is installed correctly

First, check that Java is set up correctly for your account.

echo $JAVA_HOME

The output should be:
"/usr/lib/jvm/jdk1.6.0_31"

Next, check that the JAVA_HOME environment variable is set up correctly for the sudo user.

sudo env | grep JAVA_HOME

The output should be:
JAVA_HOME="/usr/lib/jvm/jdk1.6.0_31"
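If the sudo check prints nothing, sudo is most likely stripping the variable from its environment. The sketch below is one way to make JAVA_HOME visible to sudo; it assumes the Oracle JDK lives at /usr/lib/jvm/jdk1.6.0_31, so adjust the path to match your install, and note that /etc/environment changes only take effect after you log out and back in.

# Assumption: the JDK is installed at /usr/lib/jvm/jdk1.6.0_31.
echo 'JAVA_HOME="/usr/lib/jvm/jdk1.6.0_31"' | sudo tee -a /etc/environment

# Allow sudo to pass JAVA_HOME through to the commands it runs.
echo 'Defaults env_keep += "JAVA_HOME"' | sudo tee /etc/sudoers.d/java_home
sudo chmod 0440 /etc/sudoers.d/java_home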

Download the CDH4 package

cd ~/Downloads
mkdir cloudera
cd cloudera

The download URL below is long, so copy it carefully and make sure it is pasted into the terminal as a single line.

wget http://archive.cloudera.com/cdh4/one-click-install/precise/amd64/cdh4-repository_1.0_all.deb

Install the CDH4 package

sudo dpkg -i cdh4-repository_1.0_all.deb

Install the Cloudera Public GPG Key

curl -s \
http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key \
| sudo apt-key add -
sudo apt-get update

Install CDH4 with YARN in Pseudo mode

sudo apt-get install hadoop-conf-pseudo

This one command will install a large number of packages:

  • bigtop-jsvc
  • bigtop-utils
  • hadoop
  • hadoop-conf-pseudo
  • hadoop-hdfs
  • hadoop-hdfs-datanode
  • hadoop-hdfs-namenode
  • hadoop-hdfs-secondarynamenode
  • hadoop-mapreduce
  • hadoop-mapreduce-historyserver
  • hadoop-yarn
  • hadoop-yarn-nodemanager
  • hadoop-yarn-resourcemanager
  • zookeeper

View the installed files

It’s good practice to view the list of files installed by each package. Specifically, this is a good method to learn about all of the available configuration files.

dpkg -L hadoop-conf-pseudo

Included in the list of files displayed by dpkg are the configuration files (and some other files):

/etc/hadoop/conf.pseudo/yarn-site.xml
/etc/hadoop/conf.pseudo/log4j.properties
/etc/hadoop/conf.pseudo/hdfs-site.xml
/etc/hadoop/conf.pseudo/hadoop-metrics.properties
/etc/hadoop/conf.pseudo/mapred-site.xml
/etc/hadoop/conf.pseudo/README
/etc/hadoop/conf.pseudo/hadoop-env.sh
/etc/hadoop/conf.pseudo/core-site.xml

Format the HDFS filesystem

sudo -u hdfs hdfs namenode -format

I received one warning message when I formatted the HDFS filesystem:
WARN common.Util: Path /var/lib/hadoop-hdfs/cache/hdfs/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
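The warning only means that the NameNode storage directory is given as a bare path rather than a URI. If you want to silence it (optional), you can add the file:// scheme to the value in /etc/hadoop/conf/hdfs-site.xml. The snippet below is a sketch of that change; the exact property name and path may differ in your version of the config, so check the file first.

<!-- Possible edit to /etc/hadoop/conf/hdfs-site.xml: express the path as a URI. -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///var/lib/hadoop-hdfs/cache/hdfs/dfs/name</value>
</property>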

Start HDFS

cd
mkdir bin
cd bin
vi hadoop-hdfs-start

Paste the following code in hadoop-hdfs-start:

#!/bin/bash
# Start every installed Hadoop HDFS service (datanode, namenode, secondarynamenode).
for service in /etc/init.d/hadoop-hdfs-*
do
  sudo $service start
done

chmod +x hadoop-hdfs-start
./hadoop-hdfs-start

Open the NameNode web console at http://localhost:50070.
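If you are working on a headless machine, you can confirm that HDFS is up without a browser. The commands below are optional sanity checks rather than part of the official steps.

curl -s http://localhost:50070/ | head -n 5
sudo -u hdfs hdfs dfsadmin -report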

About the HDFS filesystem

The commands in the sections below are for creating directories in the HDFS filesystem. Importantly, the HDFS directory structure is not the same as the directory structure in ext4 (i.e. your main Linux directory structure).

To work with the HDFS directory structure, you essentially prefix the familiar Linux filesystem commands with sudo -u hdfs hadoop fs (for example, sudo -u hdfs hadoop fs -ls instead of ls). Therefore, you will likely find it useful to create a .bash_aliases file that provides an easier way to type these commands.

I have created a sample .bash_aliases file in the following post: Create a .bash_aliases file

I have used my aliases throughout the setup below as they’re easier to type. For example, I use shls instead of sudo -u hdfs hadoop fs -ls.
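For reference, the aliases used in the rest of this post look roughly like the following. This is only a sketch of my .bash_aliases; the names are my own convention, so adjust them to taste.

# HDFS commands run as the hdfs superuser (sh prefix = "sudo hdfs").
alias shls='sudo -u hdfs hadoop fs -ls'
alias shmkdir='sudo -u hdfs hadoop fs -mkdir'
alias shchmod='sudo -u hdfs hadoop fs -chmod'
alias shchown='sudo -u hdfs hadoop fs -chown'

# HDFS commands run as the current user (h prefix).
alias hls='hadoop fs -ls'
alias hmkdir='hadoop fs -mkdir'
alias hcat='hadoop fs -cat'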

Create the HDFS /tmp directory

You don’t need to remove an old /tmp directory if this is the first time you’re installing Hadoop, but I’ll include the command here for completeness.

sudo -u hdfs hadoop fs -rm -r /tmp

Let’s create a new /tmp directory in HDFS.

shmkdir /tmp

Next, we’ll update the permissions on /tmp in HDFS.

shchmod -R 1777 /tmp

Create a user directory

Since this is a setup for development, we will only create one user directory. However, for a cluster or multi-user environment, you should create one user directory per MapReduce user.

Change akbar below to your username.

shmkdir /user/akbar
shchown akbar:akbar /user/akbar

Create the /var/log/hadoop-yarn directory

shmkdir /var/log/hadoop-yarn
shchown yarn:mapred /var/log/hadoop-yarn

Create the staging directory

The Hadoop -mkdir command defaults to -p (create parent directories).

shmkdir /tmp/hadoop-yarn/staging
shchmod -R 1777 /tmp/hadoop-yarn/staging

Create the done_intermediate directory

shmkdir /tmp/hadoop-yarn/staging/history/done_intermediate
shchmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate
shchown -R mapred:mapred /tmp/hadoop-yarn/staging

Verify that the directory structure is set up correctly

shls -R /

The output should look like:

drwxrwxrwt - hdfs supergroup 0 2012-06-25 15:11 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-06-25 15:11 /tmp/hadoop-yarn
drwxrwxrwt - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging
drwxr-xr-x - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging/history
drwxrwxrwt - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x - hdfs supergroup 0 2012-06-25 15:09 /user
drwxr-xr-x - akbar akbar 0 2012-06-25 15:09 /user/akbar
drwxr-xr-x - hdfs supergroup 0 2012-06-25 13:42 /var
drwxr-xr-x - hdfs supergroup 0 2012-06-25 13:42 /var/log
drwxr-xr-x - yarn mapred 0 2012-06-25 13:42 /var/log/hadoop-yarn

Start YARN

sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-nodemanager start
sudo service hadoop-mapreduce-historyserver start

I get the following error message when I start the MR History Server:
chown: changing ownership of `/var/log/hadoop-mapreduce': Operation not permitted

However, this error is not significant and can be ignored. It’ll likely be fixed in an update to CDH4/Hadoop.
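If you prefer a clean start anyway, one possible workaround (a sketch based on the error message, not an official fix; the expected owner and group may differ on your system) is to create the local log directory with the right owner before starting the history server.

# Pre-create the local log directory so the init script's chown does not fail.
sudo mkdir -p /var/log/hadoop-mapreduce
sudo chown mapred:mapred /var/log/hadoop-mapreduce
sudo service hadoop-mapreduce-historyserver restart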

Run an example application with YARN

Important
We are going to run the sample YARN app as our regular user, so we’ll use the Hadoop aliases, such as hls, and not the sudo Hadoop aliases, such as shls.

In the first command, we pass just the directory name input. Because the path is relative, HDFS creates the directory inside our user directory, i.e. /user/akbar/input.

hmkdir input

Let’s view our new directory in HDFS.

hls

Or, you can optionally specify the complete path.

hls /user/akbar

Next, we’ll put some files into the HDFS /user/akbar/input directory.

hadoop fs -put /etc/hadoop/conf/*.xml input

Set the HADOOP_MAPRED_HOME environment variable for the current user in our current session.

export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
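The export above only lasts for the current shell session. If you want it to persist across new terminals (optional, and assuming the same install path), append it to your ~/.bashrc:

echo 'export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce' >> ~/.bashrc
source ~/.bashrc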

Next, we’ll run the sample job, which does a simple grep over the files in the input directory and writes the results to the output directory. It’s worth noting that the .jar file lives on the local (physical) filesystem, while the input and output directories are in the HDFS filesystem.

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
grep input output 'dfs[a-z.]+'
hls
hls output
hcat output/part-r-00000

If you have a longer file to view, piping into less will prove helpful, such as:

hcat output/part-r-00000 | less

How to start the various Hadoop services

The following are some additional notes I took on starting the various Hadoop services (in the correct order).

Start the Hadoop namenode

sudo service hadoop-hdfs-namenode start

Start the Hadoop datanode service

sudo service hadoop-hdfs-datanode start

Start the Hadoop secondarynamenode service

sudo service hadoop-hdfs-secondarynamenode start

Start the Hadoop resourcemanager service

sudo service hadoop-yarn-resourcemanager start

Start the Hadoop nodemanager service

sudo service hadoop-yarn-nodemanager start

Start the Hadoop historyserver service

sudo service hadoop-mapreduce-historyserver start
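If you want to bring everything up in one go, a small wrapper script in the same spirit as hadoop-hdfs-start works well. The sketch below simply follows the ordering above (HDFS first, then YARN, then the history server).

#!/bin/bash
# Start all pseudo-distributed Hadoop services in dependency order.
for service in \
    hadoop-hdfs-namenode \
    hadoop-hdfs-datanode \
    hadoop-hdfs-secondarynamenode \
    hadoop-yarn-resourcemanager \
    hadoop-yarn-nodemanager \
    hadoop-mapreduce-historyserver
do
    sudo service $service start
done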

25 thoughts on “Install Cloudera Hadoop (CDH4) with YARN (MRv2) in Pseudo mode on Ubuntu 12.04 LTS”

  1. Nadir Vardar says:

    after I run “sudo apt-get install hadoop-conf-pseudo”

    —–
    Reading package lists… Done
    Building dependency tree
    Reading state information… Done
    Some packages could not be installed. This may mean that you have
    requested an impossible situation or if you are using the unstable
    distribution that some required packages have not yet been created
    or been moved out of Incoming.
    The following information may help to resolve the situation:

    The following packages have unmet dependencies:
    hadoop-conf-pseudo : Depends: hadoop-hdfs-namenode (= 2.0.0+91-1.cdh4.0.1.p0.1~precise-cdh4.0.1) but it is not going to be installed
    Depends: hadoop-hdfs-datanode (= 2.0.0+91-1.cdh4.0.1.p0.1~precise-cdh4.0.1) but it is not going to be installed
    Depends: hadoop-hdfs-secondarynamenode (= 2.0.0+91-1.cdh4.0.1.p0.1~precise-cdh4.0.1) but it is not going to be installed
    E: Unable to correct problems, you have held broken packages.

    I get the above exception. Do you have any idea why?

    Thanks,

    1. akbarsahmed says:

      Hi Nadir,

      I’m not sure of the exact cause, but the following commands may help to solve the problem.

      sudo apt-get install -f

      Then try to install hadoop-conf-pseudo again:

      sudo apt-get install hadoop-conf-pseudo

      Akbar

  2. Behzad Pirvali says:

    Hi Akbar,

    I am getting the same error and install -f did not help either 😦
    Is there a way to manually install these three base packages that could not be installed, and then try hadoop-conf-pseudo again?

    Thanks,
    Behzad

    1. shoshy says:

      It’s because your Ubuntu is 32-bit and CDH4 requires 64-bit. I’ve been through this… 😦

  3. Ronak Mitra says:

    Error while executing :

    ./hadoop-hdfs-start

    Getting :

    * Starting Hadoop datanode:
    /usr/bin/env: bash: No such file or directory
    * Starting Hadoop namenode:
    /usr/bin/env: bash: No such file or directory
    * Starting Hadoop secondarynamenode:
    /usr/bin/env: bash: No such file or directory

    localhost is not working after this.

    1. akbarsahmed says:

      Which OS and version are you using?

      1. Ronak Mitra says:

        I am using Ubuntu 12.
        I set the path in /etc/environment.

      2. akbarsahmed says:

        Check your quotes in /etc/environment… you may have smart (curly) quotes instead of plain ASCII quotes.

  4. sonya says:

    sudo apt-get install hadoop-conf-pseudo
    Reading package lists… Done
    Building dependency tree
    Reading state information… Done
    Some packages could not be installed. This may mean that you have
    requested an impossible situation or if you are using the unstable
    distribution that some required packages have not yet been created
    or been moved out of Incoming.
    The following information may help to resolve the situation:

    The following packages have unmet dependencies:
    hadoop-conf-pseudo : Depends: hadoop-hdfs-namenode (= 2.0.0+552-1.cdh4.1.2.p0.27~precise-cdh4.1.2) but it is not going to be installed
    Depends: hadoop-hdfs-datanode (= 2.0.0+552-1.cdh4.1.2.p0.27~precise-cdh4.1.2) but it is not going to be installed
    Depends: hadoop-hdfs-secondarynamenode (= 2.0.0+552-1.cdh4.1.2.p0.27~precise-cdh4.1.2) but it is not going to be installed
    E: Unable to correct problems, you have held broken packages.

    How do I fix this??
    I had installed hadoop-0.20 earlier and uninstalled it.

    1. akbarsahmed says:

      The first step in fixing the error is to correct the issues reported by apt.

      You need to install:
      hadoop-hdfs-namenode
      hadoop-hdfs-datanode
      hadoop-hdfs-secondarynamenode

  5. Dan Teok says:

    Hi Akbar
    Excellent step-by-step tutorials you’ve got here!

    Hoping you can help a little on the following as I am stuck at the “Install the Cloudera Public GPG Key” step.
    I’ve done the
    “curl -s \
    http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key \
    | sudo apt-key add -”
    And then, sudo apt-get update.

    I checked my Download directory and it now has a .deb file.
    But, the problem is the next step:
    sudo apt-get install hadoop-conf-pseudo

    It is complaining that “hadoop-conf-pseudo” does not exist, so the install failed.

    I’ve followed each line of your code and descriptions above carefully, including the installation of the JDK on your blog. Would you have any idea what might have gone wrong?

    1. akbarsahmed says:

      Hi Dan,

      Did you install cdh4-repository_1.0_all.deb?

      Run the following:
      ls

      Do you see cdh4-repository_1.0_all.deb? If so, then continue.

      sudo dpkg -i cdh4-repository_1.0_all.deb

      Once this is done, you should be able to run the following.

      sudo apt-get install hadoop-conf-pseudo

      Also, it looks like Cloudera is running some maintenance right now (3 am PST in California)…so that could be the problem if everything else is fine.

      Lastly, I wrote this post a while ago, so there is an update to CDH that you can download from the Cloudera site.

      Hope this helps.

      Akbar

    2. Syed Arif says:

      Hi Dan Teok,

      Facing similar issue, Did you resolve the issue ?

      1. Dan Teok says:

        @SyedArif
        Hi Syed, No, I haven’t been able to resolve the issue. Akbar still has not replied to my last comment. I have searched the Internet but to no avail.

  6. Dan Teok says:

    Hi Akbar, Thanks for the reply above.
    Now I have the time to revisit this, and yes, there was a cdh4-repository_1.0_all.deb.

    It still says Unable to locate package hadoop-conf-pseudo. I’m still stuck at this point. My timezone here is GMT (London)

    Here is the terminal output:

    user_dan@user_dan-Production:~/Downloads/cloudera$ ll
    total 12
    drwxrwxr-x 2 user_dan user_dan 4096 Feb 24 20:06 ./
    drwxr-xr-x 3 user_dan user_dan 4096 Feb 24 20:06 ../
    -rw-rw-r-- 1 user_dan user_dan 3304 Dec 1 01:51 cdh4-repository_1.0_all.deb
    user_dan@duser_danntvli-Production:~/Downloads/cloudera$ sudo dpkg -i cdh4-repository_1.0_all.deb
    [sudo] password for user_dan:
    Selecting previously unselected package cdh4-repository.
    (Reading database … 171165 files and directories currently installed.)
    Unpacking cdh4-repository (from cdh4-repository_1.0_all.deb) …
    Setting up cdh4-repository (1.0) …
    gpg: keyring `/etc/apt/secring.gpg’ created
    gpg: keyring `/etc/apt/trusted.gpg.d/cloudera-cdh4.gpg’ created
    gpg: key 02A818DD: public key “Cloudera Apt Repository” imported
    gpg: Total number processed: 1
    gpg: imported: 1
    user_dan@user_dan-Production:~/Downloads/cloudera$ sudo apt-get install hadoop-conf-pseudo
    Reading package lists… Done
    Building dependency tree
    Reading state information… Done
    E: Unable to locate package hadoop-conf-pseudo

    user_dan@user_dan-Production:~/Downloads/cloudera$

    1. akbarsahmed says:

      Hi Dan,

      Sorry for the slow reply…work has been hectic.

      It’s a bit hard to diagnose from the log messages.

      I’ll try to find time in the next few days to install the latest CDH version. Once that’s done, I’ll write it up and send you an email.

      One other note, I noticed your hostname has the word ‘Production’ in it, yet you’re installing YARN. YARN is approaching production status, but it’s not there yet. Also, I would not want to be the first one out of the gate to deploy YARN until many of the kinks have been worked out.

      Yahoo is rolling out YARN currently, as are some startups with former FB employees, so YARN will get some early production use in the months ahead. IMO, the data layer is rarely the place to take risks on the latest technology.

      Thanks,
      Akbar

      1. Dan Teok says:

        Hi Akbar,
        Thank you for responding. I look forward to receiving your email.

        Replying to your note: the word ‘production’ is just what I called my machine when Ubuntu asked to give it a name. I didn’t want to have the usual ‘…@dan-pc’ or ‘…@myubuntupc’, etc.

        Looking forward.

        D

  7. Syed Arif says:

    Hi Akbar,

    Guess you have replied to the issue faced by “Dan Teok”; I am facing a similar issue.
    Kindly share the details… 🙂

  8. Sushant says:

    Hi Akbar,
    I verified the Java installation. I had installed Oracle Java 6 update 45 downloaded from the web8 website. Now I have changed it to Oracle jdk1.6.0_31. But I am still getting the same complaint while installing HBase.

    error:
    invoke-rc.d: initscript hbase-master, action “start” failed.
    dpkg: error processing hbase-master (–configure):
    subprocess installed post-installation script returned error exit status 1
    Setting up hbase-regionserver (0.94.2+218-1.cdh4.2.1.p0.8~precise-cdh4.2.1) …
    Starting Hadoop HBase regionserver daemon: +======================================================================+
    | Error: JAVA_HOME is not set and Java could not be found |
    +----------------------------------------------------------------------+
    | Please download the latest Sun JDK from the Sun Java web site |
    | > http://java.sun.com/javase/downloads/ < |
    | |
    | HBase requires Java 1.6 or later. |
    | NOTE: This script will find Sun Java whether you install using the |
    | binary or the RPM based installer.

    MY OBSERVATION: When I run sudo env | grep JAVA_HOME, I do not get any output. But when I run env | grep JAVA_HOME, I do get the JAVA_HOME path. I feel I have modified the bashrc file of sush (the only user) and not that of sudo/root.
    I have modified /etc/profile for all users but am unable to source it using ~$ . /etc/profile

    Note that there is only one user "sush". Pl. help me to install jdk for root user.

  9. JC says:

    Thanks for the instructions. Very clear and exact; much easier to follow than the Cloudera instructions 🙂

  10. Sushant says:

    Hi Akbar,
    I successfully installed CDH4 on Ubuntu 12.04. Unfortunately, while purging ZooKeeper and reinstalling Hadoop, CDH4 got corrupted. I reinstalled CDH4 after purging the old one, but the datanode is not getting started. All the nodes appear to start when I run sudo jps, but when I open localhost:50070 in a web browser, the live node count is 0. Is there any method of reinstalling the datanode, or do I have to remove Hadoop using Synaptic Package Manager and then reinstall and configure again?

    Pl. help.

    Sushant

  11. Aakash says:

    Hi Akbar,

    This was the most comprehensive guide for a Hadoop install.

    However, could you please elaborate on this:
    “I received one warning message when I formatted the HDFS filesystem:
    WARN common.Util: Path /var/lib/hadoop-hdfs/cache/hdfs/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.”

    I have been trying to configure Cloudera Hadoop in Pseudo mode. However, I am stuck on the above.

    How do I configure HDFS?

    Thank you,

  12. Mahes says:

    Run dpkg -L hadoop-0.20-conf-pseudo instead of
    dpkg -L hadoop-conf-pseudo
