Install Cloudera Hadoop (CDH4) with YARN (MRv2) in Pseudo mode on Ubuntu 12.04 LTS

Introduction

These instructions cover a manual installation of the Cloudera CDH4 packages on Ubuntu 12.04 LTS. They are based on my walkthrough of the Cloudera CDH4 Quick Start Guide (CDH4_Quick_Start_Guide_4.0.0.pdf).

Installation prerequisites

sudo apt-get install curl

Verify that Java is installed correctly

First, check that Java is set up correctly for your account.

echo $JAVA_HOME

The output should be:
"/usr/lib/jvm/jdk1.6.0_31"

Next, check that the JAVA_HOME environment variable is set up correctly for the sudo user.

sudo env | grep JAVA_HOME

The output should be:
JAVA_HOME="/usr/lib/jvm/jdk1.6.0_31"
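If the sudo check prints nothing, sudo is most likely stripping the variable from its environment. The sketch below is one way to make JAVA_HOME visible to sudo; it assumes the Oracle JDK lives at /usr/lib/jvm/jdk1.6.0_31, so adjust the path to match your install, and note that /etc/environment changes only take effect after you log out and back in.

# Assumption: the JDK is installed at /usr/lib/jvm/jdk1.6.0_31.
echo 'JAVA_HOME="/usr/lib/jvm/jdk1.6.0_31"' | sudo tee -a /etc/environment

# Allow sudo to pass JAVA_HOME through to the commands it runs.
echo 'Defaults env_keep += "JAVA_HOME"' | sudo tee /etc/sudoers.d/java_home
sudo chmod 0440 /etc/sudoers.d/java_home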

Download the CDH4 package

cd ~/Downloads
mkdir cloudera
cd cloudera

The download URL below is long, so copy it carefully and make sure it is pasted into the terminal as a single line.

wget http://archive.cloudera.com/cdh4/one-click-install/precise/amd64/cdh4-repository_1.0_all.deb

Install the CDH4 package

sudo dpkg -i cdh4-repository_1.0_all.deb

Install the Cloudera Public GPG Key

curl -s \
http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key \
| sudo apt-key add -
sudo apt-get update

Install CDH4 with YARN in Pseudo mode

sudo apt-get install hadoop-conf-pseudo

This one command will install a large number of packages:

  • bigtop-jsvc
  • bigtop-utils
  • hadoop
  • hadoop-conf-pseudo
  • hadoop-hdfs
  • hadoop-hdfs-datanode
  • hadoop-hdfs-namenode
  • hadoop-hdfs-secondarynamenode
  • hadoop-mapreduce
  • hadoop-mapreduce-historyserver
  • hadoop-yarn
  • hadoop-yarn-nodemanager
  • hadoop-yarn-resourcemanager
  • zookeeper

View the installed files

It’s good practice to view the list of files installed by each package. Specifically, this is a good method to learn about all of the available configuration files.

dpkg -L hadoop-conf-pseudo

Included in the list of files displayed by dpkg are the configuration files (and some other files):

/etc/hadoop/conf.pseudo/yarn-site.xml
/etc/hadoop/conf.pseudo/log4j.properties
/etc/hadoop/conf.pseudo/hdfs-site.xml
/etc/hadoop/conf.pseudo/hadoop-metrics.properties
/etc/hadoop/conf.pseudo/mapred-site.xml
/etc/hadoop/conf.pseudo/README
/etc/hadoop/conf.pseudo/hadoop-env.sh
/etc/hadoop/conf.pseudo/core-site.xml

Format the HDFS filesystem

sudo -u hdfs hdfs namenode -format

I received one warning message when I formatted the HDFS filesystem:
WARN common.Util: Path /var/lib/hadoop-hdfs/cache/hdfs/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
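The warning only means that the NameNode storage directory is given as a bare path rather than a URI. If you want to silence it (optional), you can add the file:// scheme to the value in /etc/hadoop/conf/hdfs-site.xml. The snippet below is a sketch of that change; the exact property name and path may differ in your version of the config, so check the file first.

<!-- Possible edit to /etc/hadoop/conf/hdfs-site.xml: express the path as a URI. -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///var/lib/hadoop-hdfs/cache/hdfs/dfs/name</value>
</property>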

Start HDFS

cd
mkdir bin
cd bin
vi hadoop-hdfs-start

Paste the following code in hadoop-hdfs-start:

#!/bin/bash
# Start every installed Hadoop HDFS service (datanode, namenode, secondarynamenode).
for service in /etc/init.d/hadoop-hdfs-*
do
  sudo $service start
done

chmod +x hadoop-hdfs-start
./hadoop-hdfs-start

Open the NameNode web console at http://localhost:50070.
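If you are working on a headless machine, you can confirm that HDFS is up without a browser. The commands below are optional sanity checks rather than part of the official steps.

curl -s http://localhost:50070/ | head -n 5
sudo -u hdfs hdfs dfsadmin -report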

About the HDFS filesystem

The commands in the sections below are for creating directories in the HDFS filesystem. Importantly, the HDFS directory structure is not the same as the directory structure in ext4 (i.e. your main Linux directory structure).

To work with the HDFS directory structure, you essentially prefix the familiar Linux filesystem commands with sudo -u hdfs hadoop fs (for example, sudo -u hdfs hadoop fs -ls instead of ls). Therefore, you will likely find it useful to create a .bash_aliases file that provides an easier way to type these commands.

I have created a sample .bash_aliases file in the following post: Create a .bash_aliases file

I have used my aliases throughout the setup below as they’re easier to type. For example, I use shls instead of sudo -u hdfs hadoop fs -ls.
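For reference, the aliases used in the rest of this post look roughly like the following. This is only a sketch of my .bash_aliases; the names are my own convention, so adjust them to taste.

# HDFS commands run as the hdfs superuser (sh prefix = "sudo hdfs").
alias shls='sudo -u hdfs hadoop fs -ls'
alias shmkdir='sudo -u hdfs hadoop fs -mkdir'
alias shchmod='sudo -u hdfs hadoop fs -chmod'
alias shchown='sudo -u hdfs hadoop fs -chown'

# HDFS commands run as the current user (h prefix).
alias hls='hadoop fs -ls'
alias hmkdir='hadoop fs -mkdir'
alias hcat='hadoop fs -cat'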

Create the HDFS /tmp directory

You don’t need to remove an old /tmp directory if this is the first time you’re installing Hadoop, but I’ll include the command here for completeness.

sudo -u hdfs hadoop fs -rm -r /tmp

Let’s create a new /tmp directory in HDFS.

shmkdir /tmp

Next, we’ll update the permissions on /tmp in HDFS.

shchmod -R 1777 /tmp

Create a user directory

Since this is a setup for development, we will only create one user directory. However, for a cluster or multi-user environment, you should create one user directory per MapReduce user.

Change akbar below to your username.

shmkdir /user/akbar
shchown akbar:akbar /user/akbar

Create the /var/log/hadoop-yarn directory

shmkdir /var/log/hadoop-yarn
shchown yarn:mapred /var/log/hadoop-yarn

Create the staging directory

The Hadoop -mkdir command defaults to -p (create parent directories).

shmkdir /tmp/hadoop-yarn/staging
shchmod -R 1777 /tmp/hadoop-yarn/staging

Create the done_intermediate directory

shmkdir /tmp/hadoop-yarn/staging/history/done_intermediate
shchmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate
shchown -R mapred:mapred /tmp/hadoop-yarn/staging

Verify that the directory structure is set up correctly

shls -R /

The output should look like:

drwxrwxrwt - hdfs supergroup 0 2012-06-25 15:11 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-06-25 15:11 /tmp/hadoop-yarn
drwxrwxrwt - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging
drwxr-xr-x - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging/history
drwxrwxrwt - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x - hdfs supergroup 0 2012-06-25 15:09 /user
drwxr-xr-x - akbar akbar 0 2012-06-25 15:09 /user/akbar
drwxr-xr-x - hdfs supergroup 0 2012-06-25 13:42 /var
drwxr-xr-x - hdfs supergroup 0 2012-06-25 13:42 /var/log
drwxr-xr-x - yarn mapred 0 2012-06-25 13:42 /var/log/hadoop-yarn

Start YARN

sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-nodemanager start
sudo service hadoop-mapreduce-historyserver start

I get the following error message when I start the MR History Server:
chown: changing ownership of `/var/log/hadoop-mapreduce': Operation not permitted

However, this error is not significant and can be ignored. It’ll likely be fixed in an update to CDH4/Hadoop.
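If you prefer a clean start anyway, one possible workaround (a sketch based on the error message, not an official fix; the expected owner and group may differ on your system) is to create the local log directory with the right owner before starting the history server.

# Pre-create the local log directory so the init script's chown does not fail.
sudo mkdir -p /var/log/hadoop-mapreduce
sudo chown mapred:mapred /var/log/hadoop-mapreduce
sudo service hadoop-mapreduce-historyserver restart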

Run an example application with YARN

Important
We are going to run the sample YARN app as our regular user, so we’ll use the Hadoop aliases, such as hls, and not the sudo Hadoop aliases, such as shls.

In the first command, we pass just the directory name input. Because the path is relative, HDFS creates the directory inside our user directory, i.e. /user/akbar/input.

hmkdir input

Let’s view our new directory in HDFS.

hls

Or, you can optionally specify the complete path.

hls /user/akbar

Next, we’ll put some files into the HDFS /user/akbar/input directory.

hadoop fs -put /etc/hadoop/conf/*.xml input

Set the HADOOP_MAPRED_HOME environment variable for the current user in our current session.

export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
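The export above only lasts for the current shell session. If you want it to persist across new terminals (optional, and assuming the same install path), append it to your ~/.bashrc:

echo 'export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce' >> ~/.bashrc
source ~/.bashrc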

Next, we’ll run the sample job, which does a simple grep over the files in the input directory and writes the results to the output directory. It’s worth noting that the .jar file lives on the local (physical) filesystem, while the input and output directories are in the HDFS filesystem.

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
grep input output 'dfs[a-z.]+'
hls
hls output
hcat output/part-r-00000

If you have a longer file to view, piping into less will prove helpful, such as:

hcat output/part-r-00000 | less

How to start the various Hadoop services

The following are some additional notes I took on starting the various Hadoop services (in the correct order).

Start the Hadoop namenode

sudo service hadoop-hdfs-namenode start

Start the Hadoop datanode service

sudo service hadoop-hdfs-datanode start

Start the Hadoop secondarynamenode service

sudo service hadoop-hdfs-secondarynamenode start

Start the Hadoop resourcemanager service

sudo service hadoop-yarn-resourcemanager start

Start the Hadoop nodemanager service

sudo service hadoop-yarn-nodemanager start

Start the Hadoop historyserver service

sudo service hadoop-mapreduce-historyserver start
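If you want to bring everything up in one go, a small wrapper script in the same spirit as hadoop-hdfs-start works well. The sketch below simply follows the ordering above (HDFS first, then YARN, then the history server).

#!/bin/bash
# Start all pseudo-distributed Hadoop services in dependency order.
for service in \
    hadoop-hdfs-namenode \
    hadoop-hdfs-datanode \
    hadoop-hdfs-secondarynamenode \
    hadoop-yarn-resourcemanager \
    hadoop-yarn-nodemanager \
    hadoop-mapreduce-historyserver
do
    sudo service $service start
done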

25 thoughts on “Install Cloudera Hadoop (CDH4) with YARN (MRv2) in Pseudo mode on Ubuntu 12.04 LTS”

  1. Nadir Vardar says:

    after I run “sudo apt-get install hadoop-conf-pseudo”

    —–
    Reading package lists… Done
    Building dependency tree
    Reading state information… Done
    Some packages could not be installed. This may mean that you have
    requested an impossible situation or if you are using the unstable
    distribution that some required packages have not yet been created
    or been moved out of Incoming.
    The following information may help to resolve the situation:

    The following packages have unmet dependencies:
    hadoop-conf-pseudo : Depends: hadoop-hdfs-namenode (= 2.0.0+91-1.cdh4.0.1.p0.1~precise-cdh4.0.1) but it is not going to be installed
    Depends: hadoop-hdfs-datanode (= 2.0.0+91-1.cdh4.0.1.p0.1~precise-cdh4.0.1) but it is not going to be installed
    Depends: hadoop-hdfs-secondarynamenode (= 2.0.0+91-1.cdh4.0.1.p0.1~precise-cdh4.0.1) but it is not going to be installed
    E: Unable to correct problems, you have held broken packages.

    I get the above exception. Do you have any idea why?

    Thanks,

    1. akbarsahmed says:

      Hi Nadir,

      I’m not sure of the exact cause, but the following commands may help to solve the problem.

      sudo apt-get install -f

      Then try to install hadoop-conf-pseudo again:

      sudo apt-get install hadoop-conf-pseudo

      Akbar

  2. Behzad Pirvali says:

    Hi Akbar,

    I am getting the same error and install -f did not help either 😦
    Is there a way to manually install these three base packages that could not be installed, and then try hadoop-conf-pseudo again?

    Thanks,
    Behzad

    1. shoshy says:

      It’s because your Ubuntu is 32-bit and CDH4 requires 64-bit. I’ve been through this… 😦

  3. Ronak Mitra says:

    Error while executing :

    ./hadoop-hdfs-start

    Getting :

    * Starting Hadoop datanode:
    /usr/bin/env: bash: No such file or directory
    * Starting Hadoop namenode:
    /usr/bin/env: bash: No such file or directory
    * Starting Hadoop secondarynamenode:
    /usr/bin/env: bash: No such file or directory

    localhost is not working after this.

    1. akbarsahmed says:

      Which OS and version are you using?

      1. Ronak Mitra says:

        I am using Ubuntu 12.
        I set the path in /etc/environment.

      2. akbarsahmed says:

        Check your quotes in /etc/environment… you may have smart (curly) quotes instead of plain ASCII quotes.

  4. sonya says:

    sudo apt-get install hadoop-conf-pseudo
    Reading package lists… Done
    Building dependency tree
    Reading state information… Done
    Some packages could not be installed. This may mean that you have
    requested an impossible situation or if you are using the unstable
    distribution that some required packages have not yet been created
    or been moved out of Incoming.
    The following information may help to resolve the situation:

    The following packages have unmet dependencies:
    hadoop-conf-pseudo : Depends: hadoop-hdfs-namenode (= 2.0.0+552-1.cdh4.1.2.p0.27~precise-cdh4.1.2) but it is not going to be installed
    Depends: hadoop-hdfs-datanode (= 2.0.0+552-1.cdh4.1.2.p0.27~precise-cdh4.1.2) but it is not going to be installed
    Depends: hadoop-hdfs-secondarynamenode (= 2.0.0+552-1.cdh4.1.2.p0.27~precise-cdh4.1.2) but it is not going to be installed
    E: Unable to correct problems, you have held broken packages.

    How do I fix this??
    I had installed hadoop-0.20 earlier and uninstalled it.

    1. akbarsahmed says:

      The first step in fixing the error is to correct the issues reported by apt.

      You need to install:
      hadoop-hdfs-namenode
      hadoop-hdfs-datanode
      hadoop-hdfs-secondarynamenode

  5. Dan Teok says:

    Hi Akbar
    Excellent step-by-step tutorials you’ve got here!

    Hoping you can help a little on the following as I am stuck at the “Install the Cloudera Public GPG Key” step.
    I’ve done the
    “curl -s \
    http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key \
    | sudo apt-key add -”
    And then, sudo apt-get update.

    I checked my Download directory and it now has a .deb file.
    But, the problem is the next step:
    sudo apt-get install hadoop-conf-pseudo

    It is complaining that “hadoop-conf-pseudo” does not exist, so the install failed.

    I’ve followed each line of your code and descriptions above carefully, including the installation of the JDK on your blog. Would you have any idea what might have gone wrong?

    1. akbarsahmed says:

      Hi Dan,

      Did you install cdh4-repository_1.0_all.deb?

      Run the following:
      ls

      Do you see cdh4-repository_1.0_all.deb? If so, then continue.

      sudo dpkg -i cdh4-repository_1.0_all.deb

      Once this is done, you should be able to run the following.

      sudo apt-get install hadoop-conf-pseudo

      Also, it looks like Cloudera is running some maintenance right now (3 am PST in California)…so that could be the problem if everything else is fine.

      Lastly, I wrote this post a while ago, so there is an update to CDH that you can download from the Cloudera site.

      Hope this helps.

      Akbar

    2. Syed Arif says:

      Hi Dan Teok,

      Facing similar issue, Did you resolve the issue ?

      1. Dan Teok says:

        @SyedArif
        Hi Syed, No, I haven’t been able to resolve the issue. Akbar still has not replied to my last comment. I have searched the Internet but to no avail.

  6. Dan Teok says:

    Hi Akbar, Thanks for the reply above.
    Now I have the time to revisit this, and yes, there was a cdh4-repository_1.0_all.deb.

    It still says Unable to locate package hadoop-conf-pseudo. I’m still stuck at this point. My timezone here is GMT (London)

    Here is the terminal output:

    user_dan@user_dan-Production:~/Downloads/cloudera$ ll
    total 12
    drwxrwxr-x 2 user_dan user_dan 4096 Feb 24 20:06 ./
    drwxr-xr-x 3 user_dan user_dan 4096 Feb 24 20:06 ../
    -rw-rw-r-- 1 user_dan user_dan 3304 Dec 1 01:51 cdh4-repository_1.0_all.deb
    user_dan@duser_danntvli-Production:~/Downloads/cloudera$ sudo dpkg -i cdh4-repository_1.0_all.deb
    [sudo] password for user_dan:
    Selecting previously unselected package cdh4-repository.
    (Reading database … 171165 files and directories currently installed.)
    Unpacking cdh4-repository (from cdh4-repository_1.0_all.deb) …
    Setting up cdh4-repository (1.0) …
    gpg: keyring `/etc/apt/secring.gpg’ created
    gpg: keyring `/etc/apt/trusted.gpg.d/cloudera-cdh4.gpg’ created
    gpg: key 02A818DD: public key “Cloudera Apt Repository” imported
    gpg: Total number processed: 1
    gpg: imported: 1
    user_dan@user_dan-Production:~/Downloads/cloudera$ sudo apt-get install hadoop-conf-pseudo
    Reading package lists… Done
    Building dependency tree
    Reading state information… Done
    E: Unable to locate package hadoop-conf-pseudo

    user_dan@user_dan-Production:~/Downloads/cloudera$

    1. akbarsahmed says:

      Hi Dan,

      Sorry for the slow reply…work has been hectic.

      It’s a bit hard to diagnose from the log messages.

      I’ll try to find time in the next few days to install the latest CDH version. Once that’s done, I’ll write it up and send you an email.

      One other note, I noticed your hostname has the word ‘Production’ in it, yet you’re installing YARN. YARN is approaching production status, but it’s not there yet. Also, I would not want to be the first one out of the gate to deploy YARN until many of the kinks have been worked out.

      Yahoo is rolling out YARN currently, as are some startups with former FB employees, so YARN will get some early production use in the months ahead. IMO, the data layer is rarely the place to take risks on the latest technology.

      Thanks,
      Akbar

      1. Dan Teok says:

        Hi Akbar,
        Thank you for responding. I look forward to receiving your email.

        Replying to your note: the word ‘production’ is just what I called my machine when Ubuntu asked to give it a name. I didn’t want to have the usual ‘…@dan-pc’ or ‘…@myubuntupc’, etc.

        Looking forward.

        D

  7. Syed Arif says:

    Hi Akbar,

    Guess you have replied to the issue faced by “Dan Teok”; I am facing a similar issue.
    Kindly share the details… 🙂

  8. Sushant says:

    Hi Akbar,
    I verified the Java installation. I had installed Oracle Java 6 update 45 downloaded from the web8 website. Now I have changed it to Oracle jdk1.6.0_31. But I am still getting the same complaint while installing HBase.

    error:
    invoke-rc.d: initscript hbase-master, action “start” failed.
    dpkg: error processing hbase-master (–configure):
    subprocess installed post-installation script returned error exit status 1
    Setting up hbase-regionserver (0.94.2+218-1.cdh4.2.1.p0.8~precise-cdh4.2.1) …
    Starting Hadoop HBase regionserver daemon: +======================================================================+
    | Error: JAVA_HOME is not set and Java could not be found |
    +----------------------------------------------------------------------+
    | Please download the latest Sun JDK from the Sun Java web site |
    | > http://java.sun.com/javase/downloads/ < |
    | |
    | HBase requires Java 1.6 or later. |
    | NOTE: This script will find Sun Java whether you install using the |
    | binary or the RPM based installer.

    MY OBSERVATION: When I run sudo env | grep JAVA_HOME, I do not get any output. But when I run env | grep JAVA_HOME, I do get the JAVA_HOME path. I feel I have modified the bashrc file of sush (the only user) and not that of sudo/root.
    I have modified /etc/profile for all users but am unable to source it using ~$ . /etc/profile

    Note that there is only one user "sush". Pl. help me to install jdk for root user.

  9. JC says:

    Thanks for the instructions. Very clear and exact; much easier to follow than the Cloudera instructions 🙂

  10. Sushant says:

    Hi Akbar,
    I successfully installed CDH4 on Ubuntu 12.04. Unfortunately, while purging ZooKeeper and reinstalling Hadoop, CDH4 got corrupted. I reinstalled CDH4 after purging the old one, but the datanode is not getting started. All the nodes appear to start when I run sudo jps, but when I open localhost:50070 in a web browser, the live node count is 0. Is there any method of reinstalling the datanode, or do I have to remove Hadoop using Synaptic Package Manager and then reinstall and configure again?

    Pl. help.

    Sushant

  11. Aakash says:

    Hi Akbar,

    This was the most comprehensive guide for a Hadoop install.

    However, could you please elaborate on this:
    “I received one warning message when I formatted the HDFS filesystem:
    WARN common.Util: Path /var/lib/hadoop-hdfs/cache/hdfs/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.”

    I have been trying to configure Cloudera Hadoop in Pseudo mode. However, I am stuck on the above.

    How do I configure HDFS?

    Thank you,

  12. Mahes says:

    Run dpkg -L hadoop-0.20-conf-pseudo instead of
    dpkg -L hadoop-conf-pseudo
