Introduction
These instructions cover a manual installation of the Cloudera CDH4 packages on Ubuntu 12.04 LTS and are based on my following the Cloudera CDH4 Quick Start Guide (CDH4_Quick_Start_Guide_4.0.0.pdf).
Installation prerequisites
sudo apt-get install curl
Verify that Java is installed correctly
First, check that Java is set up correctly for your account.
echo $JAVA_HOME
The output should be:
/usr/lib/jvm/jdk1.6.0_31
Next, check that the JAVA_HOME environment variable is set up correctly for the sudo user.
sudo env | grep JAVA_HOME
The output should be:
JAVA_HOME=/usr/lib/jvm/jdk1.6.0_31
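If either check comes back empty, a common fix is to define JAVA_HOME system-wide in /etc/environment, which login sessions (and, via PAM, sudo) typically pick up. This is a sketch; the JDK path is the one assumed throughout this post, so adjust it to wherever you installed the JDK.

```shell
# Line to add to /etc/environment (a sketch; assumes the JDK path
# used elsewhere in this post)
JAVA_HOME="/usr/lib/jvm/jdk1.6.0_31"
```

After editing /etc/environment, log out and back in (or start a new session) for the change to take effect.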
Download the CDH4 package
cd ~/Downloads
mkdir cloudera
cd cloudera
The link below is too long to wrap cleanly, so I added an HTML link to the post. Right-click the link, choose Copy Link Location, and paste it into the terminal.
wget http://archive.cloudera.com/cdh4/one-click-install/precise/amd64/cdh4-repository_1.0_all.deb
Install the CDH4 package
sudo dpkg -i cdh4-repository_1.0_all.deb
Install the Cloudera Public GPG Key
curl -s \
  http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key \
  | sudo apt-key add -
sudo apt-get update
Install CDH4 with YARN in Pseudo mode
sudo apt-get install hadoop-conf-pseudo
This one command will install a large number of packages:
- bigtop-jsvc
- bigtop-utils
- hadoop
- hadoop-conf-pseudo
- hadoop-hdfs
- hadoop-hdfs-datanode
- hadoop-hdfs-namenode
- hadoop-hdfs-secondarynamenode
- hadoop-mapreduce
- hadoop-mapreduce-historyserver
- hadoop-yarn
- hadoop-yarn-nodemanager
- hadoop-yarn-resourcemanager
- zookeeper
View the installed files
It’s good practice to view the list of files installed by each package; in particular, it’s a good way to discover all of the available configuration files.
dpkg -L hadoop-conf-pseudo
Included in the list of files displayed by dpkg are the configuration files (and some other files):
/etc/hadoop/conf.pseudo/yarn-site.xml
/etc/hadoop/conf.pseudo/log4j.properties
/etc/hadoop/conf.pseudo/hdfs-site.xml
/etc/hadoop/conf.pseudo/hadoop-metrics.properties
/etc/hadoop/conf.pseudo/mapred-site.xml
/etc/hadoop/conf.pseudo/README
/etc/hadoop/conf.pseudo/hadoop-env.sh
/etc/hadoop/conf.pseudo/core-site.xml
Format the HDFS filesystem
sudo -u hdfs hdfs namenode -format
I received one warning message when I formatted the HDFS filesystem:
WARN common.Util: Path /var/lib/hadoop-hdfs/cache/hdfs/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
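For what it’s worth, this warning can usually be silenced by writing the path as a file:// URI in hdfs-site.xml. The fragment below is a sketch; check the property name and value against your own /etc/hadoop/conf.pseudo/hdfs-site.xml before changing anything.

```xml
<!-- hdfs-site.xml: express the name directory as a URI (sketch) -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///var/lib/hadoop-hdfs/cache/hdfs/dfs/name</value>
</property>
```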
Start HDFS
cd
mkdir bin
cd bin
vi hadoop-hdfs-start
Paste the following code in hadoop-hdfs-start:
#!/bin/bash
for service in /etc/init.d/hadoop-hdfs-*
do
sudo $service start
done
chmod +x hadoop-hdfs-start
./hadoop-hdfs-start
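A matching stop script is handy for shutting HDFS down cleanly. This mirrors hadoop-hdfs-start above; the name hadoop-hdfs-stop is my own convention, not from the Cloudera guide.

```shell
# Sketch: companion stop script, mirroring hadoop-hdfs-start above.
# The script name hadoop-hdfs-stop is my own convention.
mkdir -p ~/bin
cat > ~/bin/hadoop-hdfs-stop <<'EOF'
#!/bin/bash
for service in /etc/init.d/hadoop-hdfs-*
do
  sudo $service stop
done
EOF
chmod +x ~/bin/hadoop-hdfs-stop
```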
Open the NameNode web console at http://localhost:50070.
About the HDFS filesystem
The commands in the sections below are for creating directories in the HDFS filesystem. Importantly, the HDFS directory structure is not the same as the directory structure in ext4 (i.e. your main Linux directory structure).
To view the HDFS directory structure, you essentially prefix standard Linux commands with sudo -u hdfs hadoop fs -. You will therefore likely find it useful to create a .bash_aliases file that provides a shorter way to type these commands.
I have created a sample .bash_aliases file in the following post: Create a .bash_aliases file
I have used my aliases to set up Hadoop as they’re easier to type. For example, I use hls instead of sudo -u hdfs hadoop fs -ls.
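For reference, the sh-prefixed and h-prefixed aliases used below can be defined along these lines. This is a sketch inferred from how the aliases are used in this post; see the linked .bash_aliases post for the actual file.

```shell
# Sketch of the aliases used in this post, inferred from their usage;
# hX runs as the current user, shX runs as the hdfs superuser.
alias hls='hadoop fs -ls'
alias hmkdir='hadoop fs -mkdir'
alias hcat='hadoop fs -cat'
alias shls='sudo -u hdfs hadoop fs -ls'
alias shmkdir='sudo -u hdfs hadoop fs -mkdir'
alias shchmod='sudo -u hdfs hadoop fs -chmod'
alias shchown='sudo -u hdfs hadoop fs -chown'
```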
Create the HDFS /tmp directory
You don’t need to remove an old /tmp directory if this is the first time you’re installing Hadoop, but I’ll include the command here for completeness.
sudo -u hdfs hadoop fs -rm -r /tmp
Let’s create a new /tmp directory in HDFS.
shmkdir /tmp
Next, we’ll update the permissions on /tmp in HDFS.
shchmod -R 1777 /tmp
Create a user directory
Since this is a setup for development, we will only create one user directory. However, for a cluster or multi-user environment, you should create one user directory per MapReduce user.
Change akbar below to your username.
shmkdir /user/akbar
shchown akbar:akbar /user/akbar
Create the /var/log/hadoop-yarn directory
shmkdir /var/log/hadoop-yarn
shchown yarn:mapred /var/log/hadoop-yarn
Create the staging directory
The Hadoop -mkdir command defaults to -p (create parent directory).
shmkdir /tmp/hadoop-yarn/staging
shchmod -R 1777 /tmp/hadoop-yarn/staging
Create the done_intermediate directory
shmkdir /tmp/hadoop-yarn/staging/history/done_intermediate
shchmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate
shchown -R mapred:mapred /tmp/hadoop-yarn/staging
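The HDFS directory steps above can be collected into one helper script for repeat installs. This is a sketch: ~/bin/hdfs-setup-dirs is a hypothetical name of my own, and the long-form commands are used so the script works without the .bash_aliases file. Change akbar to your username.

```shell
# Sketch: collect the HDFS directory setup into one helper script.
mkdir -p ~/bin
cat > ~/bin/hdfs-setup-dirs <<'EOF'
#!/bin/bash
set -e
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir /user/akbar
sudo -u hdfs hadoop fs -chown akbar:akbar /user/akbar
sudo -u hdfs hadoop fs -mkdir /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
# -mkdir defaults to -p, so parent directories are created as needed
sudo -u hdfs hadoop fs -mkdir /tmp/hadoop-yarn/staging/history/done_intermediate
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging
sudo -u hdfs hadoop fs -chown -R mapred:mapred /tmp/hadoop-yarn/staging
EOF
chmod +x ~/bin/hdfs-setup-dirs
```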
Verify that the directory structure is setup correctly
shls -R /
The output should look like:
drwxrwxrwt - hdfs supergroup 0 2012-06-25 15:11 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-06-25 15:11 /tmp/hadoop-yarn
drwxrwxrwt - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging
drwxr-xr-x - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging/history
drwxrwxrwt - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x - hdfs supergroup 0 2012-06-25 15:09 /user
drwxr-xr-x - akbar akbar 0 2012-06-25 15:09 /user/akbar
drwxr-xr-x - hdfs supergroup 0 2012-06-25 13:42 /var
drwxr-xr-x - hdfs supergroup 0 2012-06-25 13:42 /var/log
drwxr-xr-x - yarn mapred 0 2012-06-25 13:42 /var/log/hadoop-yarn
Start YARN
sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-nodemanager start
sudo service hadoop-mapreduce-historyserver start
I get the following error message when I start the MR History Server:
chown: changing ownership of `/var/log/hadoop-mapreduce': Operation not permitted
However, this error is not significant and can be ignored. It’ll likely be fixed in an update to CDH4/Hadoop.
Run an example application with YARN
Important
We are going to run the sample YARN app as our regular user, so we’ll use the Hadoop aliases, such as hls, and not the sudo Hadoop aliases, such as shls.
In the first command, we’ll enter just the directory name input. Notice, though, that the directory is automatically created under our user directory, as /user/akbar/input.
hmkdir input
Let’s view our new directory in HDFS.
hls
Or, you can optionally specify the complete path.
hls /user/akbar
Next, we’ll put some files into the HDFS /user/akbar/input directory.
hadoop fs -put /etc/hadoop/conf/*.xml input
Set the HADOOP_MAPRED_HOME environment variable for the current user in our current session.
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
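The export above only lasts for the current session. To make it persistent, you could append it to ~/.bashrc along these lines (a sketch):

```shell
# Persist HADOOP_MAPRED_HOME across sessions (skip if already present)
grep -q 'HADOOP_MAPRED_HOME' ~/.bashrc 2>/dev/null || \
  echo 'export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce' >> ~/.bashrc
```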
Next, we’ll run the sample job, which is a simple grep of the files in the input directory that outputs the results to the output directory. It’s worth noting that the .jar file is located on the physical filesystem, while the input and output directories are in the HDFS filesystem.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  grep input output 'dfs[a-z.]+'
hls
hls output
hcat output/part-r-00000
If you have a longer file to view, piping into less will prove helpful, such as:
hcat output/part-r-00000 | less
How to start the various Hadoop services
The following are some additional notes I took on starting the various Hadoop services (in the correct order).
Start the Hadoop namenode
sudo service hadoop-hdfs-namenode start
Start the Hadoop datanode service
sudo service hadoop-hdfs-datanode start
Start the Hadoop secondarynamenode service
sudo service hadoop-hdfs-secondarynamenode start
Start the Hadoop resourcemanager service
sudo service hadoop-yarn-resourcemanager start
Start the Hadoop nodemanager service
sudo service hadoop-yarn-nodemanager start
Start the Hadoop historyserver service
sudo service hadoop-mapreduce-historyserver start
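The six commands above can be wrapped in one start script that preserves the order. This is a sketch; ~/bin/hadoop-start-all is a hypothetical name of my own, following the same pattern as hadoop-hdfs-start earlier.

```shell
# Sketch: start all Hadoop services in dependency order.
mkdir -p ~/bin
cat > ~/bin/hadoop-start-all <<'EOF'
#!/bin/bash
# Order matters: HDFS first, then YARN, then the MR history server.
for service in hadoop-hdfs-namenode hadoop-hdfs-datanode \
               hadoop-hdfs-secondarynamenode hadoop-yarn-resourcemanager \
               hadoop-yarn-nodemanager hadoop-mapreduce-historyserver
do
  sudo service $service start
done
EOF
chmod +x ~/bin/hadoop-start-all
```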
After I run “sudo apt-get install hadoop-conf-pseudo”:
-----
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
hadoop-conf-pseudo : Depends: hadoop-hdfs-namenode (= 2.0.0+91-1.cdh4.0.1.p0.1~precise-cdh4.0.1) but it is not going to be installed
Depends: hadoop-hdfs-datanode (= 2.0.0+91-1.cdh4.0.1.p0.1~precise-cdh4.0.1) but it is not going to be installed
Depends: hadoop-hdfs-secondarynamenode (= 2.0.0+91-1.cdh4.0.1.p0.1~precise-cdh4.0.1) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
I get the above exception, do you have any idea why ?
Thanks,
Hi Nadir,
I’m not sure of the exact cause, but the following commands may help to solve the problem.
sudo apt-get install -f
Then try to install hadoop-conf-pseudo again:
sudo apt-get install hadoop-conf-pseudo
Akbar
Hi Akbar,
I am getting the same error and install -f did not help either 😦
Is there a way to manually install these three base packages that could not be installed, and then try the hadoop-conf-pseudo install again?
Thanks,
Behzad
It’s because your Ubuntu is 32-bit and CDH4 requires 64-bit. I’ve been through this… 😦
Error while executing :
./hadoop-hdfs-start
Getting :
* Starting Hadoop datanode:
/usr/bin/env: bash: No such file or directory
* Starting Hadoop namenode:
/usr/bin/env: bash: No such file or directory
* Starting Hadoop secondarynamenode:
/usr/bin/env: bash: No such file or directory
localhost not working after this.
Which OS and version are you using?
I am using Ubuntu 12.
I set the path in /etc/environment.
Check your quotes in /etc/environment… you may have smart (curly) quotes instead of plain ones.
sudo apt-get install hadoop-conf-pseudo
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
hadoop-conf-pseudo : Depends: hadoop-hdfs-namenode (= 2.0.0+552-1.cdh4.1.2.p0.27~precise-cdh4.1.2) but it is not going to be installed
Depends: hadoop-hdfs-datanode (= 2.0.0+552-1.cdh4.1.2.p0.27~precise-cdh4.1.2) but it is not going to be installed
Depends: hadoop-hdfs-secondarynamenode (= 2.0.0+552-1.cdh4.1.2.p0.27~precise-cdh4.1.2) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
How do I fix this?
I had installed hadoop-0.20 earlier and uninstalled it.
The first step in fixing the error is to correct the issues reported by apt.
You need to install:
hadoop-hdfs-namenode
hadoop-hdfs-datanode
hadoop-hdfs-secondarynamenode
Hi Akbar
Excellent step-by-step tutorials you’ve got here!
Hoping you can help a little on the following as I am stuck at the “Install the Cloudera Public GPG Key” step.
I’ve done the
“curl -s \
http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key \
| sudo apt-key add -”
And then, sudo apt-get update.
I checked my Download directory and it now has a .deb file.
But, the problem is the next step:
sudo apt-get install hadoop-conf-pseudo
It is complaining that “hadoop-conf-pseudo” does not exist; therefore, the install failed.
I’ve followed each line of your code and the descriptions carefully, including the installation of the JDK on your blog. Would you have any idea what might have gone wrong?
Hi Dan,
Did you install cdh4-repository_1.0_all.deb?
Run the following:
ls
Do you see cdh4-repository_1.0_all.deb? If so, then continue.
sudo dpkg -i cdh4-repository_1.0_all.deb
Once this is done, you should be able to run the following.
sudo apt-get install hadoop-conf-pseudo
Also, it looks like Cloudera is running some maintenance right now (3 am PST in California)…so that could be the problem if everything else is fine.
Lastly, I wrote this post a while ago, so there is an update to CDH that you can download from the Cloudera site.
Hope this helps.
Akbar
Hi Dan Teok,
Facing similar issue, Did you resolve the issue ?
@SyedArif
Hi Syed, No, I haven’t been able to resolve the issue. Akbar still has not replied to my last comment. I have searched the Internet but to no avail.
Hi Akbar, Thanks for the reply above.
Now I have the time to revisit this, and yes, there was a cdh4-repository_1.0_all.deb.
It still says Unable to locate package hadoop-conf-pseudo. I’m still stuck at this point. My timezone here is GMT (London)
Here is the terminal output:
user_dan@user_dan-Production:~/Downloads/cloudera$ ll
total 12
drwxrwxr-x 2 user_dan user_dan 4096 Feb 24 20:06 ./
drwxr-xr-x 3 user_dan user_dan 4096 Feb 24 20:06 ../
-rw-rw-r-- 1 user_dan user_dan 3304 Dec 1 01:51 cdh4-repository_1.0_all.deb
user_dan@duser_danntvli-Production:~/Downloads/cloudera$ sudo dpkg -i cdh4-repository_1.0_all.deb
[sudo] password for user_dan:
Selecting previously unselected package cdh4-repository.
(Reading database ... 171165 files and directories currently installed.)
Unpacking cdh4-repository (from cdh4-repository_1.0_all.deb) ...
Setting up cdh4-repository (1.0) …
gpg: keyring `/etc/apt/secring.gpg' created
gpg: keyring `/etc/apt/trusted.gpg.d/cloudera-cdh4.gpg' created
gpg: key 02A818DD: public key “Cloudera Apt Repository” imported
gpg: Total number processed: 1
gpg: imported: 1
user_dan@user_dan-Production:~/Downloads/cloudera$ sudo apt-get install hadoop-conf-pseudo
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package hadoop-conf-pseudo
user_dan@user_dan-Production:~/Downloads/cloudera$
Hi Dan,
Sorry for the slow reply…work has been hectic.
It’s a bit hard to diagnose from the log messages.
I’ll try to find time in the next few days to install the latest CDH version. Once that’s done, I’ll write it up and send you an email.
One other note, I noticed your hostname has the word ‘Production’ in it, yet you’re installing YARN. YARN is approaching production status, but it’s not there yet. Also, I would not want to be the first one out of the gate to deploy YARN until many of the kinks have been worked out.
Yahoo is rolling out YARN currently, as are some startups with former FB employees, so YARN will get some early production use in the months ahead. IMO, the data layer is rarely the place to take risks on the latest technology.
Thanks,
Akbar
Hi Akbar,
Thank you for responding. I look forward to receiving your email.
Replying to your note: the word ‘production’ is just what I called my machine when Ubuntu asked to give it a name. I didn’t want to have the usual ‘…@dan-pc’ or ‘…@myubuntupc’, etc.
Looking forward.
D
Hi Akbar,
Guess you have replied to the issue faced by “Dan Teok”; I am facing a similar issue.
Kindly share the details… 🙂
Hi Akbar,
I verified the Java installation. I had installed Oracle Java 6 update 45 downloaded from the web8 website. Now I have changed it to Oracle jdk1.6.0_31, but I am still getting the same complaint while installing HBase.
error:
invoke-rc.d: initscript hbase-master, action “start” failed.
dpkg: error processing hbase-master (–configure):
subprocess installed post-installation script returned error exit status 1
Setting up hbase-regionserver (0.94.2+218-1.cdh4.2.1.p0.8~precise-cdh4.2.1) …
Starting Hadoop HBase regionserver daemon:
+======================================================================+
|        Error: JAVA_HOME is not set and Java could not be found       |
+----------------------------------------------------------------------+
| Please download the latest Sun JDK from the Sun Java web site        |
|     > http://java.sun.com/javase/downloads/ <                        |
|                                                                      |
| HBase requires Java 1.6 or later.                                    |
| NOTE: This script will find Sun Java whether you install using the   |
|       binary or the RPM based installer.                             |
MY OBSERVATION: When I run sudo env | grep JAVA_HOME, I do not get any output. But when I run env | grep JAVA_HOME, I get the JAVA_HOME path. I feel I have modified the .bashrc file of sush (the only user) and not the environment that sudo sees.
I have modified /etc/profile for all users but am unable to source it using . /etc/profile
Note that there is only one user "sush". Please help me install the JDK for the root user.
Thanks for the instructions. Very clear and exact; much easier to follow than the Cloudera instructions 🙂
Hi Akbar,
I successfully installed CDH4 on Ubuntu 12.04. Unfortunately, while purging ZooKeeper and reinstalling Hadoop, CDH4 got corrupted. I reinstalled CDH4 after purging the old one, but the datanode is not getting started. All nodes show as started when running sudo jps, but when I open localhost:50070 in a web browser, the live node count is 0. Is there any method of reinstalling the datanode, or do I have to remove Hadoop using Synaptic Manager and then reinstall and configure it again?
Pl. help.
Sushant
Hi Akbar,
This was the most comprehensive guide for Hadoop Install:
However, could you please elaborate on this:
“I received one warning message when I formatted the HDFS filesystem:
WARN common.Util: Path /var/lib/hadoop-hdfs/cache/hdfs/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.”
I have been trying to configure Cloudera Hadoop in Pseudo mode. However, I am stuck on the above.
How do I configure HDFS?
Thank you,
Run dpkg -L hadoop-0.20-conf-pseudo instead of
dpkg -L hadoop-conf-pseudo