Introduction
These instructions cover a manual installation of the Cloudera CDH4 packages on Ubuntu 12.04 LTS. They are based on my walk-through of the Cloudera CDH4 Quick Start Guide (CDH4_Quick_Start_Guide_4.0.0.pdf).
Installation prerequisites
sudo apt-get install curl
Verify that Java is installed correctly
First, check that Java is set up correctly for your account.
echo $JAVA_HOME
The output should be:
"/usr/lib/jvm/jdk1.6.0_31"
Next, check that the JAVA_HOME environment variable is also set correctly when running under sudo.
sudo env | grep JAVA_HOME
The output should be:
JAVA_HOME="/usr/lib/jvm/jdk1.6.0_31"
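If JAVA_HOME is not visible under sudo, one common fix (my own approach, not from the Cloudera guide) is to define it system-wide in /etc/environment and tell sudo to preserve it:
echo 'JAVA_HOME="/usr/lib/jvm/jdk1.6.0_31"' | sudo tee -a /etc/environment
sudo visudo   # then add the line:  Defaults env_keep += "JAVA_HOME"
Log out and back in for the /etc/environment change to take effect.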
Download the CDH4 package
cd ~/Downloads
mkdir cloudera
cd cloudera
The URL below is long, so it may wrap in your browser. Right-click the link, choose Copy Link Location, and paste the full URL into the terminal.
wget http://archive.cloudera.com/cdh4/one-click-install/precise/amd64/cdh4-repository_1.0_all.deb
Install the CDH4 package
sudo dpkg -i cdh4-repository_1.0_all.deb
Install the Cloudera Public GPG Key
curl -s \
http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key \
| sudo apt-key add -
sudo apt-get update
Install CDH4 with YARN in pseudo-distributed mode
sudo apt-get install hadoop-conf-pseudo
This one command will install a large number of packages:
- bigtop-jsvc
- bigtop-utils
- hadoop
- hadoop-conf-pseudo
- hadoop-hdfs
- hadoop-hdfs-datanode
- hadoop-hdfs-namenode
- hadoop-hdfs-secondarynamenode
- hadoop-mapreduce
- hadoop-mapreduce-historyserver
- hadoop-yarn
- hadoop-yarn-nodemanager
- hadoop-yarn-resourcemanager
- zookeeper
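To confirm that everything landed, you can optionally list the Hadoop-related packages (this check is my addition, not part of the guide):
dpkg -l | grep -E 'hadoop|bigtop|zookeeper'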
View the installed files
It’s good practice to review the list of files installed by each package; in particular, it’s an easy way to discover the available configuration files.
dpkg -L hadoop-conf-pseudo
Among the files listed by dpkg are the configuration files (plus a few others):
/etc/hadoop/conf.pseudo/yarn-site.xml
/etc/hadoop/conf.pseudo/log4j.properties
/etc/hadoop/conf.pseudo/hdfs-site.xml
/etc/hadoop/conf.pseudo/hadoop-metrics.properties
/etc/hadoop/conf.pseudo/mapred-site.xml
/etc/hadoop/conf.pseudo/README
/etc/hadoop/conf.pseudo/hadoop-env.sh
/etc/hadoop/conf.pseudo/core-site.xml
Format the HDFS filesystem
sudo -u hdfs hdfs namenode -format
I received one warning message when I formatted the HDFS filesystem:
WARN common.Util: Path /var/lib/hadoop-hdfs/cache/hdfs/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
The format still completes successfully, so this warning can be ignored for a pseudo-distributed setup.
Start HDFS
We'll create a small helper script in ~/bin that starts every HDFS service.
cd
mkdir bin
cd bin
vi hadoop-hdfs-start
Paste the following code in hadoop-hdfs-start:
#!/bin/bash
# Start every installed HDFS service (namenode, datanode, secondarynamenode).
for service in /etc/init.d/hadoop-hdfs-*
do
  sudo $service start
done
chmod +x hadoop-hdfs-start
./hadoop-hdfs-start
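To confirm the daemons started, you can optionally query each service for its status (my addition, not part of the guide):
for service in /etc/init.d/hadoop-hdfs-*
do
  sudo $service status
done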
Open the NameNode web console at http://localhost:50070.
About the HDFS filesystem
The commands in the sections below create directories in the HDFS filesystem. Importantly, the HDFS directory structure is not the same as the directory structure in ext4 (i.e. your main Linux filesystem).
To work with the HDFS directory structure as the HDFS superuser, you essentially prefix standard filesystem commands with sudo -u hdfs hadoop fs -. You will therefore likely find it useful to create a .bash_aliases file that shortens the typing.
I have created a sample .bash_aliases file in the following post: Create a .bash_aliases file
I use these aliases throughout the setup because they're easier to type. For example, I use shls instead of sudo -u hdfs hadoop fs -ls, and hls instead of hadoop fs -ls.
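The exact file is in the linked post; as a rough sketch, the aliases used in the rest of this walkthrough amount to something like this in ~/.bash_aliases (the h* aliases run as your own user, the sh* aliases run as the hdfs superuser):
alias hls='hadoop fs -ls'
alias hcat='hadoop fs -cat'
alias hmkdir='hadoop fs -mkdir'
alias shls='sudo -u hdfs hadoop fs -ls'
alias shmkdir='sudo -u hdfs hadoop fs -mkdir'
alias shchmod='sudo -u hdfs hadoop fs -chmod'
alias shchown='sudo -u hdfs hadoop fs -chown'
Open a new terminal (or run source ~/.bash_aliases) so the aliases take effect.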
Create the HDFS /tmp directory
You don’t need to remove an old /tmp directory if this is the first time you’re installing Hadoop, but I’ll include the command here for completeness.
sudo -u hdfs hadoop fs -rm -r /tmp
Let’s create a new /tmp directory in HDFS.
shmkdir /tmp
Next, we’ll update the permissions on /tmp in HDFS.
shchmod -R 1777 /tmp
Create a user directory
Since this is a setup for development, we will only create one user directory. However, for a cluster or multi-user environment, you should create one user directory per MapReduce user.
Change akbar below to your username.
shmkdir /user/akbar
shchown akbar:akbar /user/akbar
Create the /var/log/hadoop-yarn directory
shmkdir /var/log/hadoop-yarn
shchown yarn:mapred /var/log/hadoop-yarn
Create the staging directory
The Hadoop fs -mkdir command creates missing parent directories by default (like mkdir -p).
shmkdir /tmp/hadoop-yarn/staging
shchmod -R 1777 /tmp/hadoop-yarn/staging
Create the done_intermediate directory
shmkdir /tmp/hadoop-yarn/staging/history/done_intermediate
shchmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate
shchown -R mapred:mapred /tmp/hadoop-yarn/staging
Verify that the directory structure is set up correctly
shls -R /
The output should look like:
drwxrwxrwt - hdfs supergroup 0 2012-06-25 15:11 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-06-25 15:11 /tmp/hadoop-yarn
drwxrwxrwt - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging
drwxr-xr-x - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging/history
drwxrwxrwt - mapred mapred 0 2012-06-25 15:51 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x - hdfs supergroup 0 2012-06-25 15:09 /user
drwxr-xr-x - akbar akbar 0 2012-06-25 15:09 /user/akbar
drwxr-xr-x - hdfs supergroup 0 2012-06-25 13:42 /var
drwxr-xr-x - hdfs supergroup 0 2012-06-25 13:42 /var/log
drwxr-xr-x - yarn mapred 0 2012-06-25 13:42 /var/log/hadoop-yarn
Start YARN
sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-nodemanager start
sudo service hadoop-mapreduce-historyserver start
I get the following error message when I start the MR History Server:
chown: changing ownership of `/var/log/hadoop-mapreduce': Operation not permitted
However, this error is not significant and can be ignored. It’ll likely be fixed in an update to CDH4/Hadoop.
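If you'd rather silence the message, a possible workaround (my assumption, not from the Cloudera guide) is to pre-create the local log directory with the owner the history server appears to expect:
sudo mkdir -p /var/log/hadoop-mapreduce
sudo chown mapred:mapred /var/log/hadoop-mapreduce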
Run an example application with YARN
Important
We are going to run the sample YARN app as our regular user, so we’ll use the Hadoop aliases, such as hls, and not the sudo Hadoop aliases, such as shls.
In the first command we specify only the relative directory name input; you'll notice that Hadoop creates it under our HDFS user directory as /user/akbar/input.
hmkdir input
Let’s view our new directory in HDFS.
hls
Or, you can optionally specify the complete path.
hls /user/akbar
Next, we’ll put some files into the HDFS /user/akbar/input directory.
hadoop fs -put /etc/hadoop/conf/*.xml input
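To confirm the files arrived in HDFS, list the input directory with the hls alias:
hls input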
Set the HADOOP_MAPRED_HOME environment variable for the current shell session.
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
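The export only lasts for the current session; if you want it set every time, one option (my addition, not from the guide) is to append it to your ~/.bashrc:
echo 'export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce' >> ~/.bashrc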
Next, we’ll run the sample job, which is a simple grep of the files in the input directory, writing the results to the output directory. It’s worth noting that the .jar file is located on the local (physical) filesystem, while the input and output directories are in the HDFS filesystem.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
grep input output 'dfs[a-z.]+'
hls
hls output
hcat output/part-r-00000
If you have a longer file to view, piping the output into less is helpful:
hcat output/part-r-00000 | less
How to start the various Hadoop services
The following are some additional notes I took on starting the various Hadoop services (in the correct order).
Start the Hadoop namenode
sudo service hadoop-hdfs-namenode start
Start the Hadoop datanode service
sudo service hadoop-hdfs-datanode start
Start the Hadoop secondarynamenode service
sudo service hadoop-hdfs-secondarynamenode start
Start the Hadoop resourcemanager service
sudo service hadoop-yarn-resourcemanager start
Start the Hadoop nodemanager service
sudo service hadoop-yarn-nodemanager start
Start the Hadoop historyserver service
sudo service hadoop-mapreduce-historyserver start
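If you'd rather start everything with a single command, here's a sketch in the same style as the hadoop-hdfs-start script above (my addition; the service names are the ones listed in this section, in dependency order):
#!/bin/bash
# Start HDFS first, then YARN, then the MapReduce history server.
for service in hadoop-hdfs-namenode hadoop-hdfs-datanode hadoop-hdfs-secondarynamenode \
               hadoop-yarn-resourcemanager hadoop-yarn-nodemanager hadoop-mapreduce-historyserver
do
  sudo service $service start
done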