All posts filed under: Hadoop

Install Zookeeper for Cloudera Hadoop (CDH4) in Pseudo mode on Ubuntu 12.04 LTS

3 comments
Hadoop

Introduction Zookeeper provides cluster management for Hadoop. In this post, I’m going to install Zookeeper in Pseudo mode, so please use these instructions for setting up a developer’s workstation, not for a production cluster. Installation The zookeeper package should already be installed, but we’ll double check. sudo apt-get install zookeeper Next, we’ll install the Zookeeper Server. sudo apt-get install zookeeper-server The following files are now installed: /etc/zookeeper/conf/zoo.cfg: Zookeeper configuration file sudo service zookeeper-server stop sudo […]
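
The full post walks through each step; as a rough sketch of the pseudo-mode setup (assuming the CDH4 zookeeper-server package and its default zoo.cfg), the commands look roughly like this:

    # Install the ZooKeeper base package and the server
    sudo apt-get install zookeeper
    sudo apt-get install zookeeper-server

    # Initialize the data directory, then start the server
    sudo service zookeeper-server init
    sudo service zookeeper-server start

    # Quick health check: the server answers "imok" if it is running
    echo ruok | nc localhost 2181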

HBase 0.92.1 Files and Directories (CDH4)

Leave a comment
Hadoop

Introduction You will need to know the location of binaries, configuration files, and libraries when working with HBase. Directories Configuration /etc/hbase/conf is the location for all of HBase’s configuration files. HBase uses Debian Alternatives, so there are a number of symlinks to the configuration files. /etc/hbase/conf is a symlink to /etc/alternatives/hbase-conf. /etc/alternatives/hbase-conf is a symlink to /etc/hbase/conf.dist Logs /var/log/hbase contains all of the HBase log files. Files Configuration Files The following configuration files are located […]
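
The symlink chain is easy to confirm from a shell; a quick check along these lines (the alternative name hbase-conf is taken from the path above):

    # /etc/hbase/conf -> /etc/alternatives/hbase-conf -> /etc/hbase/conf.dist
    ls -l /etc/hbase/conf
    ls -l /etc/alternatives/hbase-conf

    # Or ask the Debian Alternatives system directly
    update-alternatives --display hbase-conf

    # Log files live here
    ls /var/log/hbase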

Zookeeper 3.4.3 Files and Directories (CDH4)

Leave a comment
Hadoop

Introduction You will need to know the location of binaries, configuration files, and libraries when working with Zookeeper. Zookeeper 3.4.3 is a part of Cloudera Distribution Hadoop (CDH4). Directories /etc/zookeeper/conf /etc/zookeeper/conf is the location for all of Zookeeper’s configuration files. Zookeeper uses Debian Alternatives, so there are a number of symlinks to the configuration files. /etc/zookeeper/conf is a symlink to /etc/alternatives/zookeeper-conf. /etc/alternatives/zookeeper-conf is a symlink to /etc/zookeeper/conf.dist Files Configuration Files The following configuration files are […]
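
The same kind of check works for Zookeeper; a minimal sketch, assuming the alternative is registered as zookeeper-conf as the symlink suggests:

    # Resolve the full chain in one step: should end at /etc/zookeeper/conf.dist
    readlink -f /etc/zookeeper/conf
    update-alternatives --display zookeeper-conf

    # The main server configuration file
    cat /etc/zookeeper/conf/zoo.cfg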

Find and Replace Text with sed

1 comment
Hadoop / Linux

Introduction sed provides a quick and easy way to find and replace text via its search command (‘s’). Sample File Copy and paste the following text into a file named practice01.txt. Author: Akbar S. Ahmed Date: July 1, 2012 Subject: Sed sed is an extremely useful Unix/Linux/*nix utility that allows you to manipulate a text stream. It is useful when working with Hadoop, as sed is often used to manipulate text prior to MapReduce. sed […]
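
To give a taste of the search command before reading the full post, here are two substitutions run against practice01.txt; the replacement strings are only illustrations, not taken from the post:

    # Replace the first occurrence of "Unix" on each line and print the result to stdout
    sed 's/Unix/UNIX/' practice01.txt

    # The g flag replaces every occurrence on a line; -i.bak edits the file in place and keeps a backup
    sed -i.bak 's/extremely useful/very useful/g' practice01.txt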

What is sed?

Leave a comment
Hadoop / Linux

Introduction sed is short for Stream EDitor, which is a utility that allows you to parse and transform text one line at a time. sed is a useful tool, along with grep and awk, when manipulating text files. It is also often overlooked when working with Hadoop, although the use of sed, awk and grep can help speed up processing times by preprocessing text before sending it to a MapReduce job.
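
A small pipeline shows the kind of preprocessing meant here; the file names and pattern below are hypothetical:

    # Drop comment lines, squeeze repeated whitespace, then push the cleaned file into HDFS for MapReduce
    grep -v '^#' raw_events.log | sed 's/[[:space:]]\+/ /g' > cleaned_events.log
    hadoop fs -put cleaned_events.log /user/$USER/input/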

Change the Hadoop MapReduce v2 (YARN) ShuffleHandler Port

1 comment
Hadoop

Introduction If you are running Hadoop on a development machine, then it’s likely that you’ll run into a situation where multiple services require port 8080. I recently ran into this issue where both the Pentaho User Console and the Hadoop MapReduce ShuffleHandler were trying to use port 8080. One solution is to change the port used by the Hadoop MapReduce ShuffleHandler, which is what I’m going to configure below. Configuration sudo vi /etc/hadoop/conf/mapred-site.xml Add the […]
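
In MRv2 of that era the shuffle port is typically controlled by the mapreduce.shuffle.port property; a rough sketch of the kind of block that gets added to mapred-site.xml (8081 here is just an example value, not necessarily the one chosen in the post):

    <!-- Move the ShuffleHandler off port 8080 so the other service can keep it -->
    <property>
      <name>mapreduce.shuffle.port</name>
      <value>8081</value>
    </property>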

Install Sqoop 1.4.1 for Cloudera Hadoop (CDH4) on Ubuntu 12.04 LTS

1 comment
Hadoop

Introduction Sqoop is a tool to import data from an SQL database into Hadoop and/or export data from Hadoop into an SQL database. Sqoop can import/export from HDFS, HBase and Hive. It’s extremely common to use SQL databases as part of the setup for Hadoop. Often, an SQL database will serve as an upstream datasource, such as a persistence layer for an MQ server, and as a downstream repository, such as a datamart in […]
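
A typical import gives a feel for the tool; the MySQL host, database, table, and username below are placeholders, not values from the post:

    # Pull one table from MySQL into HDFS with 4 parallel map tasks (-P prompts for the password)
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username reporting -P \
      --table orders \
      --target-dir /user/$USER/sqoop/orders \
      -m 4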

Install Cloudera Hadoop (CDH4) with YARN (MRv2) in Pseudo mode on Ubuntu 12.04 LTS

25 comments
Hadoop

Introduction These instructions cover a manual installation of the Cloudera CDH4 packages on Ubuntu 12.04 LTS and follow the Cloudera CDH4 Quick Start Guide (CDH4_Quick_Start_Guide_4.0.0.pdf). Installation prerequisites sudo apt-get install curl Verify that Java is installed correctly First, check that Java is set up correctly for your account. echo $JAVA_HOME The output should be: "/usr/lib/jvm/jdk1.6.0_31" Next, check that the JAVA_HOME environment variable is set up correctly for the sudo user. sudo env | […]
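
The sudo check is the part that usually surprises people, since sudo often strips JAVA_HOME from the environment. A minimal sketch of the check plus one common workaround (defining the variable system-wide); the JDK path matches the one quoted above, and this is only one approach, not necessarily the one the post uses:

    # Check JAVA_HOME for your own shell, then for the sudo environment
    echo $JAVA_HOME
    sudo env | grep JAVA_HOME

    # One common workaround: define JAVA_HOME system-wide, then log out and back in
    echo 'JAVA_HOME="/usr/lib/jvm/jdk1.6.0_31"' | sudo tee -a /etc/environment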

Create a .bash_aliases file

3 comments
Hadoop / Linux / Pentaho

Introduction This is my personal .bash_aliases file that is mainly used for Cloudera CDH4 (Hadoop) and Pentaho. As a result, many of my aliases are specific to these software packages. I plan to update this post as my .bash_aliases file expands. I will also push my .bash_aliases file into Git to make it easier to keep up with changes to the file. How to create a .bash_aliases file vi ~/.bash_aliases Paste the following into the […]
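
For anyone who has not used one before, here is a minimal sketch of what a Hadoop-oriented .bash_aliases can look like; the aliases below are illustrative stand-ins, not the actual list from the post:

    # ~/.bash_aliases -- sourced by Ubuntu's default ~/.bashrc if it exists
    alias hfs='hadoop fs'            # shorten HDFS commands, e.g. hfs -ls /
    alias hlogs='cd /var/log/hadoop' # jump to the Hadoop log directory (path may vary by package)
    alias zkcli='zookeeper-client'   # open the ZooKeeper command-line shell (CDH package name assumed)

    # Reload the shell configuration after editing
    source ~/.bashrc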