hdfs dfsadmin -report

Introduction

hdfs dfsadmin -report outputs a brief report on the overall HDFS filesystem. It’s a useful command for quickly checking how much disk space is available, how many datanodes are running, and so on.

Command

Run the command prefixed with sudo -u hdfs to avoid a permission denied error. CDH4 runs the namenode as the hdfs user by default; however, if you have changed that, substitute the user your namenode runs as.

sudo -u hdfs hdfs dfsadmin -report

You will see output similar to:


Configured Capacity: 247241674752 (230.26 GB)
Present Capacity: 221027041280 (205.85 GB)
DFS Remaining: 221026717696 (205.85 GB)
DFS Used: 323584 (316 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)

Live datanodes:
Name: 127.0.0.1:50010 (localhost)
Hostname: freshstart
Decommission Status : Normal
Configured Capacity: 247241674752 (230.26 GB)
DFS Used: 323584 (316 KB)
Non DFS Used: 26214633472 (24.41 GB)
DFS Remaining: 221026717696 (205.85 GB)
DFS Used%: 0%
DFS Remaining%: 89.4%
Last contact: Sat Jul 14 18:07:18 PDT 2012
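
If you only need one metric from the report, you can pipe it through grep. For example, to see the remaining DFS capacity:

sudo -u hdfs hdfs dfsadmin -report | grep 'DFS Remaining'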

Deprecated Command

hadoop dfsadmin -report is a deprecated command. If you enter hadoop dfsadmin -report, you will see the report with the following note at the top of the output.


DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.


Saying bye to Google Wallet

I’ve been “trying” to use Google Wallet for a couple of weeks now, and I’ve pretty much given up. When it works it’s great…but that’s the problem. You never know when it’ll actually work.

First, it’s anybody’s guess if an in-store reader will even work. The same reader may work in the morning, then completely fail to work in the evening. If you go to 2 different Peet’s locations then you may be able to buy a coffee with Google Wallet at one, and then you’ll have to pull out your credit card at the other.

Second, the Google Wallet app fails to open on the Galaxy Nexus from time to time, which is always fun when you have a long line behind you (solution: put the phone away and pull out a credit card).

Google is in the habit of releasing buggy software and then iterating quickly. But Wallet is different: this is money they’re working with, and errors are not acceptable. Wallet is a great idea that’s poorly implemented.

With that said, I bid adieu to Wallet and start my wait for someone else to release a better digital credit card app.

Install HBase 0.92.1 for Cloudera Hadoop (CDH4) in Pseudo mode on Ubuntu 12.04 LTS

Introduction

HBase is a column-oriented database that runs on top of HDFS. It is modeled on Google’s BigTable.

In this post, I’m going to install HBase in Pseudo mode, so please use these instructions for setting up a developer’s workstation, not for a production cluster.

When should you use HBase

HBase should be used when you need random read/write access to the data in Hadoop. While HBase gives you random seeks, it does so at the expense of performance vs. HDFS. Therefore, it is important to look at your workload and pick the correct solution for your specific requirements.

Install Zookeeper

Install Zookeeper before installing HBase.

Install Prerequisites

sudo apt-get install ntp libopts25

Installation

sudo apt-get install hbase

Let’s see what files were installed. I have written an HBase Files and Directories post that contains more information about what’s installed with the hbase package.

dpkg -L hbase | less

Then install the HBase Master package.

sudo apt-get install hbase-master

Next, we’ll stop the HBase Master.

sudo service hbase-master stop

Configure HBase to run in pseudo mode

Let’s check the hostname and port used by the HDFS Name Node.

grep -A 1 fs.default.name /etc/hadoop/conf.pseudo/core-site.xml | grep value

You should see output of:
<value>hdfs://localhost:8020</value>

cd /etc/hbase/conf; ls -l
sudo vi hbase-site.xml

Paste the following into hbase-site.xml, between <configuration> and </configuration>.


  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8020/hbase</value>
  </property>

Add the /hbase directory to HDFS

Important
The following commands assume that you’ve followed the instructions in my post on how to Create a .bash_aliases file.

shmkdir /hbase
shchown hbase /hbase

Let’s check that the /hbase directory was created correctly in HDFS.

hls /

You should see output that includes a line for the /hbase directory.
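
If you haven’t created those aliases, the equivalent full commands are most likely the following (this assumes the aliases simply wrap hadoop fs commands run as the hdfs user; check your .bash_aliases for the exact definitions):

sudo -u hdfs hadoop fs -mkdir /hbase
sudo -u hdfs hadoop fs -chown hbase /hbase
sudo -u hdfs hadoop fs -ls /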

Start the HBase Master

sudo service hbase-master start

Install an HBase Region Server

The HBase Region Server is started automatically when you install it in Ubuntu.

sudo apt-get install hbase-regionserver

Check that HBase is Setup Correctly

sudo /usr/lib/jvm/jdk1.6.0_31/bin/jps

You should see output similar to the following (look for QuorumPeerMain, NameNode, DataNode, HRegionServer, and HMaster):


1942   SecondaryNameNode
12783  QuorumPeerMain
1747   NameNode
1171   DataNode
15034  HRegionServer
14755  HMaster
2396   NodeManager
2497   ResourceManager
2152   JobHistoryServer
15441  Jps

Open http://localhost:60010 in a web browser to verify that the HBase Master was installed correctly.

If everything installed correctly then you should see the following:

  • In the Region Servers section, there should be one line for localhost.
  • In the Attributes section, you should see HBase Version = 0.92.1-cdh4.0.0.

Add the JDK 1.6.0 u31 Path to BigTop

This update is required because BigTop searches a fixed list of candidate directories when detecting JAVA_HOME.

sudo vi /usr/lib/bigtop-utils/bigtop-detect-javahome

Add the following line just below the for candidate in \ line:

/usr/lib/jvm/jdk1.6.0_31 \
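
After the edit, the top of the list should look like this, with the existing candidate directories continuing below the new line:

for candidate in \
  /usr/lib/jvm/jdk1.6.0_31 \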

Update the hosts file

It’s likely that you’ll get an error caused by Ubuntu’s localhost loopback entry, which maps the machine’s hostname to 127.0.1.1.

Update the /etc/hosts file so that your hostname resolves to 127.0.0.1 instead (these notes were originally written while debugging HBase).
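
For example, on a machine named freshstart (the hostname from the dfsadmin report earlier; substitute your own), the relevant /etc/hosts lines would look something like this:

127.0.0.1   localhost
127.0.0.1   freshstart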

That’s it. You now have HBase installed and ready for use on a developer’s workstation/laptop.
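
As a quick smoke test, you can create and drop a throwaway table from the HBase shell (the table name ‘test’ and column family ‘cf’ below are arbitrary examples):

hbase shell
create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'
disable 'test'
drop 'test'
exit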

Additional Reading

There are some additional configuration options for HBase, including:

Install Zookeeper for Cloudera Hadoop (CDH4) in Pseudo mode on Ubuntu 12.04 LTS

Introduction

Zookeeper provides distributed coordination services (configuration, naming, and synchronization) for Hadoop components such as HBase.

In this post, I’m going to install Zookeeper in Pseudo mode, so please use these instructions for setting up a developer’s workstation, not for a production cluster.

Installation

The zookeeper package should already be installed, but we’ll double check.

sudo apt-get install zookeeper

Next, we’ll install the Zookeeper Server.

sudo apt-get install zookeeper-server

Among the files now installed is the main configuration file:
/etc/zookeeper/conf/zoo.cfg: the Zookeeper configuration file

Next, restart the Zookeeper Server.

sudo service zookeeper-server stop
sudo service zookeeper-server start

If you have installed Zookeeper before installing HBase, you will see the following error message:

Using config: /etc/zookeeper/conf/zoo.cfg
ZooKeeper data directory is missing at /var/lib/zookeeper fix the path or run initialize
invoke-rc.d: initscript zookeeper-server, action "start" failed.

If Zookeeper is installed before HBase, you need to initialize it before it will start.

sudo service zookeeper-server init

Now you can start Zookeeper.

sudo service zookeeper-server start
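
To confirm that Zookeeper is answering, send it the ‘ruok’ four-letter command on the default client port (2181); a healthy server responds with ‘imok’:

echo ruok | nc localhost 2181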

Additional Reading

HBase 0.92.1 Files and Directories (CDH4)

Introduction

You will need to know the location of binaries, configuration files, and libraries when working with HBase.

Directories

Configuration

/etc/hbase/conf is the location for all of HBase’s configuration files.

HBase uses Debian Alternatives, so there are a number of symlinks to the configuration files.

/etc/hbase/conf is a symlink to /etc/alternatives/hbase-conf.
/etc/alternatives/hbase-conf is a symlink to /etc/hbase/conf.dist.

Logs

/var/log/hbase contains all of the HBase log files.

Files

Configuration Files

The following configuration files are located in /etc/hbase/conf:

hadoop-metrics.properties

Controls how HBase emits its metrics, for example to local files or to a monitoring system such as Ganglia.

hbase-env.sh

Sets environment variables for the HBase daemons, such as JAVA_HOME and JVM heap options.

hbase-policy.xml

Defines service-level authorization policies, i.e. which users and groups may call which HBase protocols.

hbase-site.xml

The main HBase configuration file; site-specific overrides such as hbase.rootdir and hbase.cluster.distributed go here.

log4j.properties

Controls logging levels and log file layout for the HBase daemons.

regionservers

Lists the hosts that run Region Servers, one per line; used by the cluster start/stop scripts.

Zookeeper 3.4.3 Files and Directories (CDH4)

Introduction

You will need to know the location of binaries, configuration files, and libraries when working with Zookeeper.

Zookeeper 3.4.3 is a part of Cloudera Distribution Hadoop (CDH4).

Directories

/etc/zookeeper/conf

/etc/zookeeper/conf is the location for all of Zookeeper’s configuration files.

Zookeeper uses Debian Alternatives, so there are a number of symlinks to the configuration files.

/etc/zookeeper/conf is a symlink to /etc/alternatives/zookeeper-conf.
/etc/alternatives/zookeeper-conf is a symlink to /etc/zookeeper/conf.dist.

Files

Configuration Files

The following configuration files are located in /etc/zookeeper/conf:

configuration.xsl

log4j.properties

zoo.cfg

zoo.cfg is the main Zookeeper configuration file.

dataDir

dataDir specifies the directory where znode snapshot files and transaction logs are stored. These files are important as you will need them to recover data.

The files located in dataDir should be backed up regularly.
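
In the CDH4 packaging, dataDir points at /var/lib/zookeeper (the same path referenced by the init error shown earlier); the relevant zoo.cfg line looks like this:

dataDir=/var/lib/zookeeper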

zoo_sample.cfg

A sample configuration file. One of the more interesting notes is about the autopurge.snapRetainCount configuration variable (http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance).
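
For example, to have Zookeeper purge old snapshots and transaction logs itself, you can add the following to zoo.cfg (retaining 3 snapshots and purging once an hour; tune both values to your own retention needs):

autopurge.snapRetainCount=3
autopurge.purgeInterval=1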

Init Files

zookeeper-server

Use the init script to start, stop, restart, check the status of, and initialize Zookeeper.

Binaries and Scripts

/usr/lib/zookeeper/bin/zkCleanup.sh

Script that cleans up the files created in dataDir. This script should be modified per installation and should be added to cron for periodic cleanup.
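
For example, a nightly root crontab entry might look something like the following (the -n 3 retain count is an assumption; check the script’s usage output for the exact arguments your version expects):

0 3 * * * /usr/lib/zookeeper/bin/zkCleanup.sh -n 3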

Google Drive

Introduction

Google Drive is a combination of cloud storage and local disk synchronization services.

Google Drive allows you to use Google’s servers as your primary file storage, and then sync those files to one or more devices. Supported devices include Windows (XP, Vista, 7), Mac OS X, iPad, iPhone, and Android (tablets and phones).

Google Drive differs from sharing oriented services where the file is uploaded to a server to be shared (like a browser based FTP service such as Box). It also differs from pure sync services where the file is never actually stored on a server, but is simply synchronized between multiple devices.

What’s good about Google Drive?

In 2 words: Price, Google.

As of July 5, 2012, Google Drive has lower prices than many of the alternatives. Also, if you’re a heavy user of Google’s other services (like me), then the integration with those services is excellent. Personally, I find emailing a link to a file in Google Drive to be a far better experience than attaching a large file to an email.

While the “what’s good” section is small, the short version is that Google Drive does what it is supposed to do. It syncs your files to Google’s servers and keeps your devices up to date.

What’s not so good about Google Drive?

The upload problem

I have been using Google Drive since it was first released, and it’s still uploading files. Unless you have huge upload bandwidth, Google Drive suffers from the same problem as all cloud-based storage services: how do you get 100 GB, 200 GB, or more into the cloud? Unfortunately, we live in a download world, and most people have faster download speeds than upload speeds.

Bugs and other functionality failures

I have used Google Drive on a number of different platforms, including 2 Windows XP machines, 1 Windows 7 machine, and 2 Android phones (Galaxy Nexus and Droid Razr Maxx).

Of all the platforms, Windows 7 has been the least buggy, but it’s still not perfect.

Windows Explorer crashes (consistently) on Windows XP when you browse the Google Drive folder. I’m not the only one to experience this problem: http://productforums.google.com/forum/#!topic/drive/fBaZxY5QUBc

Google Drive is a bandwidth hog

Google Drive uses all of the bandwidth that your laptop/desktop has access to. If you only have 1 machine on a home network and you’re syncing a big file, say a 1 GB home video, then you’ll notice that other web pages open slowly. Of course, if you have 2 or more people on a home or small-business network, then the other users will complain that their Internet access is slow.

To date, Google Drive is not allowed on our corporate network, and I agree with the policy. I’d hate to see what happens if 10, 20 or more people are all syncing large files at the same time. While the bandwidth per client is controlled on a corporate network, it still means that each person with Google Drive will suffer with a slow Internet experience (until the sync is done).

Missing features

There are some important/useful features that are currently missing from Google Drive, including:

  • Manual sync: There is currently no way to force Google Drive to sync.
  • File size: Is that a 1MB file or a 1GB file? You’ll never know in Google Drive’s web interface, so have fun figuring out the download time. You’ll notice this problem when you sync your home movies.
  • Child folder sync: While you can use the Preferences to select which folders to sync, you can only select/deselect top level folders. There is no way to sync a subset of a folder.
  • List of completed/in process files: When you add files to Google Drive, you’ll see a status message of 1 of 125 files synced. However, there is no way to see what is done, and what is in process.
  • Bandwidth management: As mentioned elsewhere in this post, Google Drive will use all of the bandwidth that’s given to it. So, you’ll need to use your router to control how much bandwidth is made available to Drive.

Should you use Google Drive in your business?

The short answer is no. At least not yet.

If you have Windows XP, then you should wait until Google fixes the bugs with Windows Explorer. You don’t want users complaining that Windows Explorer crashes every time they view files in their Google Drive folder.

I’d also wait until Google adds bandwidth controls.

Lastly, local sync is a required feature for the business use case, where most users will be on the same network. (Dropbox has had this feature, called LAN Sync, for a while.)

Alternatives to Google Drive

There are a lot of alternatives when looking into cloud storage and synchronization services including:

Summary

Overall, Google Drive is a good solution at an attractive price. There is room for improvement, but if you’re a big Google user then it’s an easy winner.

Find and Replace Text with sed

Introduction

sed provides a quick and easy way to find and replace text via its substitute command (‘s’).

Sample File

Copy and paste the following text into a file named practice01.txt.


Author: Akbar S. Ahmed
Date: July 1, 2012
Subject: Sed

sed is an extremely useful Unix/Linux/*nix utility that allows you to manipulate a text stream. It is useful when working with Hadoop, as sed is often used to manipulate text prior to MapReduce.

sed practice

name Akbar
state California
state CA
OS Linux, OS X, Windows
blog http://akbarahmed.com

Substitution (Find and Replace)

The main sed command that you’ll use frequently is s, which stands for substitute.

Let’s start with a basic example.

Substitute Linux with Ubuntu

sed -i 's/Linux/Ubuntu/' practice01.txt

If you’re using a Mac, then you’ll need to adjust the command above to work with the BSD version of sed. The following form, which redirects the output to a new file instead of editing in place, works with both BSD and GNU sed (including Ubuntu).

sed 's/Linux/Ubuntu/' practice01.txt > practice01-output.txt
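
If you do want an in-place edit on a Mac, BSD sed requires an explicit (possibly empty) backup suffix after -i:

sed -i '' 's/Linux/Ubuntu/' practice01.txt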

Let’s check our work.

cat practice01-output.txt

It’s important to understand each component of a command, including the options. In our command above we used the following:

  • sed: This is the sed utility
  • -i: “In place”. -i means edit and save changes to the same file. In the two commands above, you’ll notice that we have to use > somefile to redirect the output when we don’t use -i.
  • s: Substitute. The first word (e.g. Linux) is the word we search for, and it is replaced with the second word (e.g. Ubuntu).

Substitute all instances of a word

By default, sed only replaces the first instance of a word on a given line.

Create a new file named practice02.txt by running the following command.

echo "sed is a stream editor. sed is a stream editor." > practice02.txt

Let’s begin by using the command we already learned to change ‘sed’ to ‘vi’.

sed 's/sed/vi/' practice02.txt > practice02-output.txt
cat practice02-output.txt

You should see output that looks like the following:
vi is a stream editor. sed is a stream editor.

Notice how only the first instance of ‘sed’ was changed to ‘vi’.

Let’s create a new practice file by running the following command. This time we’ll create 3 lines with the same text, and we’ll append a ‘cat’ command so that we can immediately see the contents of our file.

for i in 1 2 3; do echo "editorX is a stream editor. editorX is a stream editor." >> practice03.txt; done; cat practice03.txt

To make a global substitution (find and replace all), we need to add the ‘g’ flag to ‘s’.

sed 's/editorX/editorY/g' practice03.txt > practice03-output.txt
cat practice03-output.txt

Limiting which lines are edited

sed allows us to easily control which lines are edited. For example, if our data has a header row in the first row, then we can limit editing to only the first row.

sed '1s/editorX/myEditor/g' practice03.txt > practice03a-output.txt
cat practice03a-output.txt

Let’s now edit lines 2 to 3 only.

sed '2,3s/editorX/yourEditor/g' practice03.txt > practice03b-output.txt
cat practice03b-output.txt

Wrap every line in double quotes

This next command is important because it highlights the fact that you can use regular expressions with sed. In fact, the use of regex with sed gives you an extremely powerful tool for editing files.

sed 's/.*/"&"/g' practice03.txt > practice03c-output.txt
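
Checking the output confirms that each line is now wrapped in double quotes:

cat practice03c-output.txt

"editorX is a stream editor. editorX is a stream editor."
"editorX is a stream editor. editorX is a stream editor."
"editorX is a stream editor. editorX is a stream editor."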

While this post provides a quick intro to sed, it’s worth your while to learn it in detail, as sed is a core part of Linux’s text processing capabilities. Further, sed is an extremely useful tool for preprocessing files before submitting them to a MapReduce job in Hadoop.

What is sed?

Introduction

sed is short for Stream EDitor. It is a utility that allows you to parse and transform text one line at a time. sed is a useful tool, along with grep and awk, for manipulating text files. It is often overlooked when working with Hadoop, although using sed, awk, and grep to preprocess text before sending it to a MapReduce job can help speed up processing times.