Hadoop Distributions

The following is a repost of my answer to a question on LinkedIn, but I thought it may prove useful to people evaluating Hadoop distributions.

The following is a substantially oversimplified set of choices (in alphabetical order):

Amazon: Apache Hadoop provided as a web service. Good solution if your data is collected on Amazon…saves you the trouble of uploading gigs and gigs of data.

Apache: Apache Hadoop is the core code base on which the various distributions are built.

Cloudera: CDH3 is based on Hadoop 1 (the current stable version) and CDH4 is based on Hadoop 2. CDH is based on Apache Hadoop. The only piece that’s not open source (AFAIK) is Cloudera Manager, which is free for clusters of up to 50 nodes; beyond that you need the paid version. Cloudera is an extremely popular solution that runs on a wide variety of operating systems.

Hortonworks: HDP1 is 100% open source and is based on Hadoop 1. HDP is designed to run on RedHat/CentOS/Oracle Linux.

IBM: IBM BigInsights adds the GPFS filesystem to Hadoop, and is a good choice if your company is already an IBM shop and you need to integrate with other IBM solutions. A free version is available as InfoSphere BigInsights Basic Edition, which does not include all of the value-added features found in Enterprise Edition (such as GPFS-SNC).

MapR: MapR uses a proprietary file system plus additional changes to Hadoop that address issues with the platform. They have a shared nothing architecture for the NameNode and JobTracker. MapR M3 is available for free, while M5 is a paid version with more features (such as the shared nothing NameNode). People who have used MapR tend to like it.


Understanding the Hadoop High Availability (HA) Options

Once you start to use Hadoop in your day-to-day business operations, you’ll quickly find that uptime is an important consideration. No one wants to explain to the CEO why a report was not delivered. Most of Hadoop’s architecture is designed to keep working when individual nodes (such as DataNodes) fail, but other components, such as the NameNode, are single points of failure and must be configured with an HA option.

The following is a quick and dirty list of Hadoop HA options:

  • Cloudera CDH4 (free)
    • Uses shared storage
  • Hortonworks (free)
    • Option 1: Use Linux HA (Uses shared storage)
    • Option 2: Use VMware
  • IBM BigInsights ($$$)
    • GPFS-SNC: Provides a shared nothing HA option
  • MapR M5 ($$$)
    • Shared nothing HA for both NameNode and JobTracker
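To make the "uses shared storage" option a bit more concrete, here is a minimal sketch of the kind of hdfs-site.xml settings that Hadoop 2 style NameNode HA (the basis of CDH4) relies on. The property names are the standard Hadoop 2 HA keys; the helper functions, the nameservice name, the host names, and the NFS path are all made up for illustration, so check your distribution's documentation before using any of it. A real deployment also needs fencing (dfs.ha.fencing.methods) and failover settings, which are left out here.

```python
import xml.etree.ElementTree as ET


def ha_properties(nameservice, namenodes, shared_edits_dir):
    """Build hdfs-site.xml properties for shared-storage NameNode HA.

    namenodes maps a NameNode ID (e.g. "nn1") to its RPC address "host:port".
    All values passed in are hypothetical examples, not defaults.
    """
    props = {
        # Logical name that clients use instead of a single NameNode host.
        "dfs.nameservices": nameservice,
        # IDs of the NameNodes that make up the HA pair.
        "dfs.ha.namenodes." + nameservice: ",".join(namenodes),
        # Shared directory (e.g. an NFS mount) where the edit log is written.
        "dfs.namenode.shared.edits.dir": shared_edits_dir,
        # Client-side class that finds the currently active NameNode.
        "dfs.client.failover.proxy.provider." + nameservice:
            "org.apache.hadoop.hdfs.server.namenode.ha."
            "ConfiguredFailoverProxyProvider",
    }
    for nn_id, address in namenodes.items():
        props["dfs.namenode.rpc-address.%s.%s" % (nameservice, nn_id)] = address
    return props


def to_hdfs_site_xml(props):
    """Render a property dict as a <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical cluster: two NameNodes sharing an NFS-mounted edits directory.
    props = ha_properties(
        "mycluster",
        {"nn1": "nn1.example.com:8020", "nn2": "nn2.example.com:8020"},
        "file:///mnt/filer1/dfs/ha-name-dir-shared",
    )
    print(to_hdfs_site_xml(props))
```

The point to notice is dfs.namenode.shared.edits.dir: both NameNodes depend on that one shared location, which is exactly what the shared nothing options in the list above avoid.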

 

If you’re brave, you can also apply Facebook’s patches to Apache Hadoop to get an “Avatar”-based HA option. This is what Facebook uses in production.