Analyze Big Data with Apache Hadoop on Windows Azure Preview Service Update 3

Hadoop and MapReduce have good prospects for adoption as a standard for big data analysis, especially since Microsoft embraced them. They are ideal for cloud usage, because one can spin up nodes when required and pay for storage and compute services only while they are running. Roger Jennings describes how to get Hadoop running on Azure.

Big data is receiving more than its share of coverage by industry pundits these days, while startups offering custom Hadoop and related open-source distributions are garnering hundreds of millions of venture capital dollars. The term Hadoop usually refers to the combination of the open-source, Java-based Hadoop Common project and its related MapReduce and Hadoop Distributed File System (HDFS) subprojects managed by the Apache Software Foundation (ASF). Amazon Web Services announced the availability of an Elastic MapReduce beta version for EC2 and S3 on April 2, 2009. Google obtained US Patent 7,650,331 entitled “System and method for efficient large-scale data processing,” which covers MapReduce, on January 19, 2010 and subsequently licensed the patent to ASF.

Microsoft put its oar in the big-data waters when Ted Kummert, Corporate VP of Microsoft’s Business Platform Division, announced a partnership with Hortonworks to deliver Apache Hadoop distributions for Windows Azure and Windows Server at the Professional Association for SQL Server (PASS) Summit 2011 on October 12, 2011. (Hortonworks is a Yahoo! spinoff and one of the “big two” in Hadoop distribution, tools and training; Cloudera is today’s undisputed leader in that market segment.) Despite an estimated 58% compound annual growth rate (CAGR) for commercial Hadoop services, reaching US$2.2 billion in 2018, Gartner placed MapReduce well into the “Trough of Disillusionment” in its July 2012 Hype Cycle for Big Data chart (see Figure 1).


Figure 1. Gartner’s Hype Cycle for Big Data chart

The Hype Cycle for Big Data chart places MapReduce and Alternatives in the Trough of Disillusionment with two to five years to reach the Plateau of Productivity despite rosy estimates for a 58% CAGR and US$2.2 billion of Hadoop service income by 2018. GigaOm’s Sector RoadMap: Hadoop platforms 2012 research report describes a lively Hadoop platform market. I believe Gartner is overly pessimistic about Hadoop’s current and short-term future prospects.

The SQL Server Team released an Apache Hadoop for Windows Azure (AH4WA) CTP to a limited number of prospective customers on December 14, 2011. I signed up shortly after the Web release but wasn’t admitted to the CTP until April 1, 2012; my first blog post about the CTP was Introducing Apache Hadoop Services for Windows Azure of April 2. Service Update 2 (SU2) of June 29, 2012 added new features, updated component versions (see Table 1) and increased cluster capacity by 2.5 times, which broke the logjam of users waiting to be onboarded to the CTP. Service Update 3 (SU3) of August 21, 2012 added REST APIs for Hadoop job submission, progress inquiry and killing jobs, as well as a C# SDK v1.0, PowerShell cmdlets, and direct browser access to a cluster.

Table 1. The components and versions of the SQL Server team’s Apache Hadoop on Windows Azure CTP service updates of June 29, 2012 (SU2) and August 21, 2012 (SU3). Component descriptions are from the Apache Software Foundation, except CMU Pegasus.

| Component | Description | SU2 Version | SU3 Version |
|---|---|---|---|
| Hadoop Common | The common utilities that support other Hadoop subprojects | 1.0.1 | 1.0.1 |
| Hadoop MapReduce | A software framework for distributed processing of large data sets on compute clusters | 1.0.1 | 1.0.1 |
| HDFS | Hadoop Distributed File System, the primary storage system used by Hadoop applications | 1.0.1 | 1.0.1 |
| Hive | A data warehouse infrastructure that provides data summarization and ad hoc querying | 0.7.1 | 0.8.1 |
| Pig | A high-level data-flow language and execution framework for parallel computation | 0.8.1 | 0.9.3 |
| Mahout | A scalable machine learning and data mining library | 0.5 | 0.5 |
| Sqoop | A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores, such as relational databases | 1.3.1 | 1.4.2 |
| CMU Pegasus | A graph mining package to compute degree distribution of a sample graph* | 2 | 2 |

*Pegasus is an open-source graph mining package developed by people from the School of Computer Science (SCS), Carnegie Mellon University (CMU).

Apache Hadoop for Windows Azure Basics

Cloud computing is a natural fit for big data analysis with Hadoop MapReduce and its subprojects because users can spin up nodes when required, pay only for storage and compute services while running, and shut the nodes down when the process completes. AH4WA runs MapReduce in Windows Azure’s Java virtual machine (JVM) to provide horizontal scalability and high reliability. MapReduce is a batch programming model with a procedural language for mapping (dividing) input data into multiple sub-problems and distributing the sub-problems to worker nodes, which run in parallel under HDFS on clusters of many computers. For example, according to Hortonworks CEO Eric Baldeschwieler, Yahoo! had in mid-2011 a production Hadoop cluster with 42,000 nodes holding 180 to 200 petabytes of data. Facebook’s Jay Parikh claimed in August 2012 that his company runs the world’s largest Hadoop cluster, which spans more than 100 petabytes of data and analyzes about 105 terabytes every 30 minutes.

The map() function returns a list of key/value pairs. The reduce() function collects the list and aggregates (summarizes) the values in an HDFS text file. Hadoop excels at processing unstructured text files. For example, following is the Java code for the widely distributed, canonical WordCount MapReduce sample program, which counts the number of occurrences of each word in a set of documents (courtesy of Wikipedia):
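The listing itself is missing from this copy of the article, so the version below is the canonical WordCount example from the Apache Hadoop 1.x tutorial (the same code Wikipedia reproduces), written against the org.apache.hadoop.mapreduce API that ships with the CTP’s Hadoop 1.0.1:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The map() step: emit (word, 1) for every word in an input line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // The reduce() step: sum the counts emitted for each distinct word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");       // Job.getInstance() arrived in later Hadoop versions
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combine counts locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The mapper emits a (word, 1) pair for every token; because the reducer is also registered as a combiner, partial sums are computed on each worker node before results are shuffled to the reducers.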

AH4WA’s initial release offered users a choice of four no-charge cluster sizes, ranging from small (4 nodes with 2 TB disk space) to extra-large (32 nodes with 16 TB disk space), hosted in Microsoft’s North Central US (Chicago) data center. Cluster lifetime was 48 hours, and users could renew a cluster during the last few hours of its life. SU2 provided only a two-node cluster with 1 TB of disk space to accommodate more users. SU3 offers three nodes and 1.5 TB of disk space with a five-day, non-renewable lifetime at no charge (see Figure 2). However, users can request more cluster nodes and disk space with an indefinite lifetime by paying standard Windows Azure compute and storage charges.


Figure 2. Signing up for a new free Hadoop cluster.

You can sign up for a free Hadoop cluster by specifying a unique DNS prefix name and a login username/password combination, and then clicking the Request Cluster button. Click the Downloads tile for links to SU3’s new Hadoop C# SDK and PowerShell cmdlets, as well as the previously available 32/64-bit ODBC drivers and Excel add-in for analyzing Apache Hive data.

AH4WA SU3’s Management Dashboard

After you’ve created your cluster, you can open the landing page of SU3’s new URL-enabled management dashboard by typing the cluster’s URL (based on the DNS prefix name you chose) in your browser’s address bar. Ignore the certificate error you receive from SU3’s first release by clicking the Continue to This Website button; the team promises to fix this issue shortly. Then, enter the username and password credentials you specified in Figure 2 to launch the modern UI (née Metro) style RESTful dashboard’s landing page (see Figure 3).


Figure 3. Click a tile to open a page to manage each feature set of AH4WA, review job history or create a new MapReduce job. SU2 added five new samples to the original four and support for Apache Sqoop.

One of the AH4WA team’s goals was to make Hadoop and MapReduce technologies more accessible to developers and users who aren’t data scientists or business intelligence/data warehousing experts. AH4WA’s Interactive Console makes JavaScript a first-class language for running MapReduce jobs. The console also enables writing ad hoc Apache Hive queries in a text box and executing them interactively (see Figure 4). Hive translates familiar SQL-like aggregate queries written in HiveQL to MapReduce instructions and executes them automatically. An ODBC Server and Hive Add-In for Excel combine with Excel PowerPivot to provide do-it-yourself big data analysis for end users. My Using Excel 2010 and the Hive ODBC Driver to Visualize Hive Data Sources in Apache Hadoop on Windows Azure post is a detailed tutorial for this feature.


Figure 4. Interactive Hive

Write a HiveQL statement in the Interactive Hive console’s text box and click Evaluate to execute it, with MapReduce if necessary. The HiveQL describe keyword returns a list of column names and data types for a Hive table, the prebuilt hivesampletable for this example.
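As an illustration of both statements, and assuming devicemake is one of the columns the preview ships in the prebuilt hivesampletable, a session in the console might look like the following; Hive compiles the GROUP BY into a MapReduce job behind the scenes:

```sql
-- Show the sample table's column names and data types
describe hivesampletable;

-- Count rows per device manufacturer; devicemake is assumed to be
-- a column of the prebuilt hivesampletable
SELECT devicemake, COUNT(*) AS device_count
FROM hivesampletable
GROUP BY devicemake
LIMIT 10;
```

The describe statement returns immediately from the Hive metastore, while the aggregate query is the kind of statement Hive translates into a MapReduce job before returning results to the console.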

The most common data sources for MapReduce analysis are structured or unstructured text files. Hive requires structured data with a schema, such as Web-server log files or the U.S. Federal Aviation Administration’s air carrier flight delay data. The Manage Cluster page offers disaster protection with optional periodic data snapshots, and lets you choose data stored in Amazon S3 or Windows Azure blobs, or downloaded from the Windows Azure Marketplace DataMarket instead of on-premises text files (see Figure 5).


Figure 5. Configure cloud data from Windows Azure or Amazon Web Services as a MapReduce or Hive data source on the Configure Cluster page. (ASV is an abbreviation for “Azure Storage Vault.”) SU3 didn’t fix the problem with accessing public Windows Azure blobs caused by requiring a Passkey to enable the Save Settings button.

AH4WA SU3 didn’t add new samples to SU2’s nine, despite a “more coming soon” promise (see Figure 6). My detailed tutorial for running the TeraGen, TeraSort and TeraValidate jobs of the 10 GB GraySort sample is here.


Figure 6. The Sample Gallery

This page offers classic MapReduce demo jobs, such as GraySort, Pi Estimator, and Word Count, as well as Mahout, Pegasus, and Sqoop projects. The C# Streaming example shows you how to execute a C# application using the Hadoop streaming interface v1.0.3.

SU3’s C# SDK 1.0 for Interacting with AH4WA Clusters Programmatically

Brad Sarsfield of the AH4WA team updated a prototype HadoopOnAzureJobClient .NET library for SU2 on July 16, 2012; it used a REST API “to submit jobs against the Hadoop on Azure free preview” and was available for download. SU3 supersedes this library with the Microsoft.Isotope.ClusterService and Microsoft.Isotope.Sdk namespaces, contained in an archive you can download from the link on the Downloads page. Expand the files, which depend on .NET Framework 4, and add references to the DLLs in a new C# project.

The SDK’s Readme.txt file provides terse documentation for nine operations:

1. Get the status of a running job

2. Kill a running job

3. Get the IDs of all running jobs on the cluster

4. Get the history of all jobs that have completed

5. Get the job history for a specific job

6. Upload a file to HDFS (job-related assets)

7. List files in HDFS

8. Get all MapReduce jobs that have completed

9. Get the details of a single MapReduce job

The library also includes six instructions to Get, Set and Delete credentials for ASV and S3 data sources.
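The C# SDK wraps the REST endpoints that SU3 introduced, but this article doesn’t enumerate those endpoints. Purely as a rough sketch of the pattern involved, the following Java fragment shows how a job-status call over HTTPS with HTTP Basic authentication could be assembled; the /api/jobs/{id}/status path, the .cloudapp.net host suffix and the response shape are hypothetical placeholders, not the documented API:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class JobStatusSketch {

    // HTTP Basic Authorization header built from the cluster's login credentials
    static String buildAuthHeader(String user, String password) {
        String pair = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }

    // Hypothetical job-status URL; the real SU3 REST paths are not listed in this article
    static String buildStatusUrl(String dnsPrefix, String jobId) {
        return "https://" + dnsPrefix + ".cloudapp.net/api/jobs/" + jobId + "/status";
    }

    // Issue the GET and return the raw response body (presumably JSON)
    static String fetchStatus(String url, String user, String password) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Authorization", buildAuthHeader(user, password));
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }
}
```

The credentials are the same username/password pair specified when the cluster was requested; the SDK’s DLLs handle this plumbing for .NET callers.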

With its third Service Update, AH4WA has gained the maturity needed to compete with other cloud-based Hadoop implementations, such as Amazon’s Elastic MapReduce. The CTP now delivers bonuses in the form of an SDK for .NET developers, an intuitive, browser-based Web UI for non-programmers, and an Excel Hive add-in for do-it-yourself business analysts.