Analyze Years of Air Carrier Flight Arrival Delays in Minutes with the Windows Azure HPC Scheduler

If you are seeking to analyse very large sets of data, and need a highly parallel, rapid way of doing it that scales to your requirements, then 'Cloud Numerics' from Microsoft may be the answer to your prayers.

The Microsoft Codename “Cloud Numerics” Lab from SQL Azure Labs provides a Visual Studio template to enable configuring, deploying and running numeric computing SaaS applications on High Performance Computing (HPC) clusters in Microsoft data centers.

Apache Hadoop, MapReduce, and other Hadoop subcomponents get most of the attention from big-data journalists, but there are many special-purpose alternatives that offer shortcuts to specific data analysis techniques. For example, Microsoft’s Codename “Cloud Numerics” Lab offers a prebuilt .NET runtime, as well as sets of common mathematical, statistical, and signal processing functions for execution on local or cloud-based Windows HPC Server 2008 R2 clusters (see Figure 1). Microsoft announced a preview version of the Windows Azure HPC Scheduler at the //BUILD/ conference in September 2011 and released the commercial version on November 11. The Windows Azure HPC Scheduler SDK, together with a set of sample applications for use with the Windows Azure SDK v1.6, followed on December 14, and the “Cloud Numerics” team released its Lab v0.1 on January 10, 2012.

Figure 1: The Microsoft “Cloud Numerics” components

The Microsoft “Cloud Numerics” Lab provides a Visual Studio 2010 C# project template and deployment utility, .NET 4 runtime, and .NET native and system libraries for numeric analysis in a single downloadable package. (Diagram based on Microsoft’s The “Cloud Numerics” Programming and runtime execution model documentation.)

Why Use the “Cloud Numerics” Lab and Windows Azure HPC Scheduler?

The “Cloud Numerics” Lab is designed for developers of computationally intensive applications that involve large datasets. It specifically targets projects that benefit from “bursting” computational work from desktop PCs to small-scale supercomputers running Windows HPC Server 2008 R2 in Microsoft data centers. Few small businesses or medium-size enterprises can afford the capital investment or administrative and operating costs of dedicated, on-premises HPC clusters that might be used for only a few hours or minutes per week. Although Microsoft’s new Apache Hadoop on Windows Azure developer preview offers similar advantages, Hadoop and its subcomponents are Java-centric. This means Microsoft shops can incur substantial training costs to bring their .NET developers up-to-speed with Hadoop, HDFS, MapReduce, Hive, Pig, Sqoop, and other open-source software.

Installing the “Cloud Numerics” package from Connect.Microsoft.com adds a customizable Microsoft Cloud Numerics Application C# template to Visual Studio 2010 that most .NET developers will be able to use out-of-the-box without a significant learning curve. The template generates a solution with these projects automatically:

  • AppConfigure – Publishes a fully-configured Windows HPC Server cluster to Windows Azure aided by a graphical Cloud Numerics Deployment Utility
  • HeadNode – Provides failover and mediates access to the cluster resources as the single point of management and job scheduling for the cluster (see Figure 2)
  • ComputeNode – Carries out the computation tasks assigned to it by the Scheduler
  • FrontEnd – Provides a Web-accessible endpoint and UI for the Windows Azure HPC Scheduler Web Portal, which displays job status and diagnostic messages
  • AzureSampleService – Defines roles for and number of HeadNode, ComputeNode, and FrontEnd instances; generates ServiceConfiguration.Local.cscfg, ServiceConfiguration.Cloud.cscfg, and ServiceDefinition.csdef files
  • MSCloudNumericsApp – Provides replaceable C# code stubs for specifying and reading data sources, defining computational functions, and determining output format and destination (usually a blob in Windows Azure storage)

Figure 2: The Microsoft Cloud Numerics Application template generates and deploys these components to local Windows HPC Server 2008 R2 runtime instances or Windows Azure HPC hosted services.

The template includes a simple MSCloudNumericsApp project for random-number arrays that developers can run locally to verify correct installation and, optionally, deploy with a Windows Azure HPC Scheduler to a service hosted in a Microsoft data center. My initial Introducing Microsoft Codename “Cloud Numerics” from SQL Azure Labs and Deploying “Cloud Numerics” Sample Applications to Windows Azure HPC Clusters tutorials describe these operations in detail.

Like Hadoop/MapReduce projects, “Cloud Numerics” apps are batch processes. They split data into chunks, run operations on the chunks in parallel and then combine the results when computation completes. “Cloud Numerics” apps operate on distributed numeric arrays, also called matrices. They primarily deliver aggregate numeric values, not documents, and aren’t intended to perform “selects” or other relational data operations. .NET developers without a background in statistics, linear algebra, or matrix operations probably will need some assistance from their more mathematically inclined colleagues to select and apply appropriate analytic functions. On the other hand, math majors will require minimal programming guidance to modify the MSCloudNumericsApp code to define their own jobs and deploy them to Windows Azure independently.
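The split, compute, and combine pattern described above can be illustrated with a minimal sketch. It is written in plain Python rather than the Lab's C#, it does not use any “Cloud Numerics” API, and the function names (partial_stats, combine, parallel_mean_std) are hypothetical. Each worker reduces its chunk to a count, a sum, and a sum of squares, and the partial results are merged into a global mean and (population) standard deviation once every chunk has finished.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def partial_stats(chunk):
    """Reduce one chunk to (count, sum, sum of squares)."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def combine(partials):
    """Merge per-chunk summaries into a global mean and population std dev."""
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / n
    variance = total_sq / n - mean * mean
    # Guard against tiny negative values caused by floating-point rounding.
    return mean, math.sqrt(max(variance, 0.0))


def split(values, n_chunks):
    """Split a list into at most n_chunks contiguous slices of similar size."""
    size = max(1, math.ceil(len(values) / n_chunks))
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_mean_std(values, n_workers=4):
    """Process the chunks in parallel, then combine the partial results."""
    chunks = split(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_stats, chunks))
    return combine(partials)


if __name__ == "__main__":
    delays = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values, not flight data
    print(parallel_mean_std(delays, n_workers=3))
```

The point of the design is that only three numbers per chunk travel back to the combining step, no matter how many rows each chunk holds.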

Creating and Running the Air Carrier Flight Arrival Delays Solution

The “Cloud Numerics” team provided three basic sample applications with their Lab v0.1, but none of these demonstrated the advanced parallel data reading and processing capabilities of which the Lab was capable. Microsoft’s Roope Astala filled that void on March 3, 2012 with his “Cloud Numerics” Example: Analyzing Air Traffic “On-Time” Data post to the Microsoft Codename “Cloud Numerics” blog. This sample project used 32 months of On-Time Performance data from the Federal Aviation Administration (FAA) for certificated U.S. air carriers to compute the probabilities of arrival delays of various durations. Astala and the “Cloud Numerics” team imported 32 comma-separated values (On_Time_Performance*.csv) files, which contain about 500,000 rows each from the FAA’s Research and Innovative Technology Administration (RITA) of the Bureau of Transportation Statistics, to 32 publicly accessible blobs in Windows Azure storage (see Figure 3).

Figure 3: A few interesting columns of the last 20 rows of the FAA’s On_Time_Performance_2012_1.csv file for January 2012. The view in Excel has been filtered to eliminate data for the 337,097 flights that didn’t have an arrival delay in the month.

A histogram for the first five hours of delay time with a logarithmic Frequency scale shows an exponential decrease in the number of flight delays for increasing delay times starting at about one hour (see Figure 4). Distribution patterns for earlier months are similar.

Figure 4: A histogram of flight arrival delays for 0 to 5 hours created by the Excel Data Analysis add-in’s Histogram tool for an unfiltered version of Figure 3’s worksheet shows an exponential distribution for delay times between about 60 and 300 minutes.

The MSCloudNumericsApp project contains a FlightInfoReader class, which implements the IParallelReader interface to allocate blobs to ComputeNode worker instances in round-robin fashion. The Cloud Numerics Deployment Utility sets two as the minimum number of ExtraLarge Compute instances, each of which has 8 cores, so each of the 16 cores processes two of the 32 *.csv files (see Figure 5). Unlike the Hadoop on Azure team, the “Cloud Numerics” team doesn’t give you free Windows Azure resources for testing, so you’ll run up costs of 18 cores × $0.15 per core-hour = $2.70 for each hour that your app is deployed. If you specify 4 Compute instances so that each file gets its own core, you must request special dispensation from the Windows Azure Billing Support bureaucracy to exceed the 20-core maximum for a Windows Azure subscription, and then pay for 34 cores × $0.15 per core-hour = $5.10 per hour. (FrontEnd and HeadNode instances run on a single core by default, but require two instances to qualify for the Windows Azure 99.95% Service Level Agreement (SLA).)
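Round-robin allocation itself is simple to picture. The sketch below is plain Python, not the FlightInfoReader or IParallelReader code from the sample, and the blob names are hypothetical placeholders. It deals file names out to workers one at a time, the way cards are dealt to players, so 32 files spread over 16 workers leave each worker with two files.

```python
def assign_round_robin(blob_names, n_workers):
    """Deal blob names out to workers in turn: 0, 1, ..., n-1, 0, 1, ..."""
    assignments = [[] for _ in range(n_workers)]
    for index, name in enumerate(blob_names):
        assignments[index % n_workers].append(name)
    return assignments


if __name__ == "__main__":
    # Hypothetical placeholder names; the real blobs hold the monthly CSV files.
    blobs = [f"blob_{i:02d}.csv" for i in range(32)]
    for worker, files in enumerate(assign_round_robin(blobs, 16)):
        print(worker, files)
```

Because the assignment depends only on a file's position in the list and the number of workers, each worker can work out its own share without coordinating with the others.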

Figure 5: Specify the Cluster Administrator and number of instances of Head, Compute and FrontEnd nodes on the Cloud Numerics Deployment Utility’s Cluster in Azure page. Specifying the SQL Azure Server instance to use enables the Deploy Cluster button. Specifying the MSCloudNumericsApp.exe on the Application Code page lets you submit different versions of the app to the Windows Azure HPC Scheduler without redeploying the cluster.
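The hourly cost figures quoted for the two cluster sizes can be reproduced with a few lines of arithmetic. This Python sketch uses the $0.15 per core-hour rate and the 8-core ExtraLarge instance size given in the article; the assumption of one single-core HeadNode and one single-core FrontEnd instance is inferred from the 18-core and 34-core totals, and the function name is hypothetical.

```python
CORE_HOUR_RATE = 0.15        # USD per core-hour, as quoted in the article
CORES_PER_EXTRA_LARGE = 8    # cores in one ExtraLarge Compute instance


def hourly_cost(compute_instances, head_nodes=1, front_ends=1,
                rate=CORE_HOUR_RATE):
    """Return (total cores, cost in USD per hour) for a deployed cluster.

    Assumes each HeadNode and FrontEnd instance uses a single core.
    """
    cores = compute_instances * CORES_PER_EXTRA_LARGE + head_nodes + front_ends
    return cores, round(cores * rate, 2)


if __name__ == "__main__":
    print(hourly_cost(2))  # minimum cluster: two ExtraLarge Compute instances
    print(hourly_cost(4))  # four ExtraLarge Compute instances
```

Doubling the HeadNode and FrontEnd instances to meet the SLA requirement adds two more cores to either total under the same assumptions.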

The MSCloudNumericsApp project’s Main method includes the URL and Access Key values for the public blobs with the flight data, which are stored in Microsoft’s North Central US (Chicago) data center. You deploy your project to the same data center to maximize performance and keep bandwidth costs down. The Main method also contains code that calculates the mean arrival delay time, its standard deviation, and the number of samples above and below 1, 2, 3, 4, and 5 standard deviations from the mean, and then exports the results to a FlightDataInfo.csv file stored in a private blob (see Figure 6). The total time to read the 8 million or so data rows into arrays containing arrival delay time in minutes and analyze them with two Extra Large Compute instances (16 cores) was slightly less than 2 minutes.

Figure 6: Code in the MSCloudNumericsApp project calculates the mean delay time and its standard deviation in minutes, as well as the number of samples above and below 1 to 5 standard deviations from the mean, and writes the results to a FlightInfoData.csv file such as this stored in a Windows Azure blob.
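The kind of summary shown in Figure 6 can be sketched on a single machine as follows. This is plain Python using the standard library, not the sample's C# code or the “Cloud Numerics” libraries. The function name is hypothetical, and the use of the population (rather than sample) standard deviation is an assumption, since the article does not say which the sample uses.

```python
import statistics


def deviation_summary(delays, max_k=5):
    """Return the mean, population std dev, and tail counts for k = 1..max_k.

    Each row is (k, samples above mean + k*std, samples below mean - k*std).
    """
    mean = statistics.fmean(delays)
    std = statistics.pstdev(delays)  # assumption: population standard deviation
    rows = []
    for k in range(1, max_k + 1):
        above = sum(1 for d in delays if d > mean + k * std)
        below = sum(1 for d in delays if d < mean - k * std)
        rows.append((k, above, below))
    return mean, std, rows


if __name__ == "__main__":
    sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values, not flight data
    mean, std, rows = deviation_summary(sample)
    print(f"mean={mean:.1f} std={std:.1f}")
    for k, above, below in rows:
        print(k, above, below)
```

Dividing each count by the total number of samples gives the percentages that the analysis of the results relies on.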

Astala observed in the “Step 5: Deploy the Application and Analyze Results” section of his March 2012 post:

Let’s take a look at the results. We can immediately see they’re not normal-distributed at all. First, there’s skew: about 70% of the flight delays are [briefer than the] average of 5 minutes. Second, the number of delays tails off much more gradually than a normal distribution would as one moves away from the mean towards longer delays. A step of one standard deviation (about 35 minutes) roughly halves the number of delays, as we can see in the sequence 8.5%, 4.0%, 2.1%, 1.1%, 0.6%. These findings suggest that the tail could be modeled by an exponential distribution [refer to Figure 6].

This result is both good news and bad news for you as a passenger. There is a good 70% chance you’ll arrive no more than five minutes late. However, the exponential nature of the tail means, based on conditional probability, that if you have already had to wait for 35 minutes there’s about a 50-50 chance you will have to wait for another 35 minutes.
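The 50-50 claim follows from the memoryless property of the exponential distribution. The derivation below is a sketch under the assumption that the tail of the delay distribution is exactly exponential; the symbols T (delay in the tail, in minutes), \lambda (rate), and \sigma (one standard deviation, about 35 minutes) are introduced here for illustration only.

```latex
% Assumed exponential tail with rate \lambda:
P(T > t) = e^{-\lambda t}, \qquad t \ge 0

% Conditional probability of waiting a further t minutes after s minutes:
P(T > s + t \mid T > s)
  = \frac{P(T > s + t)}{P(T > s)}
  = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}}
  = e^{-\lambda t}
  = P(T > t)

% One step of \sigma \approx 35 minutes roughly halves the number of delays:
e^{-\lambda \sigma} = \tfrac{1}{2}
  \;\Longrightarrow\;
  \lambda = \frac{\ln 2}{\sigma} \approx \frac{0.693}{35} \approx 0.0198 \ \text{per minute}

% Hence, for any time s already spent waiting:
P(T > s + 35 \mid T > s) = e^{-35\lambda} = \tfrac{1}{2}
```

Since the observed sequence only roughly halves at each step, the one-half figure is an approximation rather than an exact result.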

Test-Drive Microsoft Codename “Cloud Numerics”

If you’re an analytical specialist or .NET developer and need to perform numerical analysis on big data, you’ll probably find that Codename “Cloud Numerics” will save you substantial time (and thus money) compared with traditional tools like Hadoop, HDFS, MapReduce, Hive, and Sqoop. Sign up for admission to the “Cloud Numerics” Lab here, download and install the prerequisites on a 64-bit computer, scan the documentation, and follow my recent Analyzing Air Carrier Arrival Delays with Microsoft Codename “Cloud Numerics” tutorial to see if cloudbursting with “Cloud Numerics” and the Windows Azure HPC Scheduler is right for you.