Free Private Data from Silos for Internal Use with Microsoft CodeName "Data Hub"

Both Amazon and Microsoft are hosting large datasets such as those used for healthcare. Windows Azure Marketplace DataMarket now gives enterprises a way of publishing or buying large datasets. This principle is now being extended to allow organizations to provide their knowledge workers with a central Self-Service private DataMarket.

Many organizations accumulate massive quantities of valuable commercial data but don’t have technology in place to make it easily available to data analysts or actionable by business decision-makers. Microsoft’s Windows Azure Marketplace DataMarket provides the infrastructure to syndicate more than 135 free and pay-per-use datasets to the public. The new Codename “Data Hub” preview extends the DataMarket’s cloud-based architecture and its Web front-end to provide a self-service interface for private data sources and enable federating DataMarket subscriptions without adding data center assets.

Syndicating large datasets via Web portals has the promise of becoming a significant profit center for organizations that originate private data or curate publicly available data sets. For example, Amazon Web Services (AWS) hosts public data sets, such as the 270-GB Ensembl Annotated Human Genome Data and the U.S. National Institutes of Health’s 200-TB 1000 Genomes Project for use with AWS’s EC2 and Elastic MapReduce services. Not to be overshadowed by AWS, Microsoft Research’s Extreme Computing Group (XCG) implemented the National Center for Biotechnology Information’s (NCBI) Basic Local Alignment Search Tool (BLAST) on Windows Azure in late 2010. According to the XCG, Seattle Children’s Hospital used the service to solve a problem in one week that would take a single computer six years to complete. While much genetic data is freely available to scientists, publicity about successful research projects garners commercial users who pay for the cloud-based storage, compute, and bandwidth services required to analyze the free data in conjunction with proprietary information.

Making a profit from putting data in the hands of consumers, either as a producer or syndicator, requires efficient online search, data transfer, and payment methods. Apple’s iTunes Store, Amazon’s Music Store, and Microsoft’s Zune Marketplace are examples of such methods for digitized music, which is just another data format. Video and eBook content are two other common online data formats for consumers.

Microsoft announced the commercial release of its Windows Azure Marketplace DataMarket, formerly codenamed Project “Dallas”, in October 2010 at the Professional Developers Conference. DataMarket corresponds to an information store of free or pay-per-transaction datasets for consumption by knowledge workers and business managers. Prospective users can browse or visualize data filtered by queryable fields, such as geographic regions or date ranges, in a built-in Data Explorer application (see Figure 1). Data is most commonly delivered in *.csv files for importing to Excel worksheets or relational databases, such as Microsoft Access or SQL Server, but is also available in Open Data Protocol (OData) XML format for use with .NET, Java, JavaScript, PHP, Ruby and Silverlight applications. By mid-2012, the DataMarket offered more than 135 data sets from 50 publishers in 17 genres ranging from Business and Finance to Weather and Climate. More than 40% of the data sets were offered free of charge and most pay-for-use data sets offered free trial use. OakLeaf Systems offers about 3 million rows of U.S. Air Carrier Flight Delay records from the U.S. Federal Aviation Administration (FAA) as a free data set. (See my article “Analyze Years of Air Carrier Flight Arrival Delays in Minutes with the Windows Azure HPC Scheduler” on A Cloudy Place for more details about this data.)


Figure 1. DataMarket’s Data Explorer displays sample data, in this case flight delays for Southwest Airlines (WN) during the month of January 2012, in a tabular grid by default. Users also can visualize the data in bar, column or line charts. Exporting to PowerPivot or Tableau visualizations, or to *.csv or *.xml (OData) files, is an option. Data Hub’s Explorer is identical except for omission of the DataMarket logo.
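For applications that consume the OData format rather than a *.csv export, the same kind of filter that Data Explorer applies to queryable fields can be expressed in the request URL with the standard OData `$filter` and `$top` query options. The following Python sketch only builds such a URL. The service root, the entity set name (`FlightDelays`) and the field names (`Carrier`, `Month`) are hypothetical placeholders, not the actual addresses or schema of the flight delay data set, and authentication with an account key is not shown.

```python
from urllib.parse import quote, urlencode

# Hypothetical service root, for illustration only; substitute the
# service URL shown on the data set's offering page.
SERVICE_ROOT = "https://example.invalid/ExampleFlightData/"


def build_odata_query(service_root, entity_set, filters, top=None):
    """Return an OData query URL that filters on equality and limits rows.

    filters maps a queryable field name to the value it must equal.
    """
    clauses = []
    for field, value in filters.items():
        if isinstance(value, str):
            # OData string literals use single quotes; double any embedded quote.
            literal = "'" + value.replace("'", "''") + "'"
        else:
            literal = str(value)
        clauses.append(f"{field} eq {literal}")

    params = {"$filter": " and ".join(clauses)}
    if top is not None:
        params["$top"] = str(top)

    # Percent-encode spaces but leave "$" and "'" readable in the URL.
    query = urlencode(params, quote_via=quote, safe="$'")
    return f"{service_root.rstrip('/')}/{entity_set}?{query}"


if __name__ == "__main__":
    # Hypothetical entity set and field names.
    print(build_odata_query(SERVICE_ROOT, "FlightDelays",
                            {"Carrier": "WN", "Month": 1}, top=100))
```

Only fields marked as queryable by the publisher can be expected to work in a filter, which is why the choice of queryable columns during publishing matters to consumers of the feed.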

Create a Private Self-Service DataMarket for Your Organization’s Data with Microsoft Codename “Data Hub”

IT departments of medium and large enterprises amass large amounts of data about the operations of their organizations, as well as detailed information on customers, suppliers, and other business partners. Historically, this information has been accumulated and analyzed for specific purposes and then archived or discarded. In this era of increasing “big data” consciousness, C-level executives expect analysis of internally generated data to deliver added business value, especially competitive advantage. A major problem is that information workers spend too much time searching for data and not enough time analyzing it for actionable insights with the tools they’re accustomed to using, usually Excel.

To overcome these data distribution, discovery, and curation problems, Microsoft’s SQL Azure Labs team leveraged the public DataMarket investment to create Codename “Data Hub” as an early concept of a value-added service that enables IT organizations to provision and manage their own private DataMarket. As with the public DataMarket, Microsoft hosts the Data Hub preview on Windows Azure with SQL Azure tables as the data source. Data Hub is a private preview, so you must request an invitation by completing a form on the Codename “Data Hub” Welcome page and then wait for an e-mail with a token and instructions for its redemption. After you redeem the token, the first step is to specify the Firewall Rules for the IP range of your Data Hub’s users (see Figure 2).


Figure 2. Specify the IP range for authorized users of your Data Hub in the Create Your Marketplace form. The Firewall Rules are the same as those for SQL Azure server instances. The sample provides read-only access to the public at large.
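A firewall rule of this kind is a named pair of start and end IP addresses, and a client is admitted when its address falls inside at least one range. The Python sketch below illustrates that check with the standard `ipaddress` module. It is an illustration of the rule semantics under the assumption of IPv4 addresses only, not the service's own implementation, and the rule names and address ranges are made up.

```python
from ipaddress import ip_address


def is_allowed(client_ip, rules):
    """Return True if client_ip lies inside any (start, end) range in rules.

    rules maps a rule name to a (start_ip, end_ip) tuple; both ends are
    inclusive. All addresses are assumed to be IPv4.
    """
    addr = ip_address(client_ip)
    return any(
        ip_address(start) <= addr <= ip_address(end)
        for start, end in rules.values()
    )


if __name__ == "__main__":
    # Hypothetical rules: one office range, using a documentation-only block.
    office_rules = {"HeadOffice": ("203.0.113.10", "203.0.113.50")}
    print(is_allowed("203.0.113.25", office_rules))
    print(is_allowed("203.0.113.51", office_rules))

    # A rule spanning the whole IPv4 space admits everyone, which is the
    # effect of a sample open to the public at large.
    public_rules = {"Everyone": ("0.0.0.0", "255.255.255.255")}
    print(is_allowed("198.51.100.7", public_rules))
```

For a private Data Hub, the narrower the ranges, the smaller the exposure, so a range per office or VPN gateway is usually preferable to one wide rule.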

Click Create Marketplace to open the home page for your administrative account, which displays your account key for the service. Click the Publish Data menu link and then click the Connect menu link to open the Connect to Your Data Sources page (see Figure 3).


Figure 3. The Connect to your Data Sources form lets you choose an existing SQL Azure database, one of four free trial SQL Azure databases, or upload *.csv data to an existing database.

The Upload Your Data … choice uses the Codename “Data Transfer” Software as a Service (SaaS) utility under the covers for uploading *.csv-formatted source data to tables of an SQL Azure server instance. I described the problems I encountered with uploading *.csv files having about a half-million US Air Carrier Flight Delay records with the first preview version of both Data Hub and Data Transfer in a recent blog post. I found a workaround for the problem by using T-SQL’s BULK INSERT command to create a local SQL Server table and George Huey’s SQL Azure Migration Wizard v3.8.8 or v4.0.1 to transfer tables to the SQL Azure data source. This article assumes you have prepared a suitable SQL Azure data source. A more detailed tutorial is available here.

After specifying the data source, click the Publish menu link to start the four-stage publishing process at the 1. Data Source step. Edit the original column names to clarify them for users and clear the checkboxes in the Queryable? column for those fields that users don’t need to use for data filtering (see Figure 4). The primary key column requires a clustered, no-duplicates index and queryable columns require non-clustered indexes.

Figure 4. Specify the queryable columns for user-defined filters on the dataset in the Data Source page.
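Because the validation step checks for these indexes, it can help to create them on the SQL Azure table before publishing. The Python sketch below generates the corresponding T-SQL `CREATE INDEX` statements for a table. The table name, column names and the `IX_` naming pattern are hypothetical, and in practice the primary key is often declared as a `PRIMARY KEY CLUSTERED` constraint instead, which produces the same unique clustered index.

```python
def index_statements(table, primary_key, queryable_columns):
    """Return T-SQL statements that create the indexes a published table needs.

    One unique clustered index on the primary key column, plus one
    non-clustered index for each queryable column.
    """
    statements = [
        f"CREATE UNIQUE CLUSTERED INDEX [IX_{table}_{primary_key}] "
        f"ON [{table}] ([{primary_key}]);"
    ]
    for column in queryable_columns:
        if column == primary_key:
            continue  # already covered by the clustered index
        statements.append(
            f"CREATE NONCLUSTERED INDEX [IX_{table}_{column}] "
            f"ON [{table}] ([{column}]);"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical table and column names.
    for sql in index_statements("FlightDelays", "RowId",
                                ["Carrier", "Dest", "Month"]):
        print(sql)
```

Every non-clustered index adds storage and slows bulk loads, which is a practical reason to clear the Queryable? checkbox for columns users will not filter on.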

Click the 2. Contacts menu link and fill in the name, email alias and phone number of the Data Hub’s administrator, and then click the 3. Details menu link to add descriptions, links to documentation, and the organization’s logo to the offer (see Figure 5).


Click the 4. Status/Review menu choice to validate the entries of your previous steps (see Figure 6).


Figure 6. The final step for preparing the submission runs a detailed validation test on your work so far. Missing information or indexes, which appear in red, must be remedied before the Request Approval button appears.

Click the Request Approval button and then click the Approve menu link to display the My Approvals form. Open the Approve/Decline list and select Approve, then open the Display Actions list and select Marketplace to display the offering page. Users can open the same page by going to the OakLeaf Data Hub, signing in with a Windows Live ID, selecting the Government or Transportation and Navigation links, and clicking the US Air Carrier Flight Delays item (see Figure 7).


Figure 7. Users add or remove subscriptions to their collection in the Offering Name page. Clicking the Explore This Dataset link opens the Data Hub version of the query form shown in Figure 1. This post explains the differences users experience when opening a dataset from the DataMarket and “Data Hub.”


Microsoft Codename “Data Hub” provides a more intuitive and attractive user interface than you would expect from a first preview version of a relatively complex SaaS application running on Windows Azure. Of course, the SQL Azure Labs team had the advantage of a year or more of development and testing by the public Marketplace DataMarket folks to work out the kinks. There’s no guarantee that Microsoft will release Codename “Data Hub” as a value-added Windows Azure SaaS app, but I’d give it at least a 90 percent chance of commercial success.