Many organizations accumulate massive quantities of valuable commercial data but don’t have technology in place to make it easily available to data analysts or actionable by business decision-makers. Microsoft’s Windows Azure Marketplace DataMarket provides the infrastructure to syndicate more than 135 free and pay-per-use datasets to the public. The new Codename “Data Hub” preview extends the DataMarket’s cloud-based architecture and its Web front-end to provide a self-service interface for private data sources and enable federating DataMarket subscriptions without adding data center assets.
Syndicating large datasets via Web portals has the promise of becoming a significant profit center for organizations that originate private data or curate publicly available data sets. For example, Amazon Web Services (AWS) hosts public data sets, such as the 270-GB Ensembl Annotated Human Genome Data and the U.S. National Institutes of Health’s 200-TB 1000 Genomes Project, for use with AWS’s EC2 and Elastic MapReduce services. Not to be overshadowed by AWS, Microsoft Research’s Extreme Computing Group (XCG) implemented the National Center for Biotechnology Information’s (NCBI) Basic Local Alignment Search Tool (BLAST) on Windows Azure in late 2010. According to the XCG, Seattle Children’s Hospital used the service to solve in one week a problem that would have taken a single computer six years to complete. While much genetic data is freely available to scientists, publicity about successful research projects attracts commercial users who pay for the cloud-based storage, compute, and bandwidth services required to analyze the free data in conjunction with proprietary information.
Making a profit from putting data in the hands of consumers, either as a producer or syndicator, requires efficient online search, data transfer, and payment methods. Apple’s iTunes Store, Amazon’s Music Store, and Microsoft’s Zune Marketplace are examples of such methods for digitized music, which is just another data format. Video and eBook content are two other common online data formats for consumers.
Create a Private Self-Service DataMarket for Your Organization’s Data with Microsoft Codename “Data Hub”
IT departments of medium and large enterprises amass large amounts of data about the operations of their organizations, as well as detailed information on customers, suppliers, and other business partners. Historically, this information has been accumulated and analyzed for specific purposes and then archived or discarded. In this era of increasing “big data” consciousness, C-level executives expect analysis of internally generated data to deliver added business value, especially competitive advantage. A major problem is that information workers spend too much time searching for data and not enough time analyzing it for actionable insights with the tools they’re accustomed to using, usually Excel.
To overcome these data distribution, discovery, and curation problems, Microsoft’s SQL Azure Labs team leveraged the public DataMarket investment to create Codename “Data Hub” as an early concept of a value-added service that enables IT organizations to provision and manage their own private DataMarket. As with the public DataMarket, Microsoft hosts the Data Hub preview on Windows Azure, with SQL Azure tables as the data source, at http://YourHubName.clouddatahub.net. Data Hub is a private preview, so you must request an invitation by completing a form on the Codename “Data Hub” Welcome page and then wait for an e-mail with a token and instructions for its redemption. The first step is to specify the Firewall Rules for the IP range of your Data Hub’s users (see Figure 2).
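Before entering a firewall rule, it can help to confirm that a given user's address actually falls inside the range you plan to allow. The following is a minimal sketch, assuming a rule is expressed as an inclusive start and end IPv4 address; the function name and the sample addresses (taken from a documentation-only address block) are hypothetical and are not part of Data Hub.

```python
from ipaddress import ip_address


def in_firewall_range(client_ip: str, start_ip: str, end_ip: str) -> bool:
    """Return True if client_ip lies within the inclusive start/end range."""
    return ip_address(start_ip) <= ip_address(client_ip) <= ip_address(end_ip)


if __name__ == "__main__":
    # Hypothetical rule: allow every address from 203.0.113.0 to 203.0.113.255.
    rule_start, rule_end = "203.0.113.0", "203.0.113.255"
    for user_ip in ("203.0.113.25", "198.51.100.7"):
        allowed = in_firewall_range(user_ip, rule_start, rule_end)
        print(user_ip, "allowed" if allowed else "blocked")
```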
Click Create Marketplace to open the home page for your administrative account, which displays your account key for the service. Click the Publish Data menu link and then click the Connect menu link to open the Connect to Your Data Sources page (see Figure 3).
The Upload Your Data … choice uses the Codename “Data Transfer” Software as a Service (SaaS) utility under the covers for uploading *.csv-formatted source data to tables of an SQL Azure server instance. In a recent blog post, I described the problems I encountered when uploading *.csv files containing about a half-million US Air Carrier Flight Delay records with the first preview versions of both Data Hub and Data Transfer. I worked around the problem by using T-SQL’s BULK INSERT command to populate a local SQL Server table and George Huey’s SQL Azure Migration Wizard v3.8.8 or v4.0.1 to transfer the table to the SQL Azure data source. This article assumes you have prepared a suitable SQL Azure data source. A more detailed tutorial is available here.
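The BULK INSERT step of the workaround can be sketched as follows. This is a minimal illustration that only builds the T-SQL text; it does not connect to a server. The table name, file path, and the assumption that the *.csv file is comma-delimited with a single header row are all hypothetical, so adjust them to match your own data.

```python
def bulk_insert_sql(table: str, csv_path: str, first_row: int = 2) -> str:
    """Build a T-SQL BULK INSERT statement for a comma-delimited file.

    first_row=2 skips a single header row (an assumption about the file).
    """
    # Double any single quotes so the path is a valid T-SQL string literal.
    safe_path = csv_path.replace("'", "''")
    return (
        f"BULK INSERT {table}\n"
        f"FROM '{safe_path}'\n"
        f"WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', "
        f"FIRSTROW = {first_row});"
    )


if __name__ == "__main__":
    # Hypothetical target table and source file on the local SQL Server machine.
    print(bulk_insert_sql("dbo.FlightDelays", r"C:\data\flight_delays.csv"))
```

You would run the generated statement against the local SQL Server instance (for example, from SQL Server Management Studio) after creating a table whose columns match the file, and then use the migration wizard to move the table to SQL Azure.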
After specifying the data source, click the Publish menu link to start the four-stage publishing process at the 1. Data Source step. Edit the original column names to clarify them for users and clear the checkboxes in the Queryable? column for those fields that users don’t need to use for data filtering (see Figure 4). The primary key column requires a clustered, no-duplicates index and queryable columns require non-clustered indexes.
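The indexing requirement can be met with ordinary T-SQL DDL on the source table. The sketch below generates one clustered primary key constraint (which enforces uniqueness) and one non-clustered index per queryable column. The table, column, and index names are hypothetical, and the generated statements are one possible way to satisfy the requirement, not a procedure prescribed by Data Hub.

```python
from typing import List


def index_ddl(table: str, key_column: str, queryable_columns: List[str]) -> List[str]:
    """Return T-SQL statements that add a clustered primary key and
    one non-clustered index for each queryable column."""
    short_name = table.split(".")[-1]  # e.g. "dbo.FlightDelays" -> "FlightDelays"
    statements = [
        f"ALTER TABLE {table} ADD CONSTRAINT PK_{short_name} "
        f"PRIMARY KEY CLUSTERED ([{key_column}]);"
    ]
    for column in queryable_columns:
        statements.append(
            f"CREATE NONCLUSTERED INDEX IX_{short_name}_{column} "
            f"ON {table} ([{column}]);"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical table and column names for the flight-delay data.
    for stmt in index_ddl("dbo.FlightDelays", "RowId", ["Carrier", "Dest"]):
        print(stmt)
```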
Click the 2. Contacts menu link and fill in the name, email alias and phone number of the Data Hub’s administrator, and then click the 3. Details menu link to add descriptions, links to documentation, and the organization’s logo to the offer (see Figure 5).
Click the 4. Status/Review menu choice to validate the entries of your previous steps (see Figure 6).
Click the Request Approval button and then click the Approve menu link to display the My Approvals form. Open the Approve/Decline list and select Approve, and then open the Display Actions list and select Marketplace to display the offering page. You can also open the offering page by going to the OakLeaf Data Hub, signing in with your Windows Live ID, selecting the Government or Transportation and Navigation link, and clicking the US Air Carrier Flight Delays item (see Figure 7).
Microsoft Codename “Data Hub” provides a more intuitive and attractive user interface than you would expect from a first preview version of a relatively complex SaaS application running on Windows Azure. Of course, the SQL Azure Labs team had the advantage of a year or more of development and testing by the public Marketplace DataMarket folks to work out the kinks. There’s no guarantee that Microsoft will release Codename “Data Hub” as a value-added Windows Azure SaaS app, but I’d give it at least a 90 percent chance of commercial success.