Distributed databases, such as the Apache Hadoop framework, Amazon SimpleDB, and Windows Azure storage, are commonly used to store and manipulate large collections of unstructured data. For example, Yahoo! runs Apache Hadoop and its HDFS file system on more than 10,000 Linux cores to power its Search Webmap. However, most IT organizations favor an RDBMS for storing and querying structured data, such as Web logs, application diagnostics data, and financial information. DBAs are familiar with managing RDBMSs through SQL's data definition and data manipulation languages, and most application developers are conversant with its query syntax. The learning curve for Apache's MapReduce software framework, Hive data warehouse infrastructure, Pig data-flow language, Mahout data mining library, ZooKeeper coordination service, and other Hadoop-specific applications is quite steep for most SQL professionals and costly for IT organizations of all sizes.
SQL Azure's Transact-SQL (T-SQL) dialect implements most SQL Server 2008 R2 keywords, easing the transition from on-premises to public cloud environments. George Huey's open source SQL Azure Migration Wizard (SQLAzureMW) utility automatically migrates SQL Server 2008 R2 database schemas and data to SQL Azure databases running in Microsoft data centers. Until recently, the primary constraint on the adoption of popular RDBMSs, such as Microsoft SQL Server, for big-data projects in public cloud environments was maximum database size. For example, SQL Azure databases were limited to 50 GB until December 2011, when Microsoft raised the ceiling to 150 GB. Like HDFS, SQL Azure maintains an original and two replicas of all data, with automatic failover, to achieve high reliability.
Scaling Out Relational Databases in Public Clouds with Sharding and Federation
Scaling out Web sites with farms of application servers fronted by load balancers to accommodate spikes in page requests is a common practice, but scaling up database size doesn't assure sufficient resources to meet surges of connection requests. Microsoft's "SQL Azure Performance and Elasticity Guide" TechNet wiki article (last updated December 21, 2011) states in its "Performance Profile of a Single SQL Azure Database" section, "As of this writing, each SQL Azure computer is equipped with 32 GB RAM, 8 CPU cores and 12 hard drives." This statement implies that the amount of RAM, as well as the number of cores and hard drives, doesn't increase with database size. Further, the "SQL Azure Throttling" section of the same article discloses, "To ensure that all subscribers receive an appropriate share of resources and that no subscriber monopolizes resources at the expense of other subscribers, SQL Azure may close or 'throttle' subscriber connections under certain conditions. … As the name implies, 'Engine Throttling' scales back resource usage by blocking connectivity of subscribers that are adversely impacting the overall health of the system. The degree to which a subscriber's connectivity is blocked ranges from blocking inserts and updates only, to blocking all writes, to blocking all reads and writes."
Circumventing throttling requires scaling out by partitioning the relational data across multiple databases, a process called sharding, which Microsoft calls federation. Federation, which arrived in the December 2011 SQL Azure upgrade, also raises effective database capacity limits by reducing the size of individual databases (shards, or federation members) and increases the resources available to each shard's database engine. If you federate your databases primarily to satisfy surges in query execution, rather than to reduce database size, elasticity lets you reduce the number of members by merging their data when demand falls off. SQL Azure Web database pricing is linear with size ($9.99 per GB for 1 to 9 GB), so if you have 9 GB or less of data and can shard it into as many as nine 1-GB or smaller databases of uniform size, you get up to nine times the potential performance at no additional cost. My Generating Big Data for Use with SQL Azure Federations and Apache Hadoop on Windows Azure Clusters tutorial of January 8, 2012 explains how to set up a sample multi-GB Windows Azure Diagnostics data source (WADPerfCounters) for initial import into SQL Server 2008 R2 SP1 with the BULK INSERT instruction.
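The initial on-premises import can be sketched with a BULK INSERT statement of the following form; the file path, delimiters, and header row shown are assumptions for illustration, not the tutorial's actual values:

```sql
-- Load a delimited WADPerfCounters export into on-premises
-- SQL Server 2008 R2 SP1 before migrating the data to SQL Azure.
-- The file path and delimiter settings below are hypothetical.
BULK INSERT dbo.WADPerfCounters
FROM 'C:\WADData\WADPerfCounters.csv'
WITH (
    FIELDTERMINATOR = ',',   -- column delimiter in the export file
    ROWTERMINATOR = '\n',    -- row delimiter
    FIRSTROW = 2,            -- skip a header row, if present
    TABLOCK                  -- table lock for faster bulk loading
);
```

TABLOCK is optional but typically speeds multi-GB loads by enabling minimally logged inserts.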
Defining the Workflow for Big-Data Federations
Migrating a SQL Server 2008 R2 database to a basic SQL Azure database with the SQL Azure Migration Wizard v3.8 (SQLAzureMW) is a simple two- or three-step process, which I described for an earlier version in my Using the SQL Azure Migration Wizard v3.3.3 with the AdventureWorksLT2008R2 Sample Database tutorial of July 18, 2010. However, SQLAzureMW isn't designed to create schemas for federated databases or upload data to them; the new SQLAzureFedMW v1.0 uploads data only. You create the federated database with T-SQL DDL statements in the new SQL Azure Management Portal's UI or in SQL Server Management Studio 2008 R2 [Express]. Figure 1 shows the steps required to create, populate, and test a six-member federation of WADPerfCounters, with members federated on a federation key of the int data type derived from the CounterName column value.
My Creating a SQL Azure Federation in the Windows Azure Platform Portal tutorial of January 11, 2012 describes in detail how to use the SQL Azure Management Portal's UI to create a new server account, add a database, and then create the first WADPerfCounters member of a new WADFederation federation. My Loading Big Data into Federated SQL Azure Tables with the SQL Azure Federation Data Migration Wizard v1.0 tutorial described uploading a first small dataset to enable splitting the first member into five more. Figure 2 shows the Management Portal's federation page after splitting the first member (1, LOW) into five new members (2, 3, 4, 5, and 6), each containing the WADPerfCounters rows with the corresponding CounterId value.
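For reference, creating the federation and splitting its first member use T-SQL statements of the following form; the split boundary values shown are illustrative, assuming CounterId values of 1 through 6:

```sql
-- Run in the root database: create a federation distributed
-- on an int federation key named Id.
CREATE FEDERATION WADFederation (Id INT RANGE)
GO
-- Split the first member at a CounterId boundary; repeating the
-- SPLIT AT operation at Id = 3, 4, 5 and 6 yields six members in all.
ALTER FEDERATION WADFederation SPLIT AT (Id = 2)
```

Each SPLIT AT operation copies the member's rows into two new members on either side of the boundary and then drops the original member, so splits can run while the federation remains online.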
Here's the T-SQL that created the federated WADPerfCounters table in the first member:
USE FEDERATION [WADFederation] ([Id] = -2147483648) WITH FILTERING = OFF, RESET
GO
CREATE TABLE [WADPerfCounters](
    [PartitionKey] [bigint] NOT NULL,
    [RowKey] [varchar](100) NOT NULL,
    [Timestamp] [datetime2](7) NOT NULL,
    [EventTickCount] [bigint] NOT NULL,
    [DeploymentId] [varchar](50) NOT NULL,
    [Role] [varchar](20) NOT NULL,
    [RoleInstance] [varchar](20) NOT NULL,
    [CounterName] [varchar](100) NOT NULL,
    [CounterValue] [decimal](16,8) NOT NULL,
    [CounterId] [int] NOT NULL,
    -- The federation key must be part of the clustered primary key;
    -- [PartitionKey] and [RowKey] are assumed to complete the key.
    CONSTRAINT [PK_WADPerfCounters] PRIMARY KEY CLUSTERED
        ([CounterId], [PartitionKey], [RowKey])
) FEDERATED ON ([Id] = [CounterId])
The ([Id] = -2147483648) expression routes the statement to the first member, whose range begins at the lowest allowable int value. The FEDERATED ON ([Id] = [CounterId]) clause creates the table as a federated table of the active database's federation, distributed on the CounterId column.
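After the data is loaded, a connection can be routed to a single member with a USE FEDERATION statement of this form; the CounterId value of 3 is illustrative:

```sql
-- Route the connection to the member whose range contains Id = 3.
-- FILTERING = ON additionally scopes queries to rows where
-- CounterId = 3, even if the member holds a wider key range.
USE FEDERATION [WADFederation] ([Id] = 3) WITH FILTERING = ON, RESET
GO
SELECT COUNT(*) FROM [WADPerfCounters]
```

With FILTERING = OFF, the same connection would instead see all rows in that member, which is useful for administrative queries that span a member's whole key range.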
The next step is to perform a test population with a nominal number of additional rows (398,000 for this example) to verify that SQLAzureFedMW and the federation software route the data to the correct members. Figure 3 shows SQLAzureFedMW's second step: selecting the federation and the members to which to upload the data.
When you click Next, SQLAzureFedMW begins uploading data from the BCP file via secure TDS connections in parallel (six connections in this example). Microsoft was experiencing problems with SQL Azure and the SQL Azure Management Portal in its North Central US data center on January 12 and 13, 2012, when I was running these tests. Figure 4 illustrates network failures, which testing confirmed occurred on the member 4 thread.
50,167 of 66,500 rows had uploaded when these errors commenced; the remaining 16,333 rows failed to upload. My Adding Missing Rows to a SQL Azure Federation with the SQL Azure Federation Data Migration Wizard v1 tutorial of January 15, 2012 explains how to upload those rows to a reference (non-federated) table named WADPerfCounters4. I used the following MERGE instruction to prevent primary key constraint conflicts with rows already uploaded:
MERGE INTO [WADPerfCounters] AS target
USING [WADPerfCounters4] AS source
-- Match on the clustered primary key columns so rows already
-- uploaded are skipped; the key columns here are assumed.
ON target.[CounterId] = source.[CounterId]
   AND target.[PartitionKey] = source.[PartitionKey]
   AND target.[RowKey] = source.[RowKey]
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([PartitionKey], [RowKey], [Timestamp], [EventTickCount],
        [DeploymentId], [Role], [RoleInstance], [CounterName],
        [CounterValue], [CounterId])
    VALUES (source.[PartitionKey], source.[RowKey], source.[Timestamp],
        source.[EventTickCount], source.[DeploymentId], source.[Role],
        source.[RoleInstance], source.[CounterName],
        source.[CounterValue], source.[CounterId]);
The MERGE instruction was introduced in SQL Server 2008.
The overall process is described here, and in more detail in my four tutorial posts. After you've run through this drill a time or two, you'll have George Huey's SQLAzureFedMW to thank for enabling you to complete a process of this complexity in 30 minutes or less, not counting data download and upload time.