Taking Azure SQL Data Warehouse for a Test-Drive

Azure SQL Data Warehouse uses SQL to query the data, but there are some differences to SQL Server's dialect of TSQL, and queries don't necessarily work in the same way. DBAs are also required to use SSDT to access Azure SQL Data Warehouse. It is worth taking the time to try the preview of the product, and take it for a 'spin around the block', following Robert Sheldon's walk-through.

This is the fourth article in a series of articles I’ve been writing about Azure SQL Data Warehouse. In the previous articles, I introduced you to various SQL Data Warehouse concepts and how to get started using the service and connecting to a database. In this article, I explain how to run different types of T-SQL queries against a SQL Data Warehouse database, using SQL Server Data Tools (SSDT) in Visual Studio.

If you’re not familiar with SQL Data Warehouse or do not know how to connect to a SQL Data Warehouse database, you might want to review one or more of the first three articles:

In this article, we’ll create a SQL Data Warehouse database, add objects to the database, and then query the database. If you want to try out the examples in this article, you must have an active Azure subscription and have set up a resource group and V12 logical SQL server. Refer to the previous articles as necessary if you’re uncertain how to proceed.

Keep in mind that adding a database to SQL Data Warehouse will incur Azure subscription fees, unless you’re running the service under Azure’s 30-day free trial or you have extra MSDN credits. Be certain you understand your subscription type and the fee structure before you start. Also be sure to pause the database when you’re not using it to avoid unexpected or unnecessary charges.

Creating a database

To create a SQL Data Warehouse database, you can use the Azure portal, PowerShell, or the T-SQL CREATE DATABASE statement. Although similar to the CREATE DATABASE statement used for on-premises SQL Server or Azure SQL Database, the SQL Data Warehouse version follows its own syntax:

The syntax is fairly straightforward, but you need to understand the various options. For example, if you specify the COLLATE option, you can pick from only two collations: SQL_Latin1_General_CP1_CI_AS, the default, or Latin1_General_100_CI_AS_KS_WS.

The optional MAXSIZE setting is limited to specific sizes, as defined by SQL Data Warehouse, with 10,240 GB (10 TB) being the default. You can specify any one of the following values: 250, 500, 750, 1024, 5120, 10240, 20480, 30720, 40960, 51200, or 61440.

You must include the EDITION option, and you must specify the value datawarehouse for that option, which is the only setting available to SQL Data Warehouse.

The SERVICE_OBJECTIVE option is also required, and its value must be one of the following predefined data warehouse units (DWUs): DW100, DW200, DW300, DW400, DW500, DW600, DW1000, DW1200, DW1500, or DW2000.

Let’s look at an example that puts these elements together. The following CREATE DATABASE statement creates a database named SdwDB2, with a maximum size value of 250 and a DWU setting of DW100:

That’s all there is to defining a SQL Data Warehouse database. After you run the statement, the service will create the database, though it can take a couple minutes.

When working with T-SQL in SSDT, be aware that IntelliSense is not always accurate in terms of pointing out syntax issues. For example, in the preceding example, I’ve seen IntelliSense highlight the MAXSIZE option as being incorrect, although the statement ran fine.

I’ve seen similar issues in many of the T-SQL statements, and in each case, the statements also ran fine, unless there was a genuine error. Just keep in mind that SQL Data Warehouse is still in preview and there are a few kinks yet to work out (although that’s not stopping Microsoft from charging you for the service).

Also be aware that not all T-SQL statements are supported in SQL Data Warehouse. For example, you cannot issue a DROP DATABASE statement to remove a database. For that you need to use PowerShell or the Azure portal.

While we’re on the topic of databases and unsupported T-SQL statements, note that the USE statement is also not available for changing the database context. In SSDT, the simplest way to achieve that is to select a different database at the top of the query tab.

Working with tables

When working with the SQL Data Warehouse version of a T-SQL statement, you’ll often find that the syntax is similar to how the statement is implemented in SQL Server or SQL Database. However, the SQL Data Warehouse version of a statement is usually not as extensive and sometimes supports options specific to SQL Data Warehouse, as we saw with the CREATE DATABASE statement.

The same can be said of the CREATE TABLE statement. It does not support many options available to SQL Server or SQL Database, and there are a couple options specific to SQL Data Warehouse. For example, the statement does not support primary keys, foreign keys, check constraints, unique constraints, unique indexes, computed columns, sparse columns, user-defined data types, indexed views, identities, sequences, triggers, or synonyms. The statement also does not support certain data types, such as geometry, geography, hierarchyid, and so on. At the same time, the CREATE TABLE statement provides table options that are unique to SQL Data Warehouse or handled differently from standard T-SQL.

Let’s look at an example to see some of these table options in action. The following CREATE TABLE statement creates the FactSales table, which includes several key columns that are theoretically related to dimensions within the data warehouse:

As expected, the table includes no primary key or foreign keys or any of the other forbidden elements. However, it does include several interesting table options in the WITH clause.

The first is CLUSTERED COLUMNSTORE INDEX, which creates a memory-optimized index that includes all table data. This is similar to the CLUSTERED COLUMNSTORE indexing option available to the SQL Server CREATE TABLE statement, but unlike the SQL Server statement, we cannot create nonclustered columnstore indexes on SQL Data Warehouse tables.

The next table option is DISTRIBUTION, which specifies the method used for distributing data across multiple locations (distributions). You can choose one of two options: ROUND_ROBIN or HASH. The ROUND_ROBIN option is the default and distributes the rows evenly across all distributions.

The HASH option assigns each row to a distribution by hashing the value in the specified column (in this case, Customer Key), which can be useful for joining tables and for keeping certain types of data together, consequently improving query performance. For information about whether to choose ROUND_ROBIN or HASH, refer to the Azure help topic Table design in SQL Data Warehouse.

The final table option is PARTITION, which determines how rows are grouped and stored within each distribution. In this case, the partitions are based on the ProductKey column and the specified RANGE options. As a result, five partitions will be created, with data sorted into those partitions according to the following rules:

  • Partition 1: ProductKey <= 1000
  • Partition 2: ProductKey > 1000 and <= 2000
  • Partition 3: ProductKey > 2000 and <= 3000
  • Partition 4: ProductKey > 3000 and <= 4000
  • Partition 5: ProductKey > 4000

When the CREATE TABLE statement runs, it adds the FactSales table to the default schema, dbo. However, we can use the CREATE SCHEMA statement to create additional schemas in a SQL Data Warehouse database. As with other T-SQL statements, the statement’s syntax does not support as many options as the SQL Server statement, but the basics are the same. For example, the following statement creates a schema named sales:

Now we can use an ALTER SCHEMA statement to transfer the FactSales table to the sales schema:

With SQL Data Warehouse, the ALTER SCHEMA statement can be used only for basic database objects. With SQL Server or SQL Database, you can also use the statement to move user-defined data types and XML schema collections.

In rare cases, you will find T-SQL statements for SQL Data Warehouse that are not supported in either SQL Server or SQL Database. For example, SQL Data Warehouse provides the RENAME statement, which lets you rename a user table, as shown in the following example:

The statement changes the name of the FactSales table to FactOrders. In SQL Server and SQL Database, you must use the sp_rename system stored procedure to rename tables, as well as other user-defined objects.

After you create a table, you might want to add data directly from Visual Studio, rather than performing a bulk load operation, which could come later. SQL Data Warehouse supports the INSERT statement, along with other data modification language (DML) statements. As with other types of statements, the DML statements tend to be less robust than you’ll find in SQL Server or SQL Database, but the basic functionality is there.

For example, with SQL Data Warehouse, you cannot precede an INSERT statement with a common table expression, and you cannot pass in multiple rows with a single INSERT statement. However, you can still pass in data to specific columns or to all columns, as in the following example:

The statements insert the data into the target table, just like the would in SQL Server. But as these statements demonstrate, you’ll have to rely heavily on Microsoft documentation for the specifics about how a statement is implemented on different platforms. For example, the MSDN help topic INSERT (Transact-SQL) describes the INSERT statement as it is implemented in SQL Server, SQL Database, SQL Data Warehouse, and Parallel Data Warehouse.

Creating a table from a SELECT statement

An interesting piece of the T-SQL puzzle, as it relates to SQL Data Warehouse, is what Microsoft refers to as the create table as select (CTAS) statement, which is a CREATE TABLE statement that derives its schema and initial data from a SELECT statement.

According to Microsoft, the CTAS statement is a fully parallelized operation and is the simplest and fastest way to copy all or part of a table. The CTAS statement can also be useful for creating a temporary table, such as the one in the following example:

The #CustomerSales table is based on a simple SELECT statement that finds the total sales for each customer. The CTAS statement used to create the temporary table starts with a WITH clause, which works the same as the WITH clause used in a regular CREATE TABLE statement. In this case, the clause creates a clustered columnstore index and sets the distribution type to ROUND_ROBIN.

Temporary tables in SQL Data Warehouse are similar to those in SQL Server or SQL Database except that you cannot create global temporary tables or create views based on temporary tables. However, SQL Data Warehouse temporary tables exist at the session level, which means they can be accessed from anywhere inside the session.

In addition to copying tables and creating temporary tables, the CTAS statement can be useful for working around T-SQL options not supported in SQL Data Warehouse, such as SELECT…INTO, MERGE, or UPDATE and DELETE statements that include ANSI joins:

  • You can rewrite a SELECT…INTO statement as a CTAS statement, simply porting over the logic.
  • You can replace a MERGE statement with a multi-statement transaction, starting with a CTAS statement that creates a table based on SELECT statements joined together using the UNION ALL operator.
  • If an UPDATE or DELETE statement includes an ANSI join, you can replace the statement with a multi-statement transaction that first creates a table based on a SELECT join and then uses the table to create an implicate join with the table to be modified.

You can find specifics about these three alternatives-along with examples-in the Azure help topic Create Table As Select (CTAS) in SQL Data Warehouse.

Creating views and stored procedures

As with SQL Server and SQL Database, views provide a handy way to abstract the underlying schema and persist queries that use optimized table joins or other T-SQL elements. Creating a view in SQL Data Warehouse is also similar to creating one in SQL Server or SQL Database, as shown in the following example:

As you probably expect by now, SQL Data Warehouse views are not as robust as in SQL Server and SQL Database. To begin with, SQL Data Warehouse views can include only the metadata. You cannot create indexed (materialized) views or updateable views. In addition, the views do not support encryption, schema binding, or returning metadata information to the the DB-Library, ODBC, and OLE DB APIs.

SQL Data Warehouse also lets you create stored procedures, although these too are not as robust as the stored procedures in SQL Server. In fact, they are fairly basic, made up primarily of parameter definitions and the main statement block, as shown in the following example:

All we’re doing here is creating a procedure that retrieves a customer’s total sales, based on a specific customer ID. We can then use an EXECTUTE statement to run the procedure, just like in SQL Server:

In this case, we’re passing in the customer ID 4367 in order to get the total sales for that customer.

At their most basic level, working with stored procedures is fairly straightforward, but know that there are a number of limitations in SQL Data Warehouse. For example, you cannot create CLR stored procedures or ones that are temporary, numbered, or extended. The stored procedures also do not support default, table-valued, or read-only parameters, and they do not provide an encryption or replication option.

Querying SQL Data Warehouse tables

You’ll also find a number of other limitations or differences along with way when querying a SQL Data Warehouse database. For example, Microsoft expects you to run the entire data warehouse workload on a single database because SQL Data Warehouse does not support inter-database queries. All the tables you use must be contained within the same database.

Given this limitation, it should come as no surprise that the service also does not support distributed transactions. In addition, you cannot create nested transactions or set a savepoint within a transaction.

Another T-SQL limitation has to do with variables. Although you can define variables like you can in SQL Server, you cannot use a SELECT or UPDATE statement to assign a value to a variable.

SQL Data Warehouse also does not support blob data types, including varchar(max) and nvarchar(max). If you plan to construct dynamic SQL statements that will exceed the 8000-byte maximum, you’ll have to break up the string in manageable chunks, as in the following example:

Although this is just a simple example, it demonstrates how you might break up a T-SQL statement and then use an EXECUTE statement to concatenate the string and run the statement in its entirety.

There is one T-SQL element that you get with SQL Data Warehouse you don’t get with SQL Server or SQL Database: the LABEL query option. The option lets you assign a name to a query, which you can then use to retrieve data about that query. To add a label, you must include the OPTION clause in your query, along with the LABEL option, as shown in the following example

You can specify the LABEL option on its own, as I’ve done here, or include it along with one or more query hints, such as HASH JOIN or MERGE JOIN. After you run your query, you can then reference the query by its label. For example, the following SELECT statement retrieves the status and elapsed time from the sys.dm_pdw_exec_requests dynamic management view:

When I ran the query, the SELECT statement returned a status value of Completed and a total_elapsed_time value of 1016 (for milliseconds).

T-SQL in SQL Data Warehouse

There’s a lot more to querying a SQL Data Warehouse database than what we’ve covered here, but this should give you a good place to start. The important point to take out of all this is that you can’t assume your queries will run against a SQL Data Warehouse database just because they run against a SQL Server database. Even the statements you run against SQL Database won’t necessarily work the same way, even though SQL Data Warehouse leverages the SQL Database technologies. When first starting out, you’ll likely want to refer often to the Transact-SQL Reference (Database Engine) on the MSDN site.

Given that SQL Data Warehouse is still in preview and, like any cloud service, is a work in progress, we’ll likely see the T-SQL elements expanded and refined and brought more in-line with SQL Server and SQL Database, at least where practical. Even so, there is a lot you can already do, and it’s worth taking the time up-front to start familiarizing yourself with how T-SQL is implemented in SQL Data Warehouse if you plan to work with its databases.