Faking Arrays in Transact SQL

It is a simple routine that we all need to use occasionally: parsing a delimited list of strings in T-SQL. In a perfect relational world, it isn't necessary, but real-world data often comes in a form that requires one of the surprising variety of routines that Anith Sen describes, along with sage advice about their use.

Introduction

Sometimes SQL programmers come up with a requirement to use multi-valued columns or variables in an SQL query, especially when the data comes from somewhere else. Since there is no built-in support for any such datatype as an array or list in T-SQL, folks quite often use delimited VARCHAR strings to lump multiple values together.

We all know that having multiple values in a single column is against a fundamental tenet of the relational model, since an attribute in a relation can have only one value, drawn from an appropriate type.

SQL tables, by their very nature, do not allow multiple values in their columns. This is a sensible stricture which, even before the XML datatype appeared, programmers have occasionally defeated with varying degrees of success. Users can interpret a single string as a combination of smaller strings, usually delimited by a character such as a comma or a space. By leveraging the existing string functions, programmers can then extract those smaller parts from the concatenated string.

It is very important to note that some of the methods that we'll describe are kludgy, while others appear to violate the fundamentals of data independence altogether. Such methods are suggested here only for the sake of completeness, and are not recommended for use in production systems. If you are using any methods that are undocumented in the product manual, use them with due caution, and all relevant caveats apply. Basic recommendations, such as using proper datatype conversion techniques and avoiding positional significance for columns, must be considered for stable production code.

Considerations

Most of the methods for parsing an array and using it for data manipulation are used to insert data as multiple rows into tables. On other occasions, the programmer has the task of cleaning and querying badly-designed data represented in a table.

The following sections illustrate a variety of methods one can employ to identify and enlist subsections of a string represented in a variable, a parameter or even a column value in a table. The examples use a comma-separated list, commonly known as a CSV string, where the value is represented as:
           ‘item-1,item-2,item-3….item-n’.
In practice, you can use any character, including a space, as the delimiter, and adapt the methods accordingly.

For the examples below, a few customer identifiers are randomly chosen from the Customers table in the Northwind database. We'll use this list in each example.
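For the sketches that follow, assume a variable along these lines (the variable name @list is an assumption carried through the later examples; the four identifiers are genuine Northwind CustomerID values):

DECLARE @list VARCHAR(1000);
SET @list = 'ALFKI,BONAP,CACTU,FRANK';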

Northwind is a sample database included in default installations of SQL Server 2000. You can download a copy from the Microsoft Download Center.

Direct Data comparison

For simple comparisons, there is no need for complicated routines. The inherent pattern matching features in Transact SQL can be used directly in most cases. One prime example is to pass in a list of values and use it in the IN list of a WHERE clause. Here are some common methods:

Using the CHARINDEX function:
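A minimal sketch of this approach, using the @list variable declared earlier; both the list and the search value are padded with delimiters so that partial matches such as 'ALF' are not picked up:

SELECT CustomerID, CompanyName
FROM Customers
WHERE CHARINDEX(',' + CustomerID + ',', ',' + @list + ',') > 0;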

Using pattern matching with PATINDEX:
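The same test expressed with PATINDEX and a wildcard pattern, again as a sketch against the padded @list:

SELECT CustomerID, CompanyName
FROM Customers
WHERE PATINDEX('%,' + CustomerID + ',%', ',' + @list + ',') > 0;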

Using the LIKE operator for pattern matching:
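And the equivalent sketch with LIKE:

SELECT CustomerID, CompanyName
FROM Customers
WHERE ',' + @list + ',' LIKE '%,' + CustomerID + ',%';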

String Parsing

In many cases, you may want to use the parsed list of values as a resultset that can be used in subsequent operations. For instance, for larger lists, it may be more effective to use a JOIN rather than an IN() clause. Another common scenario is the multi-row insert, where the list is parsed and the individual elements are inserted using a single INSERT statement.

Using the PARSENAME() function

The PARSENAME function returns a SYSNAME value (equivalent to NVARCHAR(128)) and is generally used for identifying the parts of a database object's name: the object name, owner name, database name and server name. However, it can also be used for parsing a string, provided the string is made up of fewer than five delimited values. Typical scenarios for applying this approach would be splitting up a full name, identifying the parts of an IP address and so on. Note that PARSENAME recognizes only the period as a delimiter, so any other delimiter must first be converted with REPLACE.

For example:
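A sketch of both typical uses; PARSENAME counts its parts from the right, so part 4 is the leftmost:

DECLARE @ip VARCHAR(15);
SET @ip = '192.168.10.2';

SELECT PARSENAME(@ip, 4) AS octet1,
       PARSENAME(@ip, 3) AS octet2,
       PARSENAME(@ip, 2) AS octet3,
       PARSENAME(@ip, 1) AS octet4;

-- A short CSV can be split the same way, provided the values contain no periods
SELECT PARSENAME(REPLACE('item-1,item-2,item-3', ',', '.'), 3) AS item1;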

Solutions using a table of numbers

In most cases with larger strings, the faster solutions are often the ones using a table of sequentially incrementing numbers. However, performance assertions in general should be taken with a grain of salt since, without testing, it is almost impossible to conclude that one method performs better than another. Before we go through the various solutions that use a table of numbers, we'll run through a few of the approaches to creating one.

Creating a Number (Sequentially incrementing values) table in T-SQL

A table of monotonically increasing numbers can be created in a variety of ways. A base table, a view, or any expression that can produce a sequence of numbers can be used in these scenarios. Some of the common methods to create a table of numbers are detailed in the next section.

Though sequentially incrementing values can be generated as part of the query, it is generally recommended that you create a permanent base table and insert as many numbers as you need for the various solutions. It is also advisable to make the number column a primary key, to avoid any potential duplication of rows.

Identity based sequences

Using the default values for an identity column in a table, one can simply insert default rows to generate the sequence. One consideration with this approach is that it can be used only with permanent base tables. Note that, without any arguments, the IDENTITY property uses the value 1 for both seed and increment.
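A minimal sketch (the table name dbo.Numbers and the row count are illustrative; the GO <count> repetition shortcut requires the SQL Server 2005 or later query tools, otherwise wrap the INSERT in a loop). The later examples in this article assume a number table like this one:

CREATE TABLE dbo.Numbers (n INT IDENTITY(1, 1) PRIMARY KEY);
GO
INSERT INTO dbo.Numbers DEFAULT VALUES;
GO 10000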

With the IDENTITY() function, we can use SELECT INTO to create a new table from any existing table. This example demonstrates using a CROSS JOIN between two existing tables.
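A sketch using Northwind's Customers table (91 rows, so the CROSS JOIN yields 8,281 numbers):

SELECT IDENTITY(INT, 1, 1) AS n
INTO dbo.Numbers
FROM Customers AS c1
CROSS JOIN Customers AS c2;

ALTER TABLE dbo.Numbers ADD CONSTRAINT PK_Numbers PRIMARY KEY (n);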

Regular loops using WHILE

Here is a simple procedural loop that inserts one row at a time.
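A version of that loop might look like this:

CREATE TABLE dbo.Numbers (n INT PRIMARY KEY);

DECLARE @i INT;
SET @i = 1;
WHILE @i <= 10000
BEGIN
    INSERT INTO dbo.Numbers (n) VALUES (@i);
    SET @i = @i + 1;
END;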

The efficiency of the above loop can be improved using the following suggestion from a newsgroup posting by Itzik Ben-Gan.
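His original posting isn't reproduced here, but the commonly quoted form of the idea doubles the set on each pass instead of inserting one row at a time, so 10,000 rows need only a handful of set-based INSERT statements:

CREATE TABLE dbo.Numbers (n INT PRIMARY KEY);

DECLARE @rc INT;
SET @rc = 1;
INSERT INTO dbo.Numbers (n) VALUES (1);

WHILE @rc * 2 <= 10000
BEGIN
    INSERT INTO dbo.Numbers (n)
    SELECT n + @rc FROM dbo.Numbers;
    SET @rc = @rc * 2;
END;

-- top up the remainder beyond the last power of two
INSERT INTO dbo.Numbers (n)
SELECT n + @rc
FROM dbo.Numbers
WHERE n + @rc <= 10000;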

Sequences based on existing tables

Using an existing base table or view can be an easy way to generate a sequence of numbers, especially when you don't want to create a permanent table to support a single query. One drawback to such mechanisms is that, as the dataset gets larger, the comparative efficiency of the inserts can become an issue.

Here is an example using a correlated subquery based on the unique column.
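A sketch, using Northwind's CustomerID primary key as the unique column:

SELECT (SELECT COUNT(*)
        FROM Customers AS c2
        WHERE c2.CustomerID <= c1.CustomerID) AS n
FROM Customers AS c1
ORDER BY n;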

This is the same logic as above, but using a self join.
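Reconstructed as a sketch:

SELECT COUNT(*) AS n
FROM Customers AS c1
INNER JOIN Customers AS c2
        ON c2.CustomerID <= c1.CustomerID
GROUP BY c1.CustomerID
ORDER BY n;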

Another quick way to generate the sequence is to use the ROW_NUMBER() function; the example in the next section uses a CROSS JOIN just to illustrate the ease of formulation.

With ROW_NUMBER()

For each row returned in a resultset, the ROW_NUMBER function returns a sequential number, starting at 1. One can use any existing table, view or resultset to generate a ROW_NUMBER() expression. Here is an example:
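This sketch wraps the numbering in a derived table so that the filter on n is deterministic:

SELECT n
FROM (SELECT ROW_NUMBER() OVER (ORDER BY c1.CustomerID) AS n
      FROM Customers AS c1
      CROSS JOIN Customers AS c2) AS d
WHERE n <= 5000;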

As a matter of fact, you can use any arbitrary expression, or built-in functions like CURRENT_TIMESTAMP or NEWID(), in the ORDER BY clause to generate the sequential number.
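For instance, a quick sketch ordering by NEWID():

SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS n
FROM Customers;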

Approaches using Common table expressions

Common table expressions are aliased resultsets derived from a simple query and defined within the execution scope of a single DML statement. A CTE, with an alias, an optional column list and a definition query, can help simplify complex queries and improve readability. In general, any query that uses a CTE can be rewritten without one, but there are certain instances where the CTE construct offers the best-performing query.

Using a CTE, some of the above methods for generating a sequence of numbers can be simplified.

Here is a simple transformation using a derived table of unit digits. The logic is pretty much self-explanatory.
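A sketch generating 1 through 1,000 from a ten-row CTE of digits:

WITH Digits (d) AS (
    SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL
    SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL
    SELECT 8 UNION ALL SELECT 9
)
SELECT ones.d + tens.d * 10 + hundreds.d * 100 + 1 AS n
FROM Digits AS ones
CROSS JOIN Digits AS tens
CROSS JOIN Digits AS hundreds
ORDER BY n;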

A recursive CTE is a common table expression that includes references to itself. With an anchor member and a recursive member, a recursive CTE can generate a number listing fairly efficiently. Also note that the default recursion limit of 100 can be changed by using OPTION (MAXRECURSION n) to set the desired limit (0 denotes unlimited).
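A sketch:

WITH Numbers (n) AS (
    SELECT 1            -- anchor member
    UNION ALL
    SELECT n + 1        -- recursive member
    FROM Numbers
    WHERE n < 10000
)
SELECT n
FROM Numbers
OPTION (MAXRECURSION 0);    -- the default limit of 100 would stop this early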

Here is a clever method, again by Itzik Ben-Gan, using a CROSS JOIN of nested CTE references. This is not only quick but can be used for generating larger datasets.
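The version that circulates in his postings looks essentially like this; each level squares the row count of the level below it:

WITH
L0   AS (SELECT 1 AS c UNION ALL SELECT 1),               -- 2 rows
L1   AS (SELECT 1 AS c FROM L0 AS a CROSS JOIN L0 AS b),  -- 4 rows
L2   AS (SELECT 1 AS c FROM L1 AS a CROSS JOIN L1 AS b),  -- 16 rows
L3   AS (SELECT 1 AS c FROM L2 AS a CROSS JOIN L2 AS b),  -- 256 rows
L4   AS (SELECT 1 AS c FROM L3 AS a CROSS JOIN L3 AS b),  -- 65,536 rows
Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY c) AS n FROM L4)
SELECT n
FROM Nums
WHERE n <= 10000;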

Parsing a delimited string using a Number Table

Now that we have our number table, we can start using it in a number of different ways to split fake arrays. The general idea is to use the sequence value as the second argument to the SUBSTRING() function, and then use the WHERE clause to limit the number of splits using a terminating criterion. Note that this general logic can be written in a variety of ways.
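A sketch of the basic pattern, assuming the dbo.Numbers table and the @list variable from earlier; the list is padded with delimiters, and every position n holding a comma marks the start of an item:

SELECT SUBSTRING(',' + @list + ',', n + 1,
                 CHARINDEX(',', ',' + @list + ',', n + 1) - n - 1) AS item
FROM dbo.Numbers
WHERE n <= LEN(',' + @list + ',') - 1
  AND SUBSTRING(',' + @list + ',', n, 1) = ',';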

Similar logic appears in a query from a public newsgroup posting by Linda Wierzbecki.

Here are a couple of methods that not only parse the delimited list, but return the position of each item in the list as well. This one uses a self join and was popularized by Joe Celko in SQL newsgroup postings.
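Celko's exact posting isn't reproduced here, but a reconstruction of the self-join idea follows: the second copy of the number table counts the delimiters at or before each item's starting comma, which yields the item's ordinal position:

SELECT COUNT(n2.n) AS pos,
       SUBSTRING(',' + @list + ',', n1.n + 1,
                 CHARINDEX(',', ',' + @list + ',', n1.n + 1) - n1.n - 1) AS item
FROM dbo.Numbers AS n1
INNER JOIN dbo.Numbers AS n2
        ON n2.n <= n1.n
       AND SUBSTRING(',' + @list + ',', n2.n, 1) = ','
WHERE n1.n <= LEN(',' + @list + ',') - 1
  AND SUBSTRING(',' + @list + ',', n1.n, 1) = ','
GROUP BY n1.n
ORDER BY pos;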

The following approach, from a posting by Umachandar Jayachandran, uses the REPLACE function to get the number of commas in the string and calculate the position of each item value.
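Again a reconstruction of the idea rather than his exact query: since the padded prefix up to position n is n characters long, n minus the length of the same prefix with its commas REPLACEd away gives the number of commas so far, which is the item's position:

SELECT n - LEN(REPLACE(LEFT(',' + @list + ',', n), ',', '')) AS pos,
       SUBSTRING(',' + @list + ',', n + 1,
                 CHARINDEX(',', ',' + @list + ',', n + 1) - n - 1) AS item
FROM dbo.Numbers
WHERE n <= LEN(',' + @list + ',') - 1
  AND SUBSTRING(',' + @list + ',', n, 1) = ',';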

The following method also gives you the positional value, and is simpler:
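One simple formulation (a sketch, not necessarily the query the original article showed) captures the position with an IDENTITY column while inserting the parsed items:

CREATE TABLE #items (pos INT IDENTITY(1, 1) PRIMARY KEY, item VARCHAR(100));

INSERT INTO #items (item)
SELECT SUBSTRING(',' + @list + ',', n + 1,
                 CHARINDEX(',', ',' + @list + ',', n + 1) - n - 1)
FROM dbo.Numbers
WHERE n <= LEN(',' + @list + ',') - 1
  AND SUBSTRING(',' + @list + ',', n, 1) = ','
ORDER BY n;    -- identity values follow the ORDER BY of an INSERT...SELECT

SELECT pos, item FROM #items;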

With the ROW_NUMBER() and RANK() functions introduced in SQL Server 2005, one can create the positional values much more easily.
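For instance, a sketch that numbers the split items in list order:

SELECT ROW_NUMBER() OVER (ORDER BY n) AS pos,
       SUBSTRING(',' + @list + ',', n + 1,
                 CHARINDEX(',', ',' + @list + ',', n + 1) - n - 1) AS item
FROM dbo.Numbers
WHERE n <= LEN(',' + @list + ',') - 1
  AND SUBSTRING(',' + @list + ',', n, 1) = ',';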

You can wrap any of these methods that use a table of sequentially incrementing numbers into a table-valued UDF or a stored procedure, making the routine more generic, reusable and handy.
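A minimal sketch of such a function; the name dbo.fnSplit and its signature are illustrative, and the ROW_NUMBER() call means it requires SQL Server 2005 or later:

CREATE FUNCTION dbo.fnSplit (@list VARCHAR(8000), @delim CHAR(1))
RETURNS TABLE
AS
RETURN (
    SELECT ROW_NUMBER() OVER (ORDER BY n) AS pos,
           SUBSTRING(@delim + @list + @delim, n + 1,
                     CHARINDEX(@delim, @delim + @list + @delim, n + 1) - n - 1) AS item
    FROM dbo.Numbers
    WHERE n < LEN(@delim + @list + @delim)
      AND SUBSTRING(@delim + @list + @delim, n, 1) = @delim
);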

With Common Table Expressions

Most of the methods described above can be re-written using common table expressions. Here is one way of using a CTE to get the starting and ending positions of each element and then using SUBSTRING to extract the value:
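In this sketch, the first CTE finds each item's starting comma and the second derives the start and end offsets:

WITH Starts (n) AS (
    SELECT n
    FROM dbo.Numbers
    WHERE n <= LEN(',' + @list + ',') - 1
      AND SUBSTRING(',' + @list + ',', n, 1) = ','
),
Bounds (startPos, endPos) AS (
    SELECT n + 1, CHARINDEX(',', ',' + @list + ',', n + 1)
    FROM Starts
)
SELECT SUBSTRING(',' + @list + ',', startPos, endPos - startPos) AS item
FROM Bounds;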

Another concise approach with common table expressions is to use a recursive CTE. It does not need a table of sequentially incrementing numbers; however, it can be a bit inefficient for larger strings.

Here is an example.
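This sketch walks the string by tracking the boundaries of each element; the anchor member finds the first delimiter and the recursive member picks up where the previous one stopped:

WITH Split (startPos, endPos) AS (
    SELECT 1, CHARINDEX(',', @list + ',')
    UNION ALL
    SELECT endPos + 1, CHARINDEX(',', @list + ',', endPos + 1)
    FROM Split
    WHERE endPos < LEN(@list + ',')
)
SELECT SUBSTRING(@list, startPos, endPos - startPos) AS item
FROM Split
OPTION (MAXRECURSION 0);    -- needed for lists with more than 100 items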

A similar approach has been suggested by Hari Mohan, a systems analyst with BFL Software.

As mentioned before, any of the above-mentioned parsing routines can be written as a table-valued user-defined function or a view and used directly in queries.
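For instance, with the illustrative dbo.fnSplit function sketched earlier:

SELECT c.CustomerID, c.CompanyName
FROM Customers AS c
INNER JOIN dbo.fnSplit(@list, ',') AS s
        ON c.CustomerID = s.item;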

Methods that replace the delimiters

Apart from the above-mentioned approaches, there are certain other tricks that can be used in Transact SQL. The following methods use dynamic SQL; they are a bit different from traditional queries, but can serve smaller string-parsing requirements in certain cases.

Here is an example using dynamic SQL with an IN list in the WHERE clause.
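A sketch; note that concatenating raw input into a statement like this is open to SQL injection, so never feed it unvalidated user input:

DECLARE @sql NVARCHAR(4000);
SET @sql = 'SELECT CustomerID, CompanyName
            FROM Customers
            WHERE CustomerID IN (''' + REPLACE(@list, ',', ''',''') + ''')';
EXEC (@sql);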

This is another illustration, replacing the element delimiters with ' UNION SELECT '. The results can be stored in a temporary table or table variable and then used in subsequent operations.
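A sketch, materializing the result into a temporary table (INSERT...EXEC into a table variable is not supported on SQL Server 2000):

CREATE TABLE #items (item VARCHAR(100));

DECLARE @sql NVARCHAR(4000);
SET @sql = 'SELECT ''' + REPLACE(@list, ',', ''' UNION SELECT ''') + '''';

INSERT INTO #items (item)
EXEC (@sql);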

A similar approach, proposed by Alejandro Mesa in a public newsgroup posting, uses separate INSERT statements against a table created prior to execution. Here is an example:
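This is a sketch of the idea rather than his exact query; a temp table is used here because a table variable declared outside EXEC() would not be visible inside the dynamic batch:

CREATE TABLE #t (item VARCHAR(100));

DECLARE @sql NVARCHAR(4000);
SET @sql = 'INSERT INTO #t VALUES ('''
         + REPLACE(@list, ',', ''') INSERT INTO #t VALUES (''') + ''')';
EXEC (@sql);

SELECT item FROM #t;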

XML solutions

OPENXML function

An approach that is getting much attention is the OPENXML method. Though it may not be ideal for larger datasets, it can be used effectively for a relatively small number of items, especially if the XML string is generated by a client application. The basic idea is to pass the values to the stored procedure as an XML document instead of a CSV string. Note that you can use the REPLACE function to change a CSV string into XML format, say within a stored procedure. Here is an example.
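A sketch, passing the customer ids as attributes; the document handle must be released afterwards:

DECLARE @h INT;
DECLARE @doc VARCHAR(1000);
SET @doc = '<ROOT>
              <Customer id="ALFKI"/><Customer id="BONAP"/>
              <Customer id="CACTU"/><Customer id="FRANK"/>
            </ROOT>';

EXEC sp_xml_preparedocument @h OUTPUT, @doc;

SELECT id
FROM OPENXML(@h, '/ROOT/Customer', 1)    -- flag 1 = attribute-centric mapping
WITH (id NCHAR(5));

EXEC sp_xml_removedocument @h;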

One can easily generate the XML document, containing only the id values, from the CSV variable.
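For instance, a sketch assuming the values contain no characters that need XML escaping:

DECLARE @doc VARCHAR(1000);
SET @doc = '<ROOT><Customer id="'
         + REPLACE(@list, ',', '"/><Customer id="')
         + '"/></ROOT>';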

The nodes() method

This feature was introduced in SQL Server 2005 along with the xml datatype. You can use the nodes() method to shred the data so that each element maps to a new row. The value() method can then be applied, as shown below, to extract the individual elements:
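A sketch, converting the CSV into an xml variable first:

DECLARE @x XML;
SET @x = CAST('<i>' + REPLACE(@list, ',', '</i><i>') + '</i>' AS XML);

SELECT t.c.value('.', 'VARCHAR(100)') AS item
FROM @x.nodes('/i') AS t(c);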

Using a CLR function

With the popularity of CLR functions in SQL Server 2005 and beyond, string parsing has become much more versatile. It is easy to create a simple CLR function, deploy the assembly and then use it as a function within T-SQL code. To use such a function, you will have to enable the server-level option 'clr enabled' (for example, with sp_configure). Here is a C# function example that you can create in Visual Studio under a database project:
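The following is a sketch of a typical CLR table-valued split function; the class and method names are illustrative, not the article's original code:

using System;
using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    // Table-valued function: returns one row per delimited element.
    [SqlFunction(FillRowMethodName = "FillRow",
                 TableDefinition = "item NVARCHAR(4000)")]
    public static IEnumerable SplitString(SqlString input, SqlString delimiter)
    {
        if (input.IsNull || delimiter.IsNull)
            return new string[0];

        return input.Value.Split(delimiter.Value.ToCharArray(),
                                 StringSplitOptions.RemoveEmptyEntries);
    }

    // Called once per element to map it onto the output row.
    public static void FillRow(object row, out SqlString item)
    {
        item = new SqlString((string)row);
    }
}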

Once you have built this function and created the .dll, you can create an assembly and a table-valued function that can be accessed from within T-SQL.
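A sketch, with an illustrative assembly name and file path:

EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO
CREATE ASSEMBLY SplitAssembly
FROM 'C:\CLR\SplitAssembly.dll';
GO
CREATE FUNCTION dbo.SplitString (@input NVARCHAR(4000), @delimiter NVARCHAR(10))
RETURNS TABLE (item NVARCHAR(4000))
AS EXTERNAL NAME SplitAssembly.UserDefinedFunctions.SplitString;
GO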

The usage is pretty straightforward:
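Continuing with the illustrative names above:

SELECT item
FROM dbo.SplitString('ALFKI,BONAP,CACTU,FRANK', ',');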

You can find several comparable CLR functions in newsgroups and blogs, with heavy discussion of the performance implications of using them. For some of the popular benchmarks, I would recommend the article by Erland Sommarskog mentioned in the references section.

Simple WHILE loop

Another popular method is to use a procedural WHILE loop. Very popular among programmers by virtue of its simplicity, this method is not particularly efficient for larger datasets. Here is an example:
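A sketch, chopping the leading element off the string until nothing is left:

DECLARE @s VARCHAR(1000), @pos INT;
DECLARE @items TABLE (item VARCHAR(100));

SET @s = @list + ',';                -- a terminating delimiter simplifies the loop
SET @pos = CHARINDEX(',', @s);

WHILE @pos > 0
BEGIN
    INSERT INTO @items (item) VALUES (LEFT(@s, @pos - 1));
    SET @s = SUBSTRING(@s, @pos + 1, LEN(@s));
    SET @pos = CHARINDEX(',', @s);
END;

SELECT item FROM @items;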

Inserting parsed values into multiple columns

When the passed values are to be returned as column values in a single row, rather than as multiple rows, we can use the PIVOT operator. Since we will always know the number of columns that will be in the resultset, formulating the query with a static PIVOT list is not too difficult.
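A sketch, reusing the illustrative dbo.fnSplit function and pivoting on its position column; the static IN list assumes a four-element @list:

SELECT [1] AS item1, [2] AS item2, [3] AS item3, [4] AS item4
FROM (SELECT pos, item
      FROM dbo.fnSplit(@list, ',')) AS s
PIVOT (MAX(item) FOR pos IN ([1], [2], [3], [4])) AS p;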

Conclusion

Be judicious and apply common sense when using any of these string-parsing routines in T-SQL. Each of the methods described above has benefits in certain scenarios and may not be appropriate for all situations. If you need more detailed analysis of these and a few other methods, including performance considerations, consult the articles in the references section.

References

Acknowledgements

Aaron Bertrand, Erland Sommarskog, Umachandar Jayachandran, Linda Wierzbecki, Joe Celko, Alejandro Mesa, Kristen Hodges and Itzik Ben-Gan.