Effective Strategies for Storing and Parsing XML in SQL Server

Comments 0

Share to social media

XML is a common storage format for data, metadata, parameters, or other semi-structured data. Because of this, it often finds its way into SQL Server databases and needs to be managed alongside other data types.

Even though a relational database is not the optimal place to store and manage XML data, it is often needed due to application requirements, convenience, or a need to maintain this information in close proximity to other app data.

This article dives into a variety of common XML challenges and the functionality included in SQL Server to help make managing them as simple as possible.

Store XML as XML. Store non-XML as non-XML.

XML will enter our data lives in one of two ways:

  1. Data stored in an XML-typed column.
  2. Data stored in some other format.

Before diving into managing XML, it is important to note that not all XML requires a data engineer to know advanced (or any) knowledge about how to index, filter, update, or use XML.

Sometimes, XML is stored in a database for convenience and is simply written or read by the application directly. In these scenarios, there is little architectural work involved aside from the need to have an XML-typed column available to store the data.

In many ways, this is an ideal situation. The purpose of a database is to store and retrieve data, and not needing to index, filter, or shred XML allows a database to do what it is best at while not introducing complexity into database or application code to manage it. If XML is stored in SQL Server, it will often need to be directly queried, even if filtering, aggregating, or other more advanced functionality is not needed.

If possible, store XML data in an XML typed column. Developers sometimes choose to store XML in string columns instead, such as VARCHAR(MAX). While strings are more familiar to us and easier to work with, they do not provide the many XML-related benefits that an XML-typed column does. Oftentimes, when XML functionality is later needed against a string column, the result is the awkward need to convert from VARCHAR to XML and then back again, as needed. This is inefficient and creates additional opportunities for mistakes to occur as data is being frequently converted, in addition to written, read, filtered, etc…

Similarly: do not store data as XML unless needed. Data or metadata that will be routinely consumed by SQL Server ideally should be stored in a format that is quick and easy to read/write. Whether in a configuration table, metadata table, temporal table, or some other convenient data structure, consider the ideal candidate for storing data before finalizing the decision.

Data types are database design decisions that have long lives ahead of them. Choosing the wrong data type for an important data element can result in poor performance, wasted time and effort, and technical debt that remains for years to come. Therefore, be sure to store XML in XML-typed columns and non-XML in non-XML data types.

Parsing XML

The most common need with XML is to read it, either in its entirety or by pulling out specific details from a document. For the demos in this article, the dbo.DatabaseLog table will be used, which can be found in the AdventureWorksDW database. (You can download a backup to restore here from the Microsoft site. I am using the 2022 version of the database.)

This table contains logs of SQL commands used to create or modify database objects. For example, the following query retrieves a single row from the table:

The results show a column called XmlEvent:

The simplest way to read data from an XML column is by using an XQuery against it. In the following example, several elements are returned, each as a separate column:

Each column specifically retrieves the first element for the various tags provided. The results look like this:

The data type provided alongside the tag detail is the data type that the result will be returned as. Using VARCHAR(MAX) is overkill for EventType, ServerName, and LoginName, as these columns will not exceed 128 characters. It is a fairly typical process to start with a large datatype as you are interrogating the data in XML. Moving forward in this article, data types will be chosen that are more fitting for the data elements provided.

This syntax is perfect when you know exactly what data elements are needed, what they are called, and where to find them. This is an easy way to parse XML when it is served to you nicely on a platter. If only life were always so easy!

OPENXML may also be used to parse an XML document. This is ideally used on a scalar variable or XML that is read into a single variable/parameter. An extra step is needed to prepare the XML document, which converts it into a format that SQL Server can easily consume:

The script above takes the same XML from the previous example, stores it in a scalar XML variable, and then converts it using sp_xml_preparedocument, placing the result into @XMLDoc. (Note that as this is a variable, you will need to execute the following SELECT statement in the same batch of code for it to execute.)

From here, there are a handful of options available. The first allows for specific elements to be retrieved from the XML document, like this:

This syntax will explicitly return the same four elements that were previously parsed out of the XML document. The second parameter (2) indicates that the XML is element centric. If an XML document is attribute-centric, then set this to 1.

Ensure that data types are correct, or an error will be returned. For example, if EventType were declared as an INT, then the following error would be thrown.

Msg 245, Level 16, State 1, Line 61
Conversion failed when converting the nvarchar value 'ALTER_TABLE' to data type int.

If there is a need to return everything from an XML document, then consider the following syntax, instead:

The result will be a hierarchical table that looks like this:

Note that while the column headers will the same no matter what sort of XML document is parsed, the content of the rows will vary greatly. There are many ways to use this data, such as:

  • Return all details for a given attribute.
  • Search for specific values.
  • Validate the existence (or lack thereof) for an attribute or value.

Because XML documents can be large, be sure that you are prepared for the size of the output. If testing, a TOP or COUNT(*) may be used with the SELECT statement that returns the table. This can help with understanding the size of the output before diving into it.

The results use a pairing of id and parentid to create a hierarchy to understand each level of XML within the document. Nodetype indicates the type of XML node that the row references, which can be one of three values:

  1. Element node
  2. Attribute node
  3. Text node

localname is the tag name. If it references an XML text value, then it will contain the string, “#text”.

The column prev indicates the id of the previous element tag. If that element contains many levels of detail, then there can be a long gap of NULLs between each value in this column.

Finally, the column text contains the element text detail, if applicable. It will be NULL for XML metadata and contain values for XML data.

If XML document manipulation is a part of a larger script\batch, then it is a best practice to remove the document when it is no longer needed. This can be accomplished with sys.sp_xml_removedocument, like this:

Creating XML

Sometimes there is a need to create XML documents for use elsewhere in SQL Server, or by an application. There are a variety of ways to do this, which will be outlined here.

The simplest way to generate XML quickly is by using FOR XML AUTO. This mindless syntax can convert the contents of a SQL query into XML. One nice aspect of this syntax is that the query can include any of the standard T-SQL syntax, such as WHERE, GROUP BY, and ORDER BY, for example. The following query generates XML that converts the contents of four columns in dbo.DatabaseLog into elements within an XML document.

Here is the XML output:

Note that table name is included as the tag for each set of attributes and the attribute names are simply the column name. Each row is its own addition to the XML file, as it is attribute centric.

Element-centric XML may be generated by adding ELEMENTS to the FOR XML AUTO clause:

This addition to the query greatly changes the shape of the XML output. The following is a sample of how it looks:

In this scenario, each row has been converted into its own element. Within each element are the detailed gleaned from the columns that were selected in the query. The choice to use attribute-centric or element-centric XML will depend on the application or process that is consuming the document.

There is no need to accept the table name as the root node name. This can be easily adjusted like this:

In this example, the root node will be named “Log” instead of “dbo.DatabaseLog”. Here is a sample of the output:

Since the XML is being generated directly from the query output, columns may be renamed, converted to other data types, or formatted as part of the query. Basically, any of the usual tactics that may be used to adjust how column output looks can be used here as well.

If XML column values are included in the creation of an XML document, they will be added and nested accordingly. For example, the following query includes the column XmlEvent, which itself contains XML data:

The output will include nested attributes based on the contents of each XmlEvent value, like this partial output:

This can be a handy way to combine XML and non-XML data quickly and easily, if needed.

One final way to create XML data in SQL Server is somewhat obvious, but often overlooked, and is to do so directly:

If XML data already exists outside of a table, variable, or parameter, its value can be updated in the same way as if it were any other SQL Server data type.

Updating XML

There are a variety of ways to update XML using standard XML methods. The following are a handful of ways to make inline changes to XML using the XQuery method modify(), without the need to rewrite the entire document.

modify() may be used to insert, delete, and update values within an XML document. This is very much a scalpel that can meticulously adjust XML values without creating change throughout an entire document.

Note that SQL Server expects values to be modified inline as part of the syntax. An attempt to update XML like this will result in an error message:

This will throw a confusing error:

Msg 8137, Level 16, State 1, Line 168
Incorrect use of the XML data type method 'modify'. A non-mutator method is expected in this context.

Could that be any more confusing?! If attempts to use modify() are failing with an error like this, be sure to check the syntax being used and verify it is exactly as demonstrated. Once working, it is easy enough to adjust, but getting it right the first time can be a bit awkward.

It is a bit counter-intuitive to what we are used to, but when updating XML with an UPDATE…SET statement, there is no equals operator. Simply use SET with the column and modify to get the desired results.

Consider the following XML sample:

What if there was a need to add a username to only the second log entry? This can be done using modify() and the insert…into… syntax:

This query will add the username attribute to the end of the Log tag. Specifying [2] tells the query to insert the tag (username) and its associated value (EdPollack) to only the second Log entry in the XML document. The result is as follows:

If the numeric index provided does not correspond to a value in the XML document, the query will still succeed without an error, but no changes will be made. This can be handy when there is a need to conditionally add a tag, but only if a list is of a certain length.

A variable or parameter may also be passed into modify(), using a slightly different syntax:

Instead of providing an explicit value, the variable @UserName is used instead. This allows parameters and other scalar data to be added to an XML document without the need for dynamic SQL or some other messier solution. The result for this query is identical to the previous example.

This syntax may also be used as part of an UPDATE statement against a table, allowing one or many values to be updated in a single batch. Note that the query does not have an equals operator in it:

This example used after instead of into, allowing the query to add the new element directly after the first instance of EventType within EVENT_INSTANCE. The results look like this:

Conditionals may be embedded into a modify() call, allowing for changes to be made based on column data or variable values. An example of this would be to check if the color “Orange” from the prior example is in the XML document in the expected location. Based on that condition, different values could be inserted. Alternatively, a change could be made under one condition, but not another.

The following example checks to see if the colorOrange” is found in EVENT_INSTANCE. If so, then create a new element with the flavorspicy”, otherwise add the flavorsweet”:

I intentionally used some silly examples here as they are very easy to spot in the resulting XML. The results look like this:

Note that the text used within the conditional statement is case-sensitive by default and does not follow the object collation that is defined within a database. Even if the database collation is case insensitive, the conditional text will validate an exact match, including case.

Note that if you execute this multiple times, you will get multiple flavor tags in the XML.

Values can be updated within XML using modify() in a similar fashion by using the replace option. This example changes the color from its previous value to “Purple”:

In the event that there are multiple elements with the same name, an index may be added to specify which to update:

Since only a single element exists for color in this example, the results of both queries are the same.

modify() can delete from XML documents, as well. The following queries show how the data previously added to XmlEvent can be carefully removed.

The first UPDATE statement deletes the element “flavor” and the second query removes the element “color” from the XML document. Once executed, the resulting XML looks like it did before any changes had been made to it in this article:

If needed, an entire hierarchy of XML elements can be deleted at one time. For example, if the TSQL commands stored in these XML documents were rarely used and were consuming significant space, then they could be deleted using modify(). This script removes TSQLCommand for one specific row:

Removing the WHERE clause would allow this deletion to occur for all XmlEvent XML documents in the table, if needed.

The XQuery method modify() has quite a bit of functionality associated with. Full details for how it can be used in SQL Server can be found here:

modify() Method (xml Data Type) – SQL Server | Microsoft Learn

That article has subsections for various use-cases and method parameters/details.

Conclusion

In part one of this series on XML in SQL Server, we covered the basics of how to parse and query XML, as well as how to modify the data in the XML data.

XML is a bit of a double-edged sword in SQL Server. On the one hand, XML is a commonly used tool for storing and moving data between applications. It is an easy-to-use open standard that has an extensive amount of functionality built-in for querying and updating it. SQL Server provides many ways to read and write XML data, as well as optimizations to improve search speeds and reduce its storage footprint.

On the other edge, despite its convenience, XML is not a relational data format and is not naturally compatible with a transactional database engine such as SQL Server. While storing and retrieving XML is not computationally expensive, searching and updating XML can be slow, arduous processes. The ideal use of XML in SQL Server is as a pass-through to allow data to be written, stored, and retrieved when needed by an application.

If filtering or heavy data manipulation is needed, consider normalizing commonly used XML elements into table columns, allowing for SQL Server to do what SQL Server is best at: managing transactional data. Alternatively, if XML can be stored outside of a transactional database, then doing so may be beneficial.

 

Load comments

About the author

Edward Pollack

See Profile

Ed Pollack has 20+ years of experience in database and systems administration, which has developed his passion for performance optimization, database design, and making things go faster. He has spoken at many SQLSaturdays, 24 Hours of PASS, and PASS Summit. This led him to organize SQLSaturday Albany, which has become an annual event for New York’s Capital Region. In his free time, Ed enjoys video games, traveling, cooking exceptionally spicy foods, and hanging out with his amazing wife and sons.