SQL Server XML Cribsheet

If you're like me and find XML a bit of an acronym minefield, Robyn Page's Cribsheet will help sort out your XSLT from your XDM.

SQL Server XML- The Crib Sheet

For things you need to know, rather than the things you want to know
This is written with the modest ambition of providing a brief overview of XML as it now exists in SQL Server, and the reasons for its presence. It is designed to supplement articles such as Beginning SQL Server 2005 XML Programming.

Contents

Introduction

XML has become the accepted way for applications to exchange information. It is an open standard that can be used on all technical platforms and it now underlies a great deal of the inter-process communication in multi-tiered and distributed architectures.

XML is, like HTML, based on SGML. It is a meta-language, used to define new languages. Although it is not really a suitable means of storing information, it is ideal for representing data structures, for providing data with a context, and for communicating it in context

Previous versions of SQL Server relied on delivering data to client applications as proprietary-format ‘Recordsets’, either using JDBC or ODBC/ ADO/ OLEDB. This limited SQL Server’s usefulness on platforms that could not support these technologies. Before XML, data feeds generally relied on ‘delimited’ ASCII data files, or fixed-format lists that were effective for tabular data, but limited in the information they could provide.

SQL Server has now been enhanced to participate fully in XML-based data services, and allow XML to be processed, stored, indexed, converted, queried and modified on the server. This has made complex application areas, such as data feeds, a great deal easier to implement and has greatly eased the provision of web services based on XML technologies.

XML has continued to develop and spawn new technologies. There are a number of powerful supersets of XML, such as XHTML, RSS, and XAML that have been developed for particular purposes. A number of technologies have been developed for creating, modifying, and transforming XML data. Some of these have been short-lived, but there are signs that the standards are settling down.

XML

Extensible Markup Language (XML) is a simple, flexible, text-based representation of data, originally designed for large-scale electronic publishing. XML is related to HTML, but is data-centric rather than display-centric. It was developed from SGML (ISO 8879) by employees of Sun (notably Jon Bosak) and Microsoft working for W3C, starting in 1996. In XML, as in HTML, tags mark the start of data elements. Tags, at their simplest, are merely the name of the tag enclosed in ‘<‘ and ‘>’ chevrons, and the end tag adds a ‘/’character after the ‘<‘, just like HTML. Attributes can be assigned to elements. The opening and closing tag enclose the value of the element. XML Tags do not require values; they can be empty or contain just attributes. Unlike HTML, XML tag names are not predefined, and are case-sensitive. In XML, there are few restrictions on what can be used as a tag-name. They are used to name the element. By tradition, HTML documents can leave out parts of the structure. This is not true of XML. XML documents must be ‘well formed’, to remove any chance of ambiguity. By ‘well formed’ they must:

  • Have a root element
  • Have corresponding closing tags to every tag (e.g. <;address></address>)
  • Have tags properly nested.
  • Have all attributes enclosed in quotes.
  • Have all restricted characters (‘<‘, ‘>’, ”’, ‘&’ and ‘”‘) properly ‘escaped’ by character entities (&lt;, &gt; &apos; &amp; &quot;).
  • Have matching end-Tags, case-insensitive.

A valid XML document is a well-formed document that is sensible, and conforms to the rules and criteria of the data structure being described in the document. An XML document can be validated against the schema provided by a separate XML Schema document, referenced by an attribute in the root element. This also assigns data types, and constraints, to the data in the document.

XML Support in SQL Server

SQL Server is fundamentally a relational database, conforming where it can to the SQL standards. XML has different standards, so that integration is made more difficult by the fact that the XML data type standards are not entirely the same as the relational data type standards. Mapping the two together is not always straightforward.

XML has considerable attractions for the DBA or Database developer because it provides a way to pass a variety of data structures as parameters, to store them, query and modify them. It also simplifies the process of providing bulk data-feeds. The challenge is to do this without increasing complexity or obscuring the clarity of the relational data-model.

XML’s major attraction for the programmer is that it can represent rowset (single table) and hierarchical (multiple-table) data, as well as relatively unstructured information such as text. This makes the creation, manipulation, and ‘persisting’ of objects far easier. XML can represent a complex Dataset consisting of several tables that are related through primary and foreign keys, in such a way that it can be entirely reconstructed after transmission.

XML documents can represent one or more typed rowsets (XML Information Set or ‘Infoset’). To achieve this, a reference to the relevant XML Schema should be contained in every XML document, or fragment, in order to data-type the XML content. SQL Server now provides a schema repository, or library, for storing XML schemas, and it will use the appropriate schema to validate and store XML data.

Loading XML

XML documents of any size are best loaded using the XML Bulk Load facility, which now has the ability to insert XML data from a flat file into an XML column. You can insert XML data from a file into base tables in SQL Server using the OPENROWSET table function, using the ‘bulk rowset Provider’, with an INSERT statement. The data can then be shredded to relational tables by using the xml.nodes function. (OpenXML can also be used. It is retained by SQL Server for compatibility with SQL Server 2000).

Storing XML

XML documents, XML fragments and top-level text nodes can be stored as XML. XML can be used like any other data type, as a table column, variable, parameter or function return-value. However, there are obvious restrictions due to the fact that, although stored as UTF-16, the XML data is encoded and cannot be directly compared with other XML data, neither can it be used as a primary or foreign key. It cannot have a unique constraint either. The XML data is stored in a binary format rather than ASCII.

Unlike other data types, the XML data type has its own methods to Create, Read, Update or Delete the elements within the XML document.

XML data can have default values, and can be checked by a variation of the RULE, where the validation is encapsulated within a user-defined function.

XML data types can be allocated data by implicit conversion from the various CHAR formats, and TEXT, but no others. There are no implicit conversions from XML data to other formats.

Checking XML (XML Schemas)

To specify the data type for an element or an attribute in an XML document you use a schema.

>XML documents are checked against XML Schemas. The XML Schema is a definition of the data structure used within an XML Document. This indicates, for example, whether a value such as “34.78” (which is stored as a text string within the XML) represents a character string, a currency value, or a numeric value.

If, for example, the XML document represents an invoice, the XML Schema describes the relationship between the elements and attributes, and specifies the data types for that invoice.

You can check, or validate, untyped XML, whether used in a column, variable or parameter, by associating it with an XML Schema. Once checked, it becomes ‘typed’. This ensures that the data types of the elements and attributes of the XML instance are contained, and defined, in the schema. These names are valid within the particular ‘namespace’ specified. An XML Schema definition is, itself, an XML document. These are catalogued in SQL Server as XML Schema collections, and shredded in order to optimise Schema validation. They are tied to specific SQL Schema within a database.

Using typed XML introduces integrity checking and helps the performance of XQuery.

Accessing Data in XML

XML Data type columns can be indexed, and manipulated using XQuery and XML Data Manipulation Language (XML DML), which adds ‘Insert’, ‘delete’ and ‘replace’ to XQuery.

To make data-access more effective, XML in SQL Server can be indexed. To be indexed, the XML must be a column in a table that already has a primary key. The index can be over the document structure, or for the values of the elements.

The XML data type can be viewed or modified by a number of methods. One can determine whether a node exists, get its value, retrieve it as table-result of a query, or modify its value.

XML can be read by the XML parser into a ‘Document Object Model’ (DOM, see below) and then accessed programmatically via methods and properties, but it is not really a suitable server-side technology due to the overhead of parsing the document into the model.

Shredding XML

The process of converting XML data into a format that can be used by a relational database is called ‘Shredding”, or decomposition. One can either use the NODES method on an XML data type or, from a Document Object Model (DOM), use the OpenXML function. OpenXML is retained in SQL 2005, but the NODES method is generally preferable because of its simplicity and performance.

Converting relational data to XML

XML fragments, or documents, can be produced from SQL Queries against relational tables, using the SELECT …For XML syntax. An inline XSD Format schema can be produced, and added to the beginning of the document. This is convenient but not covered by a W3C standard.

Converting XML to other formats

XML documents can be converted into other XML documents, or into formats such as HTML, using XSL Stylesheets (see below). These are themselves XML documents that provide a mixture of commands and text. It is applied to an XML document by processing it via a parser.

Querying XML Documents

XQuery

XQuery, derived in part from SQL, is the dominant standard for querying XML data. It is a declarative, functional query language that operates on instances of the XQuery/XPath Data Model (XDM) to query your XML, using a “tree-like” logical representation of the XML. With XQuery you can run queries against variables and columns of the XML data type using the latter’s associated methods.

XQuery has been around for a while. It evolved from an XML query language called Quilt, which in turn was derived from XML Path Language (XPath) version 1.0, SQL, and XQL.

XQuery has similarities with SQL, but is by no means the same – SQL is a more complete language. The SELECT statement is similar to XQuery’s language, but XQuery has to deal with a more complex data model.

The XQuery specification currently contains syntax and semantics for querying, but not for modifying XML documents, but these are made-good by extensions to XQuery, collectively called the XML Data Manipulation Language (XML DML). This allows you to modify the contents of the XML document. With XML DML one can insert child or sibling nodes into a document, delete one or more nodes, or replace values in nodes.

Microsoft thoughtfully provided extensions that allow T-SQL variables and columns to be used to bind relational data inside XML data. Server 2005 adds three keywords: insert, update, and delete. Each of these are used within the modify() method of the XML data type.

The XDM that XQuery uses is unlike the Document Object Model (DOM). Each branch (or “node”) of the XDM tree maintains a set of attributes describing the node. In the tree, each node has an XML node type, XDM data type information, node content (string and typed representations), parent/child information, and possibly some other information specific to the node type.

FLWOR

XQuery’s FLWOR expressions (For, Let, Where, Order by, and Return) iterates XML nodes using the for clause, limits the results using the where clause, sorts the results using the order by clause, and returns the results via the return clause. These constructs greatly extend the versatility of XQuery to make it comparable to SQL

XPath

XPath was designed to navigate an XML document to retrieve the documents elements and attributes. It also provides basic facilities for manipulation of strings, numbers and Booleans. It represents the document as a tree of nodes, and allows reference to nodes by absolute or relative paths. One can specify criteria for the nodes that are returned in square brackets.

XML Template Queries

An XML template query is an XML document with one or more TSQL or XPath queries embedded in it, allowing you to query an XML document. The results can be transformed with an XSLT stylesheet. Template queries are used in client code to update SQL Server data. They are templates with attributes and elements that specify the data that requires updating and how it is to be updated.

UpdateGram

An UpdateGram is an XML template that is used to insert, update or delete data in a database. It contains an image of the data before and after the required modification. It is usually transmitted to the server by a client application. Each element usually represents one record in a table. The data is ‘mapped’ either implicitly or explicitly. One can pass parameters to them.

DiffGram

This is an XML document format that is used to synchronise offline changes in data with a database server. It is very similar to an UpdateGram, but is less complex. It is generally used for ‘persisting’ the data in data objects.

Transforming XML data

XSL

XSL is a stylesheet language for XML that is used to transform an XML document into a different format. It includes XSLT, and also an XML vocabulary for specifying formatting. (XSL-FO) XSL specifies the styling of an XML, to describe how an XML document is transformed into another document.

Although the resulting document is often HTML, one can transform an XML document into formats such as Text, CSV, RTF, TeX or Postscript. An application designer would use an XSL stylesheet to turn structured content into a presentable rendition of a layout; they can use XSL to specify how the source content should be styled, laid out, and paginated onto some presentation medium. This may not necessarily be a screen display but might be a hand-held device, a set of printed pages in a catalogue, price-list, directory, report, pamphlet, or book.

XSLT

XSLT (XSL Transformations), a language for transforming XML documents into other XML documents, Is an intrinsic part of XSL. XSLT and XSL are often referred-to as if they were synonymous. However, XSLis the combination of XSLT and XSL-FO ( the XSL Formatting Objects).

The Document Object Model

The Document Object Model (DOM) is a platform- and language-neutral interface to enable programs and scripts to dynamically access and update the content, structure and style of XML documents.

XML represents data in a tree structure. Any parser will try to convert the flat text-stream representation of an XML or HTML document into a structured model. The Document Object model provides a standardised way of accessing data from XML, to query it with XPath/XQuery and manipulate it as an object. This makes it a great deal easier for application languages to read or manipulate the data, using methods and objects

The DOM defines the logical structure of the documents, and the way they can be accessed. It provides a programming interface for XML documents

SQL Server’s OpenXML function actually uses a DOM, previously created using the sp_xml_prepareDocument stored procedure. This function is a ‘shredder’ that then provides rowsets from the DOM.

XML Web Services

SQL Server 2005 will support web services based on SOAP. SOAP is a lightweight, stateless, one-way message protocol for exchange of information in a decentralized, distributed environment. SQL Server’s support makes it much easier for SQL Server to participate in systems based on Unix, Linux or mobile devices.

XML Web services can be placed in the database tier, making SQL Server an HTTP listener. This provides a new type of data access capability for applications that are centralized around Web services, utilizing the lightweight Web server, HTTPSYS that is now in the operating system, without Internet Information Services (IIS). SOAP can potentially be used in combination with a variety of other protocols other than HTTP but the HTTP-based service is the only one in current use;

SQL Server exposes a Web service interface, to allow execution of SQL statements and invocation of functions and procedures. Query results are returned in XML format and can take advantage of the Web services infrastructure of Visual Studio. Web service methods can be called from a .NET application almost like any other method.

A web service is created by:

  • Establishing an HTTP endpoint on the SQL Server instance, to configure SQL Server to listen on a particular port for HTTP requests.
  • Exposing Stored procedures or user-defined functions as Web Methods
  • Creating the WSDL

The web services can include SQL batches of ad-hoc queries separated by semicolons.

Glossary

Character entities
these are certain characters that are represented by multicharacter codes so as not to conflict with the markup.
Infoset
This is an XML document that represents a data structure and is associated with a schema.
Namespace
Namespaces are designed to prevent clashes between data items that have the same name but in different data structures. A ‘name’, for example, may have different meanings in different part of a data map. Namespaces are generally defined in XML Schemas. Elements in an XML document can be prefixed to attributes. SOAP Namespaces are part of SOAP messages and WSDL files
RSS
an RDF vocabulary used for site summaries.
SGML
This is the standard Generalised Markup Language. HTML and XML are applications of SGML
WSDL
Web Services Description Language (WSDL) an XML format for describing network services as a set of endpoints operating on messages containing either document-oriented or procedure-oriented information
XDM
The Data model used by Xquery to shred XML documents
XDR
XML-Data reduced, a subset of the XML-Data schema method.
XHTML
a language for rendering web pages, basically HTML that conforms to general XML rules, and can be processed as an XML document.
XML
XML is an acronym for Extensible Markup Language and is a language that is used to describe data and how it should be displayed.
XML Schema
An XML Schema is an XML document that describes a data structure and metadata rather than the data itself
XQuery
This a query language designed to be used to query XML data in much the same way that SQL is used but appropriate to the complex data structures possible in XML documents
XSD
A schema-definition vocabulary, used in XML Schemaa
XSL
A transformation language for XML documents: XSLT. Originally intended to perform complex styling operations, like the generation of tables of contents and indexes, it is now used as a general purpose XML processing language. XSLT is thus widely used for purposes other than XSL, like generating HTML web pages from XML data.
Well-formed XML document
A well-formed XML document is properly formatted in that the syntax is correct and tags match and nest properly. It does not mean that the data within the document is valid or conforms to the data definition in the relevant XML Schema
XML fragment
This is well-formed XML. that does not contain a root element
XQuery
an XML Query language, geared to hierarchical data

.

Happy reading

This article leads on to the XML Jumpstart Cribsheet which has practical examples