SQLXML Bulk Loader Basics

SQLXML isn't exactly new technology, but like the even more venerable BCP, it remains the quickest and most reliable way of heaving large quantities of data into SQL Server databases. SQLXML is very versatile, and once set up is wonderfully reliable ETL system, but isn't trivial to learn. Adam Aspin comes to the rescue with a simple guide.

At some point in their careers, most SQL Server developers will have faced the challenge of getting the data from an XML source file into SQL Server. Despite the ubiquity of the CSV format, and the rising star of JSON, a vast quantity of data is still exchanged as XML files.

For bulk loading XML data, the standard tools in the Extract-Transform-Load (ETL) toolkit tend to be the venerable OPENXML, or slightly more modern OPENROWSET/XQUERY, for the T-SQL purist, or for SSIS devotees the XML task.

Any of these solutions work well enough for small-to-medium sized files and even quite complex XML structures, but they often don’t scale to larger loads. Anything over 2GB will crash OPENXML or SSIS in my experience. In some cases, even if they can load a large file, the time taken could prove a threat to your SLAs.

In these circumstances, SQLXML Bulk Loader is your friend. It can import extremely large XML source data files into SQL Server at amazing speed, while preserving referential integrity, if required. Unfortunately, it’s not installed with SQL Server, by default, and it’s poorly explained in the SQL Server documentation. As a result, most developers either ignore it completely, or implement it poorly and it is unfairly perceived as “hard to use”.

This article attempts to put the record straight. It will describe briefly how to find and install SQLXML Bulk Loader and then how to use it to load three “classic” XML file structures:

  1. A simple XML format
  2. Multiple tables stored as XML
  3. A nested, moderately complex XML structure that maps to a relational table structure

To work through the examples, you’ll need to download the sample XML files from the “article rating” box at the top of this article. I’ve provided the kind of “semi-structured” source files that SQLXML Bulk Loader can digest with ease. You’ll also need to create a database in which to practice. In the examples, I’ve assumed you’ve created a directory called C:\SQLXMLArticle for the files, and created a database called CarSales_Staging. All techniques described will work with all versions of the database from 2005 through 2014.

In a subsequent article, I will explain some of the more advanced features of SQLXML Bulk Loader, from handling attribute-centric XML to optimizing the process for bulk loads via validating the XML and integrating SQLXML Bulk Loader with SSIS, and other tricks and techniques to help you get the most out of this tool.

SQLXML Bulk Loader Use Cases

SQLXML Bulk Loader is best used in the following situations:

  • When the source XML file is large. What is large? Well, if you were planning on using OPENXML, then that means any file over 2 gigabytes. The same upper limit is true of XML loaded into a variable and shredded using XQuery. With SSIS, it depends on the memory available, but I have experienced practical difficulties way below the 2 GB limit. In contrast, I have loaded XML files of tens of gigabytes using SQLXML Bulk Load.
  • When the XML source structure is relatively simple. In reality, this means a file consisting of tables and fields where the nested XML structure is not too complex. The source data cannot be too intricate, in XML terms, or it will not load. It is not for nothing that this is called “semi-structured” XML data.
  • When you wish to load multiple tables from the same source file.
  • When speed loading the data is important. In my tests, SQLXML Bulk Load loaded data at about 90 percent of native BCP speeds, for separate tables without relational links.

Be aware from the start that SQLXML Bulk Loader can only handle a certain level of complexity in the source file. Should you be faced with a file that SQLXML Bulk Loader refuses to load, your only solution will be one of the following:

  • Negotiate a simpler structure from the data supplier
  • Use one of the other available technologies mentioned above
  • Use XSLT to simplify or ‘flatten’ the file structure, to produce a file amenable to processing by SQLXML. XML transformation is a vast and completely separate topic and one that I will not be explaining it in this article

Preparing SQLXML Bulk Loader

SQLXML 4.0 was installed by default as far as SQL Server 2005 but from SQL Server 2008 onwards, it has to be installed as part of the Feature Pack, for SQL Server 2008, or downloaded separately. You can obtain version 4.0 SP1 at the following URL:

http://www.microsoft.com/en-gb/download/details.aspx?id=30403

SQLXML Bulk Loader exists in both 32- and 64-bit versions (there is even a 64-bit IA version, but I expect that few people will be needing this). The installation is simple and I will not waste your time explaining it here. Once complete, you will be ready to start using SQLXML Bulk Loader to load XML files into SQL Server.

Importing an XML File into SQL Server

SQLXML Bulk Loader is a COM (Component Object Model) object that allows you to load semi-structured XML data into SQL Server tables. To make it work, you will need the following:

  • An XML source file.
  • A destination table in a SQL Server database.
  • An XSD schema file that maps the XML to the destination table – this is the interesting part of the process
  • A Visual Basic Script file to execute the XML load process.

Listing 1 shows the XML source, which in this simple example will be used to “bulk import” 2 rows in a SQL Server table. Throughout the article, the code listing caption provides the name of the relevant file in the code download bundle.

Listing 1: Clients_Simple.Xml

An XML document is checked for being well-formed, but it is not validated. If the XML document is not well-formed, processing is cancelled.

This example uses XML elements, but attribute-centric XML source data can also be loaded with SQLXML Bulk Loader. I give a short example of this in a subsequent article.

Inline schemas are not supported
If you have an inline schema in the
source XML document, SQLXML will ignore it.

Listing 2 creates the SQL Server table, the destination for this sample data load.

Listing 2: Client_XMLBulkLoad.sql

Listing 3 shows the XSD schema file that maps the source XML to the destination table. I will explain its workings once the data is loaded.

Listing 3: SQLXMLBulkLoadImport_Simple.xsd

Finally, Listing 4 shows the VBScript file used to invoke SQLXML Bulk Load and load the data.

Listing 4: SQLXMLBulkload.vbs

…or in PowerShell …

Listing 5: SQLXMLBulkload.ps1

If the .xml and .xsd files are in the same directory as the .vbs file then you will not need to add the full paths after the .Execute command in the VBScript file. However it is probably a good habit to use full path details, as a matter of principle.

Double-click the SQLXMLBulkload.vbs file to run the bulk load. If all goes well, you should be able to query the Client_XMLBulkLoad table and see the data from the XML source file correctly loaded.

A Closer Look at the Schema File

The schema (XSD) file contains a few extra tidbits, over and above the Microsoft mapping schema, that allow it to perform its job so efficiently. Essentially, it extends the basic schema with the attributes that allow SQLXML to channel the source data into the correct tables and fields.

Perhaps the easiest way to understand what is special about the XSD that SQLXML bulk loader uses is to compare it with a “plain vanilla” XSD for the same XML source file. Listing 6 shows the XSD without the extensions that SQLXML bulk loader needs.

Listing 6: Vanilla.XSD file

As you can see this version is much simpler. So what was added to the basic schema to make the data load work? Only four annotations were necessary to apply all the required constraints on the data. They are:

  • xmlns:sql = "urn:schemas-microsoft-com:mapping-schema" – This extension to the namespace specifies how the schema is extended.
  • Sql:relation – This extension indicates to SQLXML the table into which the data is to be loaded for this XML element. As you can see from Listing 3, this is added to the <Client> element to indicate that this particular element maps to the Client_XMLBulkLoad table in the SQL Server database.
  • sql:field – This extension indicates to SQLXML the field in the destination table into which the data is loaded. This is added to each of the nested elements in the <Client> element to indicate which field the XML element is mapped to in the Client_XMLBulkLoad table in SQL Server
  • sql:is-constant = "1" – This extension tells SQLXML that the <CarSales> element is a container and not a data element – and so does not need to be loaded into a data table. This annotation has to be added to all “parent container” elements.

Getting the schema file right is the hard part of using SQLXML bulk loader. You will probably spend more time on crafting this file than with any other part of the SQLXML bulk load process. Consequently it is worth ensuring that you have understood the XSD extensions before attempting a complex data load. You may even find that practicing on a simple XML file to start with can reap dividends. You can then move on to tackling a large and more complex XML source file once you are confident that you have mastered the basics.

However, you do not need to hand-craft the entire schema file from scratch. There are many tools available that can analyze an XML file and produce a basic XSD file that you can then extend by adding the Microsoft mapping extensions. One tool that is available with any Visual Studio installation is xsd.exe. It will take an XML source file and return a schema file in all but the most arcane cases. You can find out more about this at: XML Schema Definition Tool (Xsd.exe). If this tool does not meet your needs, then you can try a PowerShell alternative that is described at: Loading Data With PowerShell.

How the VBScript File Works

The .vbs file that invokes SQLXML Bulk Loader can be extremely simple, and only needs the following items:

  • The object creation statement – to invoke the SQLXML Bulk Load COM object.
  • A standard OLEDB connection string, containing, at a minimum:
    • The server and, if required, instance name (Data Source in this example).
    • The destination database name (CarSales_Staging in this example).
    • The SQL Server security information (Windows integrated security or SQL Server security).
  • The Execute command, which provides paths to the XSD and XML files – in that order.
  • An ErrorLog file. While not compulsory, this is invaluable for debugging the process.
  • A SET command to dispose of the COM object.

This simple VB script assumes that the destination table already exists (although you can create the table as part of the load as you can see in a companion article). It also presumes integrated security. It is interesting that, somewhat counter-intuitively, the XML schema (.xsd) file is passed as the first parameter to the XML Bulk Loader. The XML file itself is passed in as the second parameter.

Troubleshooting

Two things will tell you that the XML data load has worked:

  • There is no ErrorLog file, or if the old ErrorLog file is removed (assuming that you have requested a log file)
  • The fact that the data loaded correctly into the destination table(s)

Should the data not load, then your first port of call is the ErrorLog file that SQLXML Bulk Load created, assuming that you added the objBL.ErrorLogFile = "Log File and path" parameter. Fortunately, these error log files are very explicit, and always prove to be an invaluable source of debugging information if you encounter load issues.

There are a few precautions that you can take to prevent persistent recourse to the error log file becoming a way of life. Nearly all your attention should probably focus on the schema file. In my experience, this is the most common source of problems, because it generally involves a certain amount of hand-crafting. Potential problems include:

  • Making sure that there are no spaces immediately inside all the double-quotes that are used in the xsd:element definition.
  • Ensuring that you do not inadvertently close elements such as xsd:sequence and xsd:complextype – or even some of the xsd:element definitions.
  • Respecting case-sensitivity. So in the preceding example, having an element called “Country”, and a schema mapping element named “country” would cause the process to fail.

An interesting “feature” concerns the VBScript file. You have to be sure that you do not leave any spaces in the database=DatabaseName part of the script file. Spaces here, even if they are allowed elsewhere in the connection string, will cause a cryptic error message complaining that the table object cannot be found.

Loading Multiple Tables at Once from a Single XML Source File

One advantage that an XML file can have over a CSV file is the capacity to store multiple unrelated tables in a single source file. There will doubtless be occasions when you will want to load data from an XML source file that contains data for several separate tables. With the SQLXML Bulk Load executable and a suitably-crafted XSD file, you can load data into multiple tables from a single XML file in a single process. SQLXML Bulk Loader allows you, in effect, to transfer an entire subset of data in a single file, ready to be loaded simultaneously into multiple tables in a destination database. The data load can also be blindingly fast.

This technique is predicated on the fact that the source XML is very simple in layout. It also presumes that you are essentially loading data from a “table dump” in an XML file, where each table is stored sequentially in the XML file. You are also nearly always best advised to drop any indexes for the duration of the load, and recreate them afterwards when loading multiple tables. Moreover it is also best to disable foreign key constraints during the data load and reapply them afterwards.

To emulate the “table dump” scenario, the data file used in this example, shown in Listing 7, contains two tables in XML format, Invoice and Invoice_Lines.

Listing 7: SQLXMLSourceDataMultipleTables.xml

The file only has one record for Invoice and two for Invoice_Lines, but in reality there could be dozens of tables, each containing millions of records. Note that there is no presumption in the XML of any relational link between the Invoice and Invoice_Lines elements.

Listing 8 shows the XSD file that maps the source XML data to two tables in a SQL Server database, which are also called Invoice and Invoice_Lines.

Listing 8: SQLXMLBulkLoadImportMultipleTables.xsd

This schema file uses the same extensions as the previous one that you saw, only this time there are two “tables” of data – each mapped using the sql:relation annotation. Each of the XML elements containing data inside the XML container element is mapped to the corresponding database field with the sql:field annotation.

As you can see, each XML element that maps to a table is a separate XML node, and the order of the elements is unimportant. What is important, however, is to have a single root node, just as you would for a conventionally well-formed XML file.

The only potential issue with this approach is that the XSD file can become extremely complex if you have to handle dozens of tables where each one can have potentially hundreds of columns. It is not that the XSD file is necessarily difficult to create, just arduous. So arm yourself with patience, and be prepared for some repetitive testing.

The penultimate stage is to create the destination tables, as shown in Listing 8.

Listing 9: tblInvoiceMulti.Sql

Once the destination tables exist, you can create and execute a VBScript file just as you did before. Here is the script to use.

Listing 10: SQLXMLBulkloadMulti.vbs

… or in PowerShell

Listing 11: SQLXMLBulkloadMulti.ps1

Once again the source data is loaded into the destination database, this time loading multiple tables simultaneously, from a single file in a single operation.

There is nothing to stop you from enforcing referential integrity through the addition of primary and foreign key constraints once the load has finished, so this technique can become a method for transferring data between databases. Certainly, it is not as simple as a backup and restore, but it can prove useful in certain circumstances, such as when:

  • You need to transfer data between multiple versions of SQL server, and frequently into older versions of the database.
  • The environment that you are using does not allow for direct connections between databases.
  • CSV files prove unreliable for the data that you are using.
  • You need to use the same data for loading into multiple different databases (Oracle, MySQL or DB2 for instance) – and XML is the most appropriate “Lingua Franca”.
  • You need a rapid and reliable data load process.

On another note, it is worth remarking that the XML data file can contain more “tables” and “records” than are mapped in the schema file. However, only those entities that are specified in the XSD file will be loaded.

Loading and Shredding Relational Tables from an XML Source File

XML is nothing if not infinitely varied. Frequently, the XML source file that will be more intricate than the simple “table dump” structure of the previous example. Often, the data that makes up several tables is presented in a hierarchical way inside a single XML element.

Listing 12 illustrates a simple version of this nested XML pattern.

Listing 12: SQLXMLSourceDataRelatedTables.xml

Although this file only contains a small amount of data it maps to a classic Invoice header / Invoice Item pattern, where the <Invoice> element contains the <Invoice_Lines> element. We can load this kind of data directly into a relational table structure using SQLXML Bulk Loader.

Drop the two tables used in the previous example and re-create them with a FOREIGN KEY constraint, as shown in Listing 11.

Listing 13: LoadRelationalXML.sql

Create an appropriate XML schema, as shown in Listing 12 (I’ll explain this file in a little more detail shortly).

Listing 14: SQLXMLBulkLoadImportReferential.xsd

Modify the objBL.Execute statement in the existing SQLXMLBulkloadMulti.vbs file so that it refers to the new .XML and .XSD files, as shown in Listing 13. Save this file as SQLXMLBulkloadNested.vbs.

Listing 14: SQLXMLBulkloadNested.vbs (extract)

This example builds on the approach from the previous example. However, it has the following prerequisites:

  • Destination tables with PRIMARY KEY to FOREIGN KEY relationships. This is why I suggested that you re-create the destination tables, as the full DDL makes the key relationships clearer.
  • Source data where the “parent-child” data relationship is directly expressed in the XML structure. This is the case in the source data where the <Invoice> element contains one or more <Invoice_Lines> elements.

More generally, if the source XML is hierarchically structured to map to a relational source, then SQLXML Bulk Load can shred the data into SQL Server tables while maintaining foreign key relationships.

The Schema File for Nested XML Structures

Here, as was the case in the previous example, both the key to success and any potential complexity reside in the accurate definition of the XSD file. Indeed, everything that is essential to a successful load is found in the schema file:

  • Firstly, note the <xsd:annotation> and <xsd:appinfo> elements at the start of the schema file. The foreign key mapping is specified inside these elements. It must be present – and correctly defined – for the load to work.
  • Inside these elements, you have the sql:relationship name = "InvoiceToLine" element. This identifies each relationship between tables as part of the appinfo, as it will then be used as part of the sql:element definition for a “table” in the schema file.

The relationship definitions are as follows:

  1. parent = "Invoice": The “parent” table.
  2. parent-key = "ID": The primary key column of the “parent” table.
  3. child = "Invoice_Lines": The “child” table.
  4. child-key = " InvoiceID ": The foreign key column of the “child” table.
  5. sql:relationship = "InvoiceToLine": This attribute Inside the XML element, which contains the child elements specifies the relationship to use that was defined previously in the schema file.

As you can see from this information, the ID field is now considered as a PRIMARY KEY in the Invoice table, and the InvoiceID is taken to be the FOREIGN KEY in the Invoice_Lines table. Executing the .vbs script loads the data into the two tables, while maintaining the ID as the FOREIGN KEY constraint for the Invoice_Lines.

You also need to know that SQLXML Bulk Load generates records as their nodes enter into scope, and sends those records to SQL Server as their nodes exit scope. This means that all the data for the record must be present within the scope of the node. Consequently, the data for the “child” table cannot precede the data for the “parent” table in the source file. To make the point in another way, if you have a mapping schema that defines a PRIMARY KEY / FOREIGN KEY relationship between two tables (such as between Invoice and Invoice_Lines), the table with the PRIMARY KEY must be described first in the schema. The table with the FOREIGN KEY column must appear later in the schema.

Performance

An important practical point about loading XML data using this particular design pattern is speed, or rather lack of it. You will never attain the turbo-charged load times with nested XML that you can attain with the “table dump” approach that you saw previously. This is not to say that you cannot load gigabytes of data in reasonable timescales, just that you will never break any speed limits using this approach.

Equally, there are limits to the complexity of the XML source files that SQLXML Bulk Loader can load gracefully (or at all, in some cases). It is well-nigh impossible to predict exactly when you will hit a ceiling on the data ingestion capacities of this tool, but you do need to be aware that there are files that appear to be too intricate for it to handle.

Conclusion

Welcome to the world of SQLXML Bulk Loader! It’s a valuable tool that has been available for many years and yet is still underused, in my opinion. You have seen how to load multiple types of XML data files, from the simple table, via multiple separate tables through to more complex, nested XML structures. While this tool will never win any prizes for making developer’s lives easier, it certainly makes up for this in its raw power and speed. If you suspect that your XML files will ever grow to any appreciable size, then you could well be advised to get to know this versatile addition to your ETL toolkit sooner rather than later. It just could save your reputation.

Look out for the next installment, covering some of the more advanced features of SQLXML Bulk Loader!

SQL Server 2012 Data Integration Recipes
If you like what you read in this article and want to learn more about data integration using SQL Server, then please take a look at my book, SQL Server 2012 Data Integration Recipes (Apress, December 2012)